Infrared and Harsh Light Visible Image Fusion Using an Environmental Light Perception Network

The complementary combination of emphasizing target objects in infrared images and rich texture details in visible images can effectively enhance the information entropy of fused images, thereby providing substantial assistance for downstream composite high-level vision tasks, such as nighttime vehicle intelligent driving. However, mainstream fusion algorithms lack specific research on the contradiction between the low information entropy and high pixel intensity of visible images under harsh light nighttime road environments. As a result, fusion algorithms that perform well in normal conditions can only produce low information entropy fusion images similar to the information distribution of visible images under harsh light interference. In response to these problems, we designed an image fusion network resilient to harsh light environment interference, incorporating entropy and information theory principles to enhance robustness and information retention. Specifically, an edge feature extraction module was designed to extract key edge features of salient targets to optimize fusion information entropy. Additionally, a harsh light environment aware (HLEA) module was proposed to avoid the decrease in fusion image quality caused by the contradiction between low information entropy and high pixel intensity based on the information distribution characteristics of harsh light visible images. Finally, an edge-guided hierarchical fusion (EGHF) module was designed to achieve robust feature fusion, minimizing irrelevant noise entropy and maximizing useful information entropy. Extensive experiments demonstrate that, compared to other advanced algorithms, the method proposed fusion results contain more useful information and have significant advantages in high-level vision tasks under harsh nighttime lighting conditions.


Introduction
Images from different modalities obtained by infrared and visible sensors describe and interpret the imaging scene from different perspectives.The imaging results of visible sensors have abundant texture details and excellent visual effects, but they are sensitive to light intensity, while infrared sensors can ignore illumination variations to produce images with prominent targets, though with blurred details and susceptibility to noise.By integrating the key information of both modalities guided by entropy metrics and information theory, the fusion results maximize useful information and minimize noise, greatly enhancing the information entropy and completeness of the scene description in the fused images.Thus, the fusion of infrared and visible images, which is able to enhance scene visual effects and provide substantial assistance for subsequent high-level tasks, has been widely applied in areas such as intelligent vehicle assistance driving [1], road safety monitoring [2], and modern satellite remote sensing [3].
In visible-infrared image fusion research, numerous approaches have continually emerged.Successive improvement of methods has increased the effectiveness of fusion and Entropy 2024, 26, 696 2 of 32 demonstrated in high-level vision tasks.The methodologies for fusing infrared and visible images can be delineated into traditional methods [4] and deep learning approaches [5].Traditional techniques are specifically categorized as methods based on sparse dictionary training [6], methods based on multi-scale transformations [7], methods based on lowrank representation learning [8], methods based on segmentation or saliency [9], and hybrid modeling methods [10].In recent years, generalized neural network structures with good training results on huge datasets have been widely used for their excellent feature extraction capabilities that can be easily transformed into current fusion tasks to achieve superior fusion performance in contrast to traditional fusion algorithms.Deep learning methods based on deep learning are mainly categorized according to differences in network structures, such as autoencoder (AE) networks [11], generative adversarial networks (GANs) [12], convolutional neural networks (CNNs) [13], and transformers [14].
Although the current fusion results of deep neural networks are superior to traditional methods and overcome some drawbacks of traditional methods, such as the lack of targeted research on the distribution of visible image information under harsh light interference in scenarios like night roads, deep learning-based methods still struggle to achieve satisfactory results in both harsh light background and normal environments simultaneously.Specifically, first and foremost, most existing fusion methods are predicated on an antecedent assumption that the intensity of infrared image target pixels is prominent, and the gradients of visible images are replete with texture details, selectively extracting features from visible and infrared images.However, this assumption overlooks the importance of target boundary information in high-level vision tasks, such as object detection and semantic segmentation, leading to blurred target boundaries in fusion results, which impinge upon the efficacy of the fusion outcomes.Secondly, while mainstream advanced fusion methods can achieve good results in favorable environments, the contradiction between low information entropy and high pixel values in visible images under harsh light interference conditions results in significant target information being obscured by invalid harsh light information in the fused image, hindering the attainment of satisfactory fusion results.Finally, the majority of deep learning-based approaches leverage convolutional layers to extract a diverse array of features from images and subsequently feed them into the next level after rough concatenation.However, this approach overlooks the differences in feature levels, thereby diminishing the utilization of feature information.Consequently, during the fusion process, important features such as the target boundaries required for high-level vision tasks are partially lost, impinging upon the efficacy of fusion outcomes in ensuing high-level vision tasks.These challenges result in unique image fusion difficulties under nighttime harsh light conditions.
We designed a harsh light interference-resistant target edge information-guided feature fusion network to address the above issues.This network explores the potential of combining image information entropy and information distribution to guide fusion tasks, achieving high-performance fused images in both normal and harsh light environments, thus improving the outcomes of subsequent high-level vision tasks.The specific details of the network structure will be presented in Section 3, and the primary contributions of this manuscript are delineated as follows: • An infrared salient target edge feature extraction module that includes high-level semantic interactive fusion (HSIF) and edge feature enhance block (EFEB) is designed, permitting the network to more precisely attend to salient semantic information at the edges, thus avoiding the blurring of the structure and edge details of salient targets in the fusion results.• A harsh light environment aware (HLEA) module is designed.Based on the distribution of harsh light visible light image information, this module determines the feature fusion weights of the hierarchical feature fusion module to address the severe degradation of fusion performance under adverse lighting conditions due to low information entropy and high-intensity pixel value areas.
• An edge-guided hierarchical feature fusion (EGHF) module is designed.Combining target edge feature guidance with harsh light aware-weight allocation achieves efficient utilization of key semantic features under harsh light interference, avoiding the impact of blurred target boundaries and harsh light occlusion on ensuing high-level vision tasks in fusion results.
The residual portion of this manuscript is organized according to the following structure.In Section 2, relevant image fusion methods and key-related work are briefly described.Section 3 provides an in-depth discussion of our proposed fusion approach, covering aspects such as the network architecture, module design, and loss functions.In the subsequent section, we primarily elucidate the advantages of the fused images produced by our proposed network over those of other methods with regard to qualitative visual perception and quantitative nonparametric parameters under regular environments and challenging nighttime conditions with harsh light.In the final chapter of this paper, Section 5, we summarize the content of our research.

Related Work
In this chapter, we commence by surveying extant algorithms for the fusion of infrared and visible images.We then offer a concise overview of the relevant foundational knowledge involved in the harsh light environment-aware network proposed in this paper.

Traditional Image Fusion Methods
Traditional image fusion methodologies typically encompass three steps: image transformation, coefficient amalgamation, and image inverse transformation.In image fusion research, the most dynamic domain is centered on multi-scale transformation methods [15].Multi-scale decomposition-based image fusion breaks down images into different scale representations, such as pyramids [16], wavelet transforms [17], etc.The early fusion approaches rooted in pyramid transformation were proposed by Burt et al. [18], which continuously filter the source image, with large scales at the bottom, gradually decreasing upwards, to obtain a structure similar to a pyramid, and after fusion, the final image with source image information is obtained, but the decomposition process is irreversible.In contrast, methods based on wavelet transforms can relatively easily capture both the structural features and intricate details from the source image and can effectively retain information throughout the decomposition process.Jose et al. [19] proposed an NSST multi-scale image fusion method that inherits the wavelet transform concept, decomposing images into subbands of different scales in multiple directions to capture features at various directional scales.This process avoids downsampling, thereby preventing resolution loss and better preserving image details.However, it faces challenges in expressing image boundaries and feature information.To address edge information retention, Shreyamsha et al. [20] introduced a fusion method called CBF.This method employs cross-bilateral filtering between source images to maintain the edge features of bimodal images while suppressing noise and other unnecessary details.Ma et al. [9] developed a gradient transfer method for image fusion.By computing gradients, extracting salient gradients, and integrating Poisson gradient fields, this method reconstructs fused images to retain more details and edge information.Zuo et al. [21] used filters that preserve edges, like guided filtering and bilateral filtering, to enrich the texture information of fusion images.In addition, Vaish et al. [22] used multi-resolution singular value decomposition (MSVD) technology to identify important and unimportant details, and the fundamental information was fused based on the absolute maximum value rule for critical information while employing sparse representation for less essential details, which is suitable for maintaining the intricate image details and edge delineation.Li et al. [23] designed a model using adaptive sparse representation.By learning a more compact dictionary to replace the initially redundant dictionary, pseudo shadows and distortions in the fusion image were effectively reduced while preserving image details and structures.Zhou et al. [24] presented a methodology for fusing infrared and visible images that employ the recognition of prominent objects, which guides the network to detect salient regions by introducing salient object masks, implicitly realizing salient object detection and key information fusion.While these traditional methods are generally simple and straightforward, their effectiveness is severely limited when tackling complex image fusion tasks.

Deep Learning Methods
A PCNN [25] is a pivotal aspect earliest proposed deep learning-based image fusion method, which utilizes deep learning models to acquire the feature representations and fusion rules of images through large-scale data training, enabling the learning of higherlevel feature expressions from data and achieving end-to-end image fusion.Recently, deep learning-based advancements in image fusion methods have undergone rapid development, and according to the features and guiding rules of the algorithms, they can be classified into four distinct categories: approaches utilizing AE, methods based on GANs, methods based on CNNs, and transformer-based techniques.
Autoencoder-based methods typically involve three steps: first, pre-training the autoencoder with source images to extract their features; then, combining various source image features using fusion methods; and finally, reconstructing the images using the trained decoder.Prabhakar et al. [26] introduced an innovative unsupervised deep learning fusion architecture, named DeepFuse, designed for fusing static multiple-exposure images.However, the computation only utilized the output of the last layer, resulting in information loss.Addressing this, Li et al. [27] replaced the convolutions in DeepFuse with 3 × 3 convolutional kernels and improved the convolutional network by replacing it with a Dense-Block, proposing the groundbreaking DenseFuse network architecture.Building on the success of the DenseFuse dense connection structure, Li et al. [28] proposed NestFuse, which uses nested connections instead of the original dense connections to constrain the impact of semantic gaps.Additionally, NestFuse includes attention mechanism-based fusion rules to retain more effective information.Zhao et al. [29] were the first to propose an image fusion network where both decomposition and fusion are completed using AEs.The encoder decomposes the image into background feature maps and detail feature maps, which are then fused separately before being restored to the original image by the decoder.Leveraging the powerful feature extraction and representation capabilities of autoencoders, the generated fusion images exhibit significant improvements in detail and information content.
CNN-based methods offer unparalleled advantages in image feature extraction.The fundamental concept behind CNN-based methods is to train a CNN model to generate a fused outcome by training to discern the mapping relationship between input images.Li et al. [30] utilized a framework for deep learning to extract attributes from visible as well as infrared images.However, they encountered issues with the VGG-19 network's structural deficiencies, resulting in the loss of certain valuable information during feature extraction.Addressing this, Li et al. [31] introduced a fusion architecture that utilizes zero-phase components along with deep features analysis as its foundation.Yang et al. [32] introduced a method for fusing visible and infrared images using latent low-rank representation in conjunction with convolutional neural networks, avoiding issues such as information loss, insufficient imaging quality, and complex structures.Xu et al. [33] introduced a method called DRF, which employs convolutional networks to perform disentangled representations for image fusion.The method decomposes the image information sources according to imaging principles, extracts scene and sensor modality feature representations via convolutional networks, and then fuses these representations using various strategies.Finally, the fused image is generated through a pre-trained generator.This method significantly improves the accuracy and interpretability of the fusion process.
GAN-based methods are an artificial intelligence algorithm used for generative modeling, consisting of a discriminator and a generator.The generator's duty lies in producing synthetic images, whereas the discriminator assesses the genuineness or authenticity of those images.During training, the discriminator and generator engage in adversarial training, competing against each other, aiming to make the generated images as realistic as possible.Ma et al. [12] innovatively applied GANs to the infrared and visible image fusion method FusionGAN.Unlike traditional methods that depend on manually designed fusion rules, FusionGAN uses adversarial training to automatically learn how to fuse images, significantly reducing reliance on manual expertise.However, FusionGAN's dependence on feedback solely from the discriminator may lead to issues with model convergence or unstable fusion image quality.GANMcC [34], which is based on FusionGAN, improves training stability by introducing multiple constraints, reducing the difficulty of convergence, and allowing the model to produce more consistent and reliable fusion results.Recently, GAN-based fusion methods have attracted more attention.Li et al. [35] introduced an unsupervised multi-domain image translation approach using structural consistency generative adversarial networks (SCGAN) to achieve fusion and transformation between images.Xu et al. [36] discussed the use of GANs for image fusion, including applications such as multi-modal image fusion and image restoration.
Transformer is an architecture that significantly improves the performance of deep learning translation models by using attention mechanisms.In 2021, Zhao et al. [37] applied a transformer to the image fusion discipline.Wang et al. [14] used a transformer-based encoder to extract global features and designed SwinFuse in an autoencoder-based framework.
Compared with traditional methods, deep learning could learn by extracting informative features from an extensive training dataset and has made significant strides in visual recognition, but it may reduce the quality of fusion under poor light interference, and the need for high-level vision tasks may be ignored.

Image Fusion Resilient to Interference
Some researchers have recognized that infrared and visible image fusion algorithms may drastically degrade the output quality under challenging adverse environmental conditions.Therefore, scholars have conducted targeted research on the interferences that may exist in fusion application environments, hoping to enhance the robustness of fusion networks and significantly bolster the performance of fusion results.Low-light scene image fusion initially received widespread attention [4].Due to the low contrast and limited texture and color information fusion of visible images in low-light conditions, the visual effects of the fused images are unsatisfactory.Zhang et al. [38] designed an EV-Fusion network, which achieves the fusion of visible image information with infrared images through two stages of enhancing and fusing images, thus enhancing the image fusion effect in low-light environments.The EV-Fusion network considers the interference of low-light environments but neglects the interference of illumination changes on the fusion results.Considering this issue, Tang et al. [39] designed a sub-network for daytime and nighttime classification in PIAFusion.By calculating the weighting of the visible and infrared components in the loss function based on classification probabilities, differentiated network parameter updates were achieved for different illumination environments, thereby mitigating the influence of illumination changes on the outcomes of fusion.Furthermore, besides considering the interference of different illuminations, many researchers also consider uniformly considering all interferences to enhance the resistance of fusion networks to all possible disturbances that may degrade sensor imaging results.Luo et al. [40] introduced a fusion rule grounded on the richness of image information, aiming to preserve more meaningful content from input images, prevent highly prominent but contextually unimportant information in visible from overshadowing key information in infrared images, and achieve a hierarchical presentation of key information in images.Unfortunately, manually designed fusion rules are overly complex and poorly adapted to various actual interferences.Rao et al. [41] proposed an AT-GAN with an adverse environment estimation module.Using the evaluation results, the weights of the generator's content loss function were automatically modified to complete network training.This approach enhances fusion effects in a more flexible and balanced manner, mitigating, to some extent, the impact of severe degradation on fusion outcomes under challenging conditions.However, the authors did not provide targeted ablation experiments in the article to showcase the efficacy of this design.
The above fusion networks designed to resist various interferences often fail to achieve high-quality fusion in harsh light environments due to the failure of the fusion rules.While some fusion methods have considered disturbances from different illuminations, such as low-light environment interference and illumination environment transformation interference, the absence of targeted research on harsh light environment interference also prohibits the achievement of fusion networks resistant to harsh light interference.Therefore, in our work, we propose a harsh light environment-aware module to achieve high-quality image fusion under challenging harsh light conditions.Specifically, we first estimate the illumination conditions of the environment and extract an illumination map using illumination classification and illumination decomposition sub-networks.We then determine the feature fusion weights in network fusion modules by the illumination probability and illumination map output by the sub-network and evaluate the quality level of the input images.This approach leads to the network incorporating more semantic information, rather than simple salient information, into the generated fused results, enhancing the scene interpretability and object recognizability, thus avoiding interference from irrelevant harsh light information.

Proposed Method
In this section, we initially present the overall structure of our network model.Then, we provide detailed explanations of the three core components of the network model: edge feature extraction, harsh light environment perception, and hierarchical feature fusion.Finally, we expound on the loss function used for network training.

Overview of Network Architecture
To accomplish image fusion under harsh light interference, avoiding the occlusion of thermal target information by harsh light while improving the fusion efficacy results in subsequent high-level vision tasks, we designed a novel network for visible and infrared image fusion.The specific framework of the fusion network is depicted in Figure 1.The network backbone consists of four parts: multi-modal feature extraction, edge feature extraction, harsh light environment awareness, and image decoding and reconstruction.
Entropy 2024, 26, x FOR PEER REVIEW 6 of 33 some extent, the impact of severe degradation on fusion outcomes under challenging conditions.However, the authors did not provide targeted ablation experiments in the article to showcase the efficacy of this design.
The above fusion networks designed to resist various interferences often fail to achieve high-quality fusion in harsh light environments due to the failure of the fusion rules.While some fusion methods have considered disturbances from different illuminations, such as lowlight environment interference and illumination environment transformation interference, the absence of targeted research on harsh light environment interference also prohibits the achievement of fusion networks resistant to harsh light interference.Therefore, in our work, we propose a harsh light environment-aware module to achieve high-quality image fusion under challenging harsh light conditions.Specifically, we first estimate the illumination conditions of the environment and extract an illumination map using illumination classification and illumination decomposition sub-networks.We then determine the feature fusion weights in network fusion modules by the illumination probability and illumination map output by the sub-network and evaluate the quality level of the input images.This approach leads to the network incorporating more semantic information, rather than simple salient information, into the generated fused results, enhancing the scene interpretability and object recognizability, thus avoiding interference from irrelevant harsh light information.

Proposed Method
In this section, we initially present the overall structure of our network model.Then, we provide detailed explanations of the three core components of the network model: edge feature extraction, harsh light environment perception, and hierarchical feature fusion.Finally, we expound on the loss function used for network training.

Overview of Network Architecture
To accomplish image fusion under harsh light interference, avoiding the occlusion of thermal target information by harsh light while improving the fusion efficacy results in subsequent high-level vision tasks, we designed a novel network for visible and infrared image fusion.The specific framework of the fusion network is depicted in Figure 1.The network backbone consists of four parts: multi-modal feature extraction, edge feature extraction, harsh light environment awareness, and image decoding and reconstruction.(1) Edge Feature Extraction: Our model utilizes ResNet50 [42] pre-trained on the Ima-geNet dataset [43] for hierarchical preliminary feature extraction of infrared images.After removing the last part of the pooling layer and fully connected layer in the ResNet50 network, the remaining five convolutional groups are stacked as the backbone feature extractor.After passing through the backbone feature extractor, the features of five different levels output by the five convolutional groups are used as the preliminary feature representation of the infrared image.Additionally, the feature set extracted by the convolutional groups here is denoted as E = {E 1 , E 2 , E 3 , E 4 , E 5 }.The features representing high-level semantic features, E 4 and E 5 , are fused through HSIF to obtain semantic features representing the salient targets of the image, and then through the EFEB, different levels of shallow texture features, E 3 , E 2 , and E 1 , are progressively guided through hierarchical forward feeding to obtain three different levels of salient target edge features.
(2) Harsh Light Environment Awareness: We designed a HLEA module to cope with background harsh light interference.First, the pre-trained illuminance classification subnetwork and illuminance decomposition sub-network are used to predict the illuminance probability and illuminance map of visible images, and the infrared-visible feature weights are calculated by combining the two.Then, the weights are corrected using the Blind-Referenceless Image Spatial Quality Evaluator (BRI-SQUE) [44] to obtain the harsh light environment-aware weights.Here, the set of weights calculated by the HLEA is denoted as W = {W ir , W vi }.
(3) Image Feature Extraction: We devised a dual-branch encoding structure for feature extraction from source images.In each branch, three gradient residual dense blocks (GRDBs) [45] are employed as the main components to extract fine-grained and deep features from the input images.Here, the feature sets obtained by the dual-branch extraction for visible and infrared images are represented as vi , respectively.In the process of merging features extracted from visible and infrared sources, instead of simple concatenation fusion, inspired by the self-attention mechanism, we used edge features E i to guide fusion features, enabling the embedding of prominent target edge features into the fusion features for subsequent transmission.
(4) Image reconstruction: For the features derived from the dual-branch encoder in visible and infrared fusion, we use a decoder to complete the image reconstruction.The decoder is a single-branch structure, consisting of five convolutional layers, with the first four utilizing the Relu activation function, and the final convolutional layer employing the Tanh activation function.

Edge Feature Extraction
Image fusion, serving high-level vision tasks, should aim to incorporate more semantic information into the fusion image to enhance subsequent tasks' performance.However, as shown earlier [26,27,35], most fusion methods tend to focus on improving fusion result metrics, overlooking the importance of semantic information in the results for high-level vision.Specifically, the longstanding prior assumption that infrared images emphasize pixel intensity of significant objects while visible images have clear target texture details often leads to the neglect of edge texture features of infrared salient objects, which are crucial for high-level tasks.To address this, we have devised an edge feature extraction method tailored to infrared images that combines semantic and texture layer refinement, effectively extracting edge information of significant targets.

High-Level Semantic Feature Interactive Fusion (HSIF)
Residual networks trained on extensive image datasets have been proven to have excellent feature extraction capabilities for images [46].Therefore, we use ResNet50 to extract features from infrared images, obtaining features at different levels, where surface-level characteristics encapsulate intricate details about target edge textures, and deep features embody richer semantic content about target positions.For two consecutive deep features, their differences are much smaller than those between features from different hierarchical levels, indicating strong contextual semantic information.Therefore, we proposed the high-level semantic feature interactive fusion (HSIF) module to obtain semantic information about target positions in infrared images.The module's detailed structure is depicted in Figure 2.
Entropy 2024, 26, x FOR PEER REVIEW 8 of 33 differences are much smaller than those between features from different hierarchical levels, indicating strong contextual semantic information.Therefore, we proposed the high-level semantic feature interactive fusion (HSIF) module to obtain semantic information about target positions in infrared images.The module's detailed structure is depicted in Figure 2. In HSIF, the input features  ∈ ℝ × × and  ∈ ℝ × × , where  ,  ,  represent the dimensions of feature  , including the number of channels, height, and width, respectively, and  = 2 ,  = 2 .First, we upsample feature  to obtain feature  with the same size as  for subsequent spatial weight interaction.Then, we use attention methods to process  and  in the channel dimension; emphasis is placed on the pivotal feature channels.The specific process is described as follows: where, (•) represents the upsampling operation, obtaining the feature  ∈ ℝ × × .Moreover, "⊗" is element-wise multiplication, "⊕" is element-wise addition, (•) is global max pooling, (•) is global average pooling, (•) is a mul- tilayer perceptron, and (•) is the sigmoid normalization calculation.The obtained feature  ∈ ℝ × × has the same scale as the input feature.Subsequently, in spatial attention processing, based on the similarity between highlevel semantic features, we interactively process the feature information of  and  to obtain high-level features that have rich semantic information.Specifically, the spatial attention weights of features  and  are combined with neighboring features to achieve information interaction.The process is defined as follows: where (•) is a 3 × 3 convolution operation, and (•) is channel-wise concatenation operation.The processed feature  ∈ ℝ × × represents the feature after spatial attention processing,  ∈ ℝ × × represents the feature after information interaction, In HSIF, the input features , where C i , H i , W i represent the dimensions of feature E i , including the number of channels, height, and width, respectively, and H 4 = 2H 5 , W 4 = 2W 5 .First, we upsample feature E 5 to obtain feature E 5 ′ with the same size as E 4 for subsequent spatial weight interaction.Then, we use attention methods to process E 4 and E 5 ′ in the channel dimension; emphasis is placed on the pivotal feature channels.The specific process is described as follows: where, UP(•) represents the upsampling operation, obtaining the feature Moreover, "⊗" is element-wise multiplication, "⊕" is element-wise addition, GMP(•) is global max pooling, GAP(•) is global average pooling, MLP(•) is a multilayer perceptron, and σ(•) is the sigmoid normalization calculation.The obtained feature E c i ∈ R C 5 ×H 4 ×W 4 has the same scale as the input feature.
Subsequently, in spatial attention processing, based on the similarity between highlevel semantic features, we interactively process the feature information of E 4 and E 5 to obtain high-level features that have rich semantic information.Specifically, the spatial attention weights of features E 4 and E 5 are combined with neighboring features to achieve information interaction.The process is defined as follows: where Conv(•) 3 is a 3 × 3 convolution operation, and Cat(•) is channel-wise concatenation operation.The processed feature E s i ∈ R C i ×H 4 ×W 4 represents the feature after spatial attention processing, E h i ∈ R C i ×H 4 ×W 4 represents the feature after information interaction, and E c−s ∈ R C 4 ×H 4 ×W 4 represents the prominent target position semantic feature obtained from HSIF, which can provide assistance for the extraction of texture edge features.

Edge Feature Enhance Block (EFEB)
The high-level features obtained after information interaction contain rich global semantic information, which can represent the contextual position relationship of prominent targets.However, how to better utilize semantic information to combine with shallow features to extract target edge features is still crucial.In previous fusion methods [24,36], high-level features are usually directly fed into shallow features at various levels to enrich the semantic information at different levels.However, this operation ignores the significant differences between different-level features, and the diversity between the features at different levels causes the deep features to not play their own important role.To address this issue, we designed the EFEB to pass cross-semantic information layer by layer to shallow features, reducing the problem of partial loss of edge features caused by feature differences.The module's precise configuration is illustrated in Figure 3.
Entropy 2024, 26, x FOR PEER REVIEW 9 of 33 and  ∈ ℝ × × represents the prominent target position semantic feature obtained from HSIF, which can provide assistance for the extraction of texture edge features.

Edge Feature Enhance Block (EFEB)
The high-level features obtained after information interaction contain rich global semantic information, which can represent the contextual position relationship of prominent targets.However, how to better utilize semantic information to combine with shallow features to extract target edge features is still crucial.In previous fusion methods [24] [36], high-level features are usually directly fed into shallow features at various levels to enrich the semantic information at different levels.However, this operation ignores the significant differences between different-level features, and the diversity between the features at different levels causes the deep features to not play their own important role.To address this issue, we designed the EFEB to pass cross-semantic information layer by layer to shallow features, reducing the problem of partial loss of edge features caused by feature differences.The module's precise configuration is illustrated in Figure 3.In EFEB, the input deep semantic features are initially upsampled to the same size as the shallow texture edge features.After extracting differential features by element-wise multiplication with the input shallow texture features and concatenating them with the two inputs, the Sobel operator with residual structure is applied to strengthen edges, resulting in texture edge features of significant targets at different levels of the image.The specific implementation processes of the first and second EFEB modules are as follows: where (•) denotes a 1 × 1 convolution operation, (•) represents convolutional, normalization, and activation function layers, with convolutional layers being 3 × 3 convolutions, and s(•) is the Sobel operator. ∈ ℝ × × represents the input deep semantic features, which are also the output of the previous EFEB module. ′ ∈ ℝ × × stands for the upsampled deep semantic features,  ∈ ℝ × × represents ∈ R C 3 ×H 3 ×W 3 , while the first and second EFEBs take edge texture features E 2 ∈ R C 2 ×H 2 ×W 2 and E 1 ∈ R C 1 ×H 1 ×W 1 as inputs respectively, and the deep semantic features are the output features In EFEB, the input deep semantic features are initially upsampled to the same size as the shallow texture edge features.After extracting differential features by element-wise multiplication with the input shallow texture features and concatenating them with the two inputs, the Sobel operator with residual structure is applied to strengthen edges, resulting in texture edge features of significant targets at different levels of the image.The specific implementation processes of the first and second EFEB modules are as follows: where Conv(•) 1 denotes a 1 × 1 convolution operation, CBR(•) represents convolutional, normalization, and activation function layers, with convolutional layers being 3 × 3 convolutions, and s(•) is the Sobel operator.E edge i+1 ∈ R C i+1 ×H i+1 ×W i+1 represents the input deep semantic features, which are also the output of the previous EFEB module.bining texture edges with semantic position, and E edge i ∈ R C i ×H i ×W i signifies the enhanced significant target edge features output by the module.
The detailed implementation procedure of the third EFEB module unfolds as follows: where E c−s ′ ∈ R C 4 ×H 3 ×W 3 represents the result obtained after upsampling the output of the HSIF module, and E edge 3 ∈ R C 3 ×H 3 ×W 3 denotes the resulting features of the third EFEB module.In contrast to the first and second EFEB modules, its deep features originate from the HSIF module rather than other EFEB modules.Ultimately, the three EFEB modules obtain edge semantic features at three hierarchical levels,

Harsh Light Environment Awareness
Improving the anti-interference capability of fusion networks has always been a focal point of fusion algorithm research.As mentioned earlier, some fusion methods [39][40][41] have made targeted improvements to their robustness against low-light conditions, smoke occlusion, and image degradation interference.However, there is a lack of targeted research on enhancing the algorithm's resistance to harsh light interference, which is commonly encountered in scenes like nighttime roads.This lack of research leads to a significant deterioration in the fusion image quality under harsh light interference for most fusion methods.Some algorithms that exhibit some resistance to harsh light environment interference are unable to provide a reasonable explanation for their ability to resist harsh light interference due to the lack of targeted research and do not offer corresponding ablation experiments.Consequently, the anti-interference performance of these algorithms fails to provide substantial theoretical assistance for the research of other fusion networks resistant to harsh light interference, thereby greatly restricting the study of fusion network resistance to harsh light interference and subsequent method improvements.To address this issue and achieve a high-quality fusion of images under harsh light interference, thereby providing a theoretical foundation for subsequent research on networks resistant to harsh light interference, we conducted targeted research on harsh light interference and proposed a harsh light environment aware (HLEA) module.The module's intricate configuration is illustrated in Figure 4.
HLEA comprises an illuminance classification network [39], an illuminance decomposition network [47], and BRI-SQUE.Inspired by progressive illuminance-aware fusion networks [39], the distribution of information in visible and infrared images varies under different lighting conditions.Specifically, the main information in daytime scenes can be reflected in visible images, while nighttime infrared images harbor a wealth of pertinent information.Therefore, we utilize the illuminance classification network to estimate the probability of visible images being classified as daytime or nighttime categories, enabling fusion results under different scenes to selectively incorporate more pertinent information from either infrared or visible images.The specific implementation procedure unfolds as follows: where P d and P n represent the probabilities of the scene belonging to daytime or nighttime, respectively, N IC is the illuminance classification network, and I vi is the input visible image.
resistance to harsh light interference and subsequent method improvements.To address this issue and achieve a high-quality fusion of images under harsh light interference, thereby providing a theoretical foundation for subsequent research on networks resistant to harsh light interference, we conducted targeted research on harsh light interference and proposed a harsh light environment aware (HLEA) module.The module's intricate configuration is illustrated in Figure 4. Solely relying on the N IC to compute perceptual weights has its limitations.Under harsh light conditions, the overall brightness of visible images is high, leading to an increased bias in prediction probabilities.To address this issue, we introduce the illuminance decomposition network.This network disassembles the input image into an illuminance map and detail map, where the illuminance map effectively reflects the brightness distribution of the visible image.Given the harsh light background, the prediction of P d by N IC tends to be higher.At this time, the illuminance map also exhibits a higher average value for the first 50% of pixels.Therefore, we use the pixel mean of the illuminance map to correct the perceptual weights calculated from the predictions of the N IC .Consequently, the corrected perceptual weight of infrared features increases, while that of visible features decreases.This enhances the fusion of infrared thermal target information and reduces the fusion of meaningless harsh light information in visible images, thereby enhancing the effectiveness of image fusion under harsh light conditions.The specific process is as follows: where L represents the illuminance map of the visible image, N ID is the illuminance decomposition network, L pix f and L pixl represent the grayscale mean of the first 50% and the last 50% pixels of the illuminance map, L day and L night are the daytime and nighttime correction coefficients of the illuminance map, and W c vi and W c ir denote the corrected perceptual weights of visible and infrared after illuminance map correction.
In the third part, considering the different degrees of degradation of visible and infrared images under harsh light interference, we use the image quality assessment algorithm BRI-SQUE to evaluate the quality of the source images and calculate the perceptual weights based on the evaluation quality scores.The specific calculation process is as follows: where Q vi and Q ir represent the visible and infrared image quality scores computed using the parameter-free quality assessment algorithm BRI-SQUE.A higher score reflects worse image quality, while a lower score indicates better image quality, respectively.W q vi and W q ir represent the perceptual weights of visible and infrared data computed from the image quality scores.
Finally, the perceptual weights are calculated with N IC and N ID are corrected using BRI-SQUE-calculated perceptual weights to obtain the final result.The specific correction process is as follows: where W ir and W vi are the perceptual weights of infrared as well as visible, respectively, output by the HLEA module.Notably, in this context, a is a correction coefficient used to balance the weights obtained from the illumination assessment and the weights calculated with the image quality score evaluator.In our experiments, we empirically set it to 0.2 to balance these two factors.

Edge Guided Hierarchical Fusion
After obtaining the edge features of salient objects through the edge feature extraction module, directly concatenating or element-wise adding the edge features with the infraredvisible feature channels to integrate features can easily lead to edge noise or redundant information interference, making it difficult to fully utilize the edge features.To address this issue and enhance the guiding significance of salient object edge features, we propose an edge-guided hierarchical fusion (EGHF) module to handle the interaction and fusion of infrared-visible features and salient object edge features at different levels.Instead of directly concatenating or adding the edge features to the infrared-visible features for fusion, the edge features are used as guidance information for the infrared-visible features, achieving comprehensive feature fusion.The module's detailed arrangement is presented in Figure 5. where  and  represent the visible and infrared image quality scores computed using the parameter-free quality assessment algorithm BRI-SQUE.A higher score reflects worse image quality, while a lower score indicates better image quality, respectively. and  represent the perceptual weights of visible and infrared data computed from the image quality scores.Finally, the perceptual weights are calculated with  and  are corrected using BRI-SQUE-calculated perceptual weights to obtain the final result.The specific correction process is as follows: where  and  are the perceptual weights of infrared as well as visible, respectively, output by the HLEA module.Notably, in this context, a is a correction coefficient used to balance the weights obtained from the illumination assessment and the weights calculated with the image quality score evaluator.In our experiments, we empirically set it to 0.2 to balance these two factors.

Edge Guided Hierarchical Fusion
After obtaining the edge features of salient objects through the edge feature extraction module, directly concatenating or element-wise adding the edge features with the infrared-visible feature channels to integrate features can easily lead to edge noise or redundant information interference, making it difficult to fully utilize the edge features.To address this issue and enhance the guiding significance of salient object edge features, we propose an edge-guided hierarchical fusion (EGHF) module to handle the interaction and fusion of infrared-visible features and salient object edge features at different levels.Instead of directly concatenating or adding the edge features to the infrared-visible features for fusion, the edge features are used as guidance information for the infrared-visible features, achieving comprehensive feature fusion.The module's detailed arrangement is presented in Figure 5. Inspired by the multi-head self-attention mechanism in transformers [48], which captures long-range dependencies effectively, we capture the correlation between edge features and infrared-visible features and, based on this, complete the fusion guided by edge features.Specifically, we first upsample the edge features to match the scale of the visible-infrared features, where H and W represent the dimensions of the input image, including the height and width, and the infrared and visible features at each level, respectively.Then, we obtain query matrices E q i ∈ R HW×C ′ and key matrices E k i ∈ R C ′ ×HW through linear projection by convolutional layers, where C ′ = 2C i .The specific process is as follows. )) where v 1 (•) denotes the transformation operator that converts the matrix dimension from R A 1 ×A 2 ×A 3 to R A 2 A 3 ×A 1 , and v 2 (•) represents the transformation operator that converts the matrix dimension from denotes the edge features outputted by the i-th EFEB module, E edge ′ i represents the upsampled edge features, and E q i and E k i denote the query matrices and key matrices corresponding to the edge features.
After obtaining the query matrices E q i and key matrices E k i corresponding to the edge features, establishing a correlation between the edge features and the infrared-visible features, value matrices need to be extracted from the infrared-visible features.Firstly, the fusion of infrared features and visible features is performed using the perception weights W ir and W vi obtained from the HLEA module to control the feature input ratio for anti-harsh light interference fusion.The processed perception weights for infrared and visible features are F ir ′ i ∈ R C×H×W and F vi ′ i ∈ R C×H×W , respectively.Then, the differences and common features are highlighted separately by element-wise multiplication and addition of the two modalities.Finally, after concatenating the results of multiplication and addition with the input feature channels, the fused feature F ir−vi i ∈ R HW×C ′ is obtained through the CBR(•) layer.After obtaining the infrared-visible fusion feature, a value matrix F v i ∈ R HW×C ′ is obtained through linear projection by a convolutional layer.It is noteworthy that during the fusion process of visible and infrared features at the second and third levels, the fusion features F f usion i−1 ∈ R C×H×W from the previous layer is introduced.Therefore, there are differences in feature input between the fusion modules EGHF of the second and third levels and the EGHF of the first level.The process of deriving the value matrices for the fused features at the initial level proceeds as follows: )) where F ir 1 and F vi 1 denote the infrared and visible features of the first level, while F ir ′ 1 and F vi ′ 1 represent the infrared and visible features after perception weight processing.F ir−vi 1 stands for the integrated feature after merging the infrared and visible images, and F v 1 corresponds to the value matrix of the infrared-visible features.
In the integration of infrared and visible features in the second and third levels, a distinction from the first level is the concatenation of the fusion output F f usion i−1 from the previous level after the first CBR(•) layer and before the second CBR(•) layer.The detailed process is illustrated below: where F ir ′ i and F vi ′ i , respectively, represent the infrared and visible features of the i-th level after perceptual weight processing, and F ir−vi 1 represents the integrated features of infrared and visible images at the ith level, where i takes the values of 2 and 3, corresponding to the second and third levels.
Upon obtaining the query matrices E q i , key matrices E k i , and value matrices F v i , the calculation method of edge attention-guided weight matrix values follows the self-attention mechanism in the transformer.However, this approach results in attention weights of size (HW, HW) when computing through matrix multiplication of E q i and E k i .For example, for a 640 × 480 input image, the attention weight size is (307, 200, 307, 200), which is impractical for hardware computing resources.To address this issue, inspired by the Agent attention mechanism [49], we introduce an intermediate quantity F a i to reduce computational complexity.The specific calculation process is as follows: where  and E w i ∈ R C ′ ×H×W denotes the edge feature weight matrix.Introducing intermediate variables breaks down the process of computing the weight matrix into multiple steps, transforming space complexity into time complexity, thereby avoiding the unacceptable computational complexity of hardware resources.
Upon establishing the correlation between edge features and infrared-visible features, the attention weight matrix of edge features is obtained.Using this matrix, weights are allocated to the infrared-visible features, realizing feature fusion guided by edge features.The detailed process is outlined below: where F f usion i ∈ R C×H×W represents the output fusion feature of the EGHF module at the i-th level, where i takes values of 1, 2, and 3, each representing different levels of feature fusion.

Loss Function
To enhance the network's performance regarding qualitative visual perception and quantitative non-reference estimation indicators and to obtain high-quality fusion images under harsh lighting conditions, our model selects content loss and semantic loss to form a joint loss function, which is stated in the following manner: where L con signifies the content loss, L sem signifies the semantic loss, and λ represents the semantic loss guidance level of the fusion network training.The content loss L con consists of the following two components: where L int is the pixel intensity loss, which reflects the disparity in pixel values from the network input to the final output, and L texture is the detail texture loss, which reflects the difference from the network input to the final output in the texture detail.α is the equilibrium factor.The pixel intensity loss L int is formulated as follows: where H is the image height, W is the image width, ∥ • ∥ 1 is calculated as a norm of 1, I f used is the pixel intensity value of the fused output source, I ir is the pixel intensity value of the infrared light source, and the I vis is the pixel intensity value of the visible source.
In addition, in order to highlight salient targets while preserving additional meaningful texture structure information, we define the detail texture loss L texture as follows: where ∥ • ∥ 1 represents calculated as a norm of 1, | • | represents the absolute value operation, and ∇ represents the Sobel gradient operator of the image texture.
In order to enhance the comprehensibility of the scene in the fusion result, we introduce the following semantic segmentation network and semantic loss: [45] Among them, the categories of C semantic segmentation, I sem ∈ R C×H×W and I sem_a ∈ R C×H×W are the segmentation results and auxiliary segmentation results of the segmentation network output, respectively, and L s ∈ (1, C) H×W represents the result representation obtained from the single hot-coded label.The use of semantic loss can not only segment networks but also better assist the training of fusion networks.

Experiments
In this segment, to illustrate the superiority and advancement of our proposed fusion approach, we performed experiments to compare different approaches on multiple infrared and visible datasets and module ablation experiments.First, we presented the specifics of the experimental configurations and selection of evaluation metrics.Second, we compared our proposed fusion method with nine mainstream fusion methods in fusion tasks and high-level vision detection task, including three traditional methods, namely the following: CBF [20], NSST [19], and GTF [9], two AE-based methods, namely DIDFuse [29] and NestFuse [28], two GAN-based methods, namely FusionGAN [12] and GANMcC [34], and two CNN-based methods, namely PIAFusion [39] and DRF [33].Finally, we performed ablation experiments to validate the efficacy of the module design within our proposed methodology.

Experimental Details 4.1.1. Cooperative Training Strategy
For our proposed network model, we selected the MSRS [45] dataset, which was filtered and processed from the MFNet [50] dataset, for training.In the MSRS dataset, there are a total of 1083 training data comprising pairs of visible and infrared images.To prevent overfitting due to the small sample size, we applied data augmentation methods like random cropping, color jittering, random scaling, and horizontal flipping to original training samples.Additionally, we incorporated learning rate warm-up, learning rate decay, and the Adam optimization strategy to prevent training divergence due to inappropriate learning rates.
To include richer semantic information in the fused images, we have introduced a segmentation network and added segmentation semantic loss to the fusion loss.Therefore, designing a cooperative training algorithm for the fusion and segmentation networks to coordinate the training of both tasks becomes essential.For this reason, we have devised a cooperative training strategy as shown in Algorithm 1. Update the weight parameters of fusion network N f with the Adam optimizer: ∇ N f L f usion N f ; end Utilize the fusion network to generate fused images for the subsequent segmentation network training; for q steps do Selecting d fused images I 1 f , I 2 f . . .I d f from the output of the fusion network; Update the weight parameters of segmentation network N s with the Adam optimizer: In the training process of Algorithm 1, we first complete the pre-training of the illumination classification and decomposition sub-networks in the HLEA module before moving on to the cooperative training of the fusion and segmentation networks.These two processes are relatively independent.In one phase of the joint training, we first train the fusion network to obtain fused images, then pass these images into the segmentation network to train and obtain segmentation results.These results guide the subsequent training phase of the fusion network.Through multi-phase training, the fusion network can produce fused images that achieve better segmentation results, containing richer semantic information.It is noteworthy that the guidance provided by the segmentation network to the fusion network varies at different stages.In the initial stage, the segmentation results cannot accurately reflect the richness of semantic information in the fused images, so the segmentation network should have less influence on the fusion process.As training progresses, the accuracy of the segmentation results improves, and its influence on the fusion process should gradually deepen.This is achieved by adjusting the balancing parameter λ in the fusion loss L f usion .
In Algorithm 1, N IC , N ID , N f , and N s are the illumination classification sub-network, illumination decomposition sub-network, image fusion network, and semantic segmentation network, respectively.L class , L split , L f usion , and L sem represent the illumination classification loss function, illumination decomposition loss function, fusion loss function, and semantic segmentation loss function, respectively.m denotes the current training stage, M is the maximum training stage set to M = 4, p is the maximum number of iterations for the fusion network set to p = 2700, and q represents the maximum number of iterations for the segmentation network set to q = 20000.λ = τ 1 + e −(m−1) is the variation function of the semantic loss balance parameter λ, with τ set to 3.15.As the training progresses, the accuracy of the segmentation network gradually increases, enhancing the guiding significance of the semantic loss, so λ should increase gradually.Given the non-linear improvement of network performance during training, the rate of increase of λ should slow down, which is consistent with the designed variation function of the balance parameter λ.
Furthermore, we set the texture loss coefficient to α = 10 and the auxiliary semantic loss coefficient to β = 0.1.Finally, we implemented the model on the PyTorch platform and optimized our proposed fusion algorithm using the Adam optimizer based on the designed loss functions.Additionally, all experiments were conducted using NVIDIA-RTX4080 graphics processing units (GPUs) and Intel i9-13900K central processing units (CPUs).

Fusion Evaluation Metrics
Quantitative evaluation of fusion results can be approached from the perspective of the fusion task as well as high-level vision tasks.Specifically, for the fusion task, quantitative evaluation of fused image quality usually requires the use of no-reference metrics.In our work, we used EN [51], FMI pixel [52], SD [53], Q ab f [54], MS-SSIM [55], VIF [56], and PSNR [57] to quantitatively evaluate our model and comparative methods on the TNO [58], MFNet, and M3FD [59] datasets.These metrics evaluate image quality from various perspectives.EN and FMI pixel evaluate from an information theory perspective.EN measures the information content in the fused image, and FMI pixel assesses the mutual information between the fused and original images at the pixel level.MS-SSIM and Q ab f evaluate from the perspective of structural similarity between the source and fused images.MS-SSIM measures the consistency of multi-scale structures, while Q ab f assesses the preservation of information in the fused image.SD and PSNR are evaluated from the perspective of image characteristics.SD represents the standard deviation of the image, indicating its contrast, while PSNR quantifies the difference between the fused and source images, used to assess image reconstruction quality.VIF assesses from a human perception-inspired perspective, estimating image quality based on the fidelity of perceived information, in line with human visual perception.By using a set of multifaceted no-reference metrics, we aim to ensure a robust and comprehensive evaluation of fusion performance.Additionally, in selecting no-reference evaluation metrics, we ensured that these metrics are widely recognized in the image fusion field to ensure their objectivity in assessing the quality and effectiveness of fused images.

Detection Evaluation Details
An important objective of fusion tasks is to aid subsequent high-level vision tasks.However, the non-parametric evaluation indicators of fusion image quality may not always align perfectly with performance in high-level vision tasks, which can lead to an incomplete assessment of fusion results from the perspective of fusion tasks alone.Therefore, to further expound upon the efficacy of our proposed method in aiding high-level vision tasks, particularly in environments with harsh light interference, we opted to assess the performance of our model and comparative methods in the most representative high-level vision task, namely the detection task.Specifically, we opted to use the YOLOv5 [60] object detection algorithm to assess the effectiveness of fusion images in high-level vision tasks.We trained the detection network using a dataset comprising 500 manually annotated visible images infrared and images with key category labels for people and cars from the MFNet dataset.Subsequently, we conducted object detection on 40 normal background scenes selected from the MFNet to evaluate our model's performance in normal background scenes and on 25 harsh light background scenes from the MFNet and 15 high-exposure scenes from the RoadScene dataset, totaling 40 special scenes, to evaluate our model's performance in harsh light background scenes.For the quantitative evaluation of detection results, we chose F1_score, detection accuracy at 50% confidence (AP@50), detection precision at 70% confidence (AP@70), detection precision at 90% confidence (AP@90), and average detection accuracy from 50% to 95% confidence (AP@[50:95]) as evaluation parameters to appraise the quality of detection results.

Comparative Experiments on the TNO Dataset
To validate the performance of our proposed method in general scenes, we first selected 26 different scenes from the TNO dataset, a standard dataset comprising both visible and infrared images in regular scenarios, to qualitatively and quantitatively evaluate our model and comparative methods from the perspective of fusion tasks.

Qualitative Comparison
The qualitative results of two typical image pairs from the TNO dataset are displayed in Figure 6.Notably, the red boxes in the figure are used to zoom in on key target objects and texture background regions within the images, providing a clearer perspective.
mance, with the prominent human target being disturbed by a large amount of useless noise information, blurry edges, and significant feature loss; NSST, GTF, and FusionGAN methods severely blurred the edges and texture details of humans; DIDFuse, DRF, and GANMcCmethods preserved the overall significant information of human targets but still suffered from the problem of useless information interference caused by partial loss of significant features and weakening of the prominence of human targets.Only NestFuse, PIAFusion, and our method could better preserve the texture of human targets while avoiding the loss of significant information.Unfortunately, NestFuse and PIAFusion had serious flaws in handling the texture of infrared images, as shown in the sky cloud texture area, where the texture details were completely lost.Therefore, only our method could highlight prominent infrared target details while preserving intricate texture features.In the first scene of the presentation of the qualitative results, we used red boxes to select the prominent human target area and the sky cloud texture area.Regarding the fusion results of the prominent human target area, CBF showed the worst fusion performance, with the prominent human target being disturbed by a large amount of useless noise information, blurry edges, and significant feature loss; NSST, GTF, and FusionGAN methods severely blurred the edges and texture details of humans; DIDFuse, DRF, and GANMcCmethods preserved the overall significant information of human targets but still suffered from the problem of useless information interference caused by partial loss of significant features and weakening of the prominence of human targets.Only NestFuse, PIAFusion, and our method could better preserve the texture of human targets while avoiding the loss of significant information.Unfortunately, NestFuse and PIAFusion had serious flaws in handling the texture of infrared images, as shown in the sky cloud texture area, where the texture details were completely lost.Therefore, only our method could highlight prominent infrared target details while preserving intricate texture features.
The second scenario of the qualitative presentation is a typical challenging fusion scenario attributable to the wide-ranging smoke occlusion in the visible image.In the fusion results of this scene, CBF still introduced a large amount of useless noise; the human targets in the fusion results of NestFuse and PIAFusion were almost completely lost; the saliency of human targets in the DIDFuse method significantly decreased.Among the methods that highlighted the saliency of human targets, NSST, GTF, FusionGAN, GANMcC, and DRF all showed varying degrees of edge blurring of human targets.Only our method, due to the introduction of the edge guidance module, greatly emphasized the edge features of human targets, making them well-distinguished from the smoky background.Considering the performance of the two scenes together, only our method could better preserve richer texture details while enhancing the salient information of infrared targets.

Quantitative Comparison
For a more objective assessment of our proposed method's performance on the TNO dataset, we proceeded with additional quantitative evaluations of the method results.The quantitative assessment results of 26 image pairs in the TNO dataset are depicted in Table 1.Our proposed approach exhibits superior performance in terms of EN, MS-SSIM, PSNR, FMI pixel , and Q ab f , ranking second in SD, and falling slightly behind NestFuse and PIAFusion in the VIF metric.These findings suggest that the HLEA module we designed does not entail sacrificing fusion performance in normal scenes entirely for the sake of harsh light special scenes.In summary, our approach retains more meaningful information strongly correlated with input images, thereby achieving the overall best performance qualitatively and objectively.

Comparative Experiments on the MFNet Dataset
Scenes in the TNO dataset are generally simple, with a lower number of objects included.To substantiate the efficacy of our proposed approach in more complex scenes, we further selected 25 scenes from the MFNet dataset for qualitative and quantitative evaluations of our model and alternative methods in terms of fusion tasks.Unlike the TNO dataset, where both visible and infrared spectra are represented in grayscale, the MFNet dataset consists of color images for visible.In the fusion process, initially, we transform the visible images from the RGB color space to the YCrCb color space.Then, we integrate the luminance channel Y with the infrared images.Afterward, we merge the fused image with the Cr and Cb channels of the visible images.Finally, the resultant image is reconverted to the RGB color space to derive the color fusion image.
It is worth noting that although we selected a set of widely accepted no-reference metrics in the fusion field for multi-angle image evaluation, these metrics are generally effective only for evaluating fused images under normal conditions and have limitations in evaluating fused images under harsh lighting conditions.In Figure 7, we demonstrate this phenomenon by showing fused images under normal and harsh light conditions and using the no-reference metric VIF, which is consistent with human visual perception.
From the images, it can be seen that under normal conditions, the VIF metric for fused images aligns with human subjective perception.However, under harsh light conditions, the VIF metric for fused images with harsh light interference is higher, deviating from actual perception.Therefore, we adopt different evaluation methods for fused images under normal and harsh light conditions in the MFNet.For the former, we use a combination of conventional qualitative result presentation and quantitative no-reference metric estimation, while for the latter, we use a combination of qualitative result presentation and mean opinion score (MOS) image quality evaluation.By supplementing the MOS subjective evaluation of fused images under harsh light conditions, we can more intuitively and effectively demonstrate the superior performance of our method in image fusion under harsh lighting conditions.
Entropy 2024, 26, x FOR PEER REVIEW 20 of 33 It is worth noting that although we selected a set of widely accepted no-reference metrics in the fusion field for multi-angle image evaluation, these metrics are generally effective only for evaluating fused images under normal conditions and have limitations in evaluating fused images under harsh lighting conditions.In Figure 7, we demonstrate this phenomenon by showing fused images under normal and harsh light conditions and using the no-reference metric VIF, which is consistent with human visual perception.From the images, it can be seen that under normal conditions, the VIF metric for fused images aligns with human subjective perception.However, under harsh light conditions, the VIF metric for fused images with harsh light interference is higher, deviating from actual perception.Therefore, we adopt different evaluation methods for fused images under normal and harsh light conditions in the MFNet.For the former, we use a combination of conventional qualitative result presentation and quantitative no-reference metric estimation, while for the latter, we use a combination of qualitative result presentation and mean opinion score (MOS) image quality evaluation.By supplementing the MOS subjective evaluation of fused images under harsh light conditions, we can more intuitively and effectively demonstrate the superior performance of our method in image fusion under harsh lighting conditions.

Qualitative Comparison (1) Fusion Results Display
Figure 8 illustrates the intuitive qualitative results of two representative image pairs in the MFNet dataset.Scene one depicts daytime with harsh light interference, while scene two portrays nighttime under normal background conditions.The red boxes in the images zoom in on important target objects and textured background areas.
In the qualitative visual performance of the two typical scenes in Figure 7, the results generated by our proposed method noticeably outperform those generated by other methods.This is because, compared to most advanced fusion algorithms, our method can effectively suppress harsh light in the fusion results when there is harsh light interference in the scene, while maintaining the significance of targets susceptible to thermal radiation and the richness of texture details under harsh light occlusion.The capability of our method to handle harsh light interference is crucial, as the removal of harsh light is decisive for enhancing performance in subsequent high-level vision tasks such as object detection.Furthermore, in typical road scenes, our fusion method still visibly preserves a large amount of texture details while retaining sufficient thermal radiation information.According to the qualitative results of image fusion in two typical scenes, we can classify the selected comparative methods into two groups based on their ability to remove harsh light interference.The first group includes DIDFuse, NestFuse, and PIAFusion, which cannot effectively remove harsh light information from the fusion results.For example, in scene one, the fusion results of NestFuse and PIAFusion are almost completely obscured by harsh light information, and a large amount of vehicle texture in DID-Fuse is submerged in harsh light.The second group includes CBF, NSST, GTF, Fu-sionGAN, GANMcC, and DRF, which have essentially removed harsh light from the fusion results.However, due to the lack of targeted research on harsh light removal, the In the qualitative visual performance of the two typical scenes in Figure 7, the results generated by our proposed method noticeably outperform those generated by other methods.This is because, compared to most advanced fusion algorithms, our method can effectively suppress harsh light in the fusion results when there is harsh light inter-ference in the scene, while maintaining the significance of targets susceptible to thermal radiation and the richness of texture details under harsh light occlusion.The capability of our method to handle harsh light interference is crucial, as the removal of harsh light is decisive for enhancing performance in subsequent high-level vision tasks such as object detection.Furthermore, in typical road scenes, our fusion method still visibly preserves a large amount of texture details while retaining sufficient thermal radiation information.
According to the qualitative results of image fusion in two typical scenes, we can classify the selected comparative methods into two groups based on their ability to remove harsh light interference.The first group includes DIDFuse, NestFuse, and PIAFusion, which cannot effectively remove harsh light information from the fusion results.For example, in scene one, the fusion results of NestFuse and PIAFusion are almost completely obscured by harsh light information, and a large amount of vehicle texture in DIDFuse is submerged in harsh light.The second group includes CBF, NSST, GTF, FusionGAN, GANMcC, and DRF, which have essentially removed harsh light from the fusion results.However, due to the lack of targeted research on harsh light removal, the thermal target texture details under harsh light occlusion in the fusion results are blurred to varying degrees.For example, in scene one, the vehicle texture under harsh light occlusion is most severe in methods such as NSST, GTF, FusionGAN, and DRF.Only our method, while removing harsh light interference, retains sufficient thermal radiation information and rich texture details.In typical scenes, only our method in the second group maintains the same fusion performance as NestFuse and PIAFusion, maintaining sufficient texture information and enhanced details during the fusion process, making the fused images appear more natural.In summary, our approach amalgamates the benefits of the aforementioned two methodologies, preserving rich texture details and thermal target prominence while avoiding harsh light background interference.
(2) MOS Subjective Quality Evaluate The MOS is a subjective evaluation method employed to assess the quality of images, audio, or video.It represents the overall perceived quality of the media by summarizing ratings from several observers.We opted to employ this method to assess the quality of fused images in high-light conditions.In the experiment, we gathered the mean opinion scores from 15 reviewers for 40 fused images under high exposure or bright light conditions in five dimensions: image clarity, color fidelity, light resistance, artifact noise, and overall impression.The corresponding overall MOS was then calculated.Each dimension was rated on a discrete ordinal scale, with a total of 5 points corresponding to 5 levels.Scores from 1 to 5 indicate very poor, poor, average, good, and excellent quality for that dimension, respectively.The final MOS table is presented in Table 2.
Table 2. Mean opinion scores for fused images under harsh light conditions.Each dimension is rated from 1 to 5, with higher scores signifying excellent performance in that dimension and lower scores indicating the opposite.The results in red represent the best outcome, and those that are blue indicate suboptimal results.

Image Clarity
Color In the subjective quality assessment of fused images under harsh light conditions, our method achieves optimal performance in image clarity, color fidelity, and overall impression, and suboptimal performance in light resistance and noise artifacts.Based on the average opinion scores across the five dimensions, the subjective visual performance of our method is the best among all methods.Furthermore, it is noteworthy that the best three methods for the light resistance metric are FusionGAN, our method, and GANMcC, while the worst are NestFuse and PIAFusion.This aligns with the harsh light removal performance in fused images under harsh light conditions as shown earlier, demonstrating that our method effectively mitigates the severe interference caused by harsh light for viewers.

Quantitative Comparison
To impartially illustrate the efficacy of our suggested approach in complex scenes, we carried out quantitative non-reference evaluation index testing on a dataset consisting of 25 pairs of images from MFNet, and the findings are outlined in Table 3.It is important to note that, as previously mentioned, due to the limitations of non-parametric evaluation metrics, we only perform non-parametric evaluations on fused images of normal background scenes here.In Section 4.5, "High-level vision task-driven evaluation", we quantitatively assess the performance of our model in scenarios with harsh light interference.From the results in Table 3, it is evident that in typical scenes, our approach achieves optimal performance concerning EN, MS-SSIM, FMI pixel , and Q ab f , and the second best performance in terms of SD and VIF, while PIAFusion achieves the best performance in SD and VIF, and the second best performance in EN, MS-SSIM, FMI pixel , and Q ab f .This indicates that while maintaining the same high level of fusion image visual naturalness as PIAFusion, our approach retains a greater amount of texture and edge information from the source image.In summary, our method attains the highest standard of objective evaluation, ensuring sufficient thermal radiation information while maintaining a significant amount of texture intricacies.

Comparative Experiments on the M3FD Dataset
It is commonly acknowledged that the ability of deep learning methods to generalize is also a crucial aspect of assessing the quality of their models.Given that our model underwent training on the MFNet-based MSRS dataset, to substantiate the generalization performance of our model, we further conducted qualitative and quantitative evaluation carried out on 30 pairs of images sourced from the M3FD dataset using our model and comparative methods.It is pertinent to mention that both the visible images in the M3FD and MFNet datasets are in color, so we obtained color fusion images on the M3FD dataset following the steps of image fusion in the MFNet dataset.infrared and visible images in this scene.Unfortunately, NestFuse and PIAFusion performed poorly in the smoke interference scene in scene one.In summary, considering the performance in both scenes, our method uniquely excels in enhancing the saliency of infrared targets while simultaneously preserving richer texture details.

Quantitative Comparison
We further conducted quantitative estimation experiments on fusion methods in M3FD, and the outcomes are depicted in Table 4.In the M3FD generalization dataset scene, our approach attains the highest performance in MS-SSIM and  and the runner-up best performance in VIF,  , and PSNR.This indicates that our method still retains sufficient source image texture and edge information and has good visual fidelity on non-training datasets, achieving the best overall level with good generalization performance.In the first scene of the qualitative results display, due to the occlusion of smoke in the visible image, meaningful salient targets and texture details are almost entirely concentrated in the infrared image.This poses a serious problem for algorithms that prioritize infrared image salient object pixel intensity and clear visible image target texture details as a prior assumption for image fusion.From the fusion results, the texture details in the branch areas of CBF, NSST, GTF, NestFuse, and PIAFusion are almost completely lost or obscured by smoke, and there are varying degrees of loss in the texture of branch areas in FusionGAN, GANMcC, and DRF as well.Correspondingly, the prominence of infrared targets in the fusion outcomes of these algorithms is also weakened to varying degrees.Only DIDFusion and our method can uphold the prominence of infrared thermal targets while preserving the complete texture details of the branch areas.However, compared to the fusion results of DIDFusion, our fusion outcomes exhibit a more natural appearance and demonstrate improved consistency with human visual perception.
Within the second scenario of the qualitative results display, we set two regions of interest, one is the area of human targets and clothing texture, and the other is the area of glass texture.Across the fusion outcomes produced by various methods, although CBF, GTF, DIDFuse, FusionGAN, GANMcC, and DRF all retained the basic saliency of human targets, the texture details of human clothing appeared significantly blurred.Only NSST, NestFuse, PIAFusion, and our method retained the texture details of clothing completely, but NSST sacrifices visible light image texture for the retention of infrared image clothing texture details.In NSST, the texture area behind the glass is completely lost, indicating severe loss of visible image information in the fusion process.Therefore, only NestFuse, PIAFusion, and our method excel in the preservation of the texture details of both the infrared and visible images in this scene.Unfortunately, NestFuse and PIAFusion performed poorly in the smoke interference scene in scene one.In summary, considering the performance in both scenes, our method uniquely excels in enhancing the saliency of infrared targets while simultaneously preserving richer texture details.

Quantitative Comparison
We further conducted quantitative estimation experiments on fusion methods in M3FD, and the outcomes are depicted in Table 4.In the M3FD generalization dataset scene, our approach attains the highest performance in MS-SSIM and Q ab f and the runner-up best performance in VIF, FMI pixel , and PSNR.This indicates that our method still retains sufficient source image texture and edge information and has good visual fidelity on nontraining datasets, achieving the best overall level with good generalization performance.

High-Level Vision Task-Driven Evaluation
Fusion tasks should not only improve the depth of information encapsulated in the fused image scenes but also assist in high-level vision tasks.Therefore, we evaluated qualitatively and quantitatively assessments of the fusion outcomes achieved by our method as well as other methods in the high-level vision task of target detection to demonstrate the effectiveness and superiority of our method in assisting high-level vision tasks, especially in scenes with conflicting harsh light interference.

Quantitative Comparison (1) Harsh light scene detection results
Quantitative evaluation results of detection in harsh light and high-exposure scenes are shown in Table 5.From the outcomes, it becomes apparent that in harsh light special scenes, the infrared image achieves the best level of detection accuracy in terms of F1 score and 50% confidence level.This indicates that the infrared image information is indispensable for improving the detection performance of fusion results in harsh light environments, and our method effectively utilizes the thermal radiation target information in the fusion process, achieving detection performance similar to that of the infrared image in terms of F1_Score and Ap@50 metrics.Among fusion algorithms capable of removing harsh light interference, NSST and FusionGAN can also maintain basic detection accuracy performance at low confidence levels.However, due to the loss of thermal radiation target edges and crucial textural intricacies evident in the fusion outcomes, the detection performance decreases at high confidence levels and average confidence levels.Additionally, compared to GANMcC, which can effectively remove harsh light and include basic texture details, our method shows better overall performance in detection accuracy at different confidence levels, indicating that our method contains richer semantic information in the fusion results.It is worth noting that two fusion algorithms, NestFuse and PIAFusion, which cannot remove harsh light interference, both fail to achieve satisfactory detection results in harsh light interference scenes.(2) Normal light scene detection results Quantitative evaluation results of detection in normal background are shown in Table 6.From the results, it can be seen that algorithms such as NSST, DIDFuse, FusionGAN, and GANMcC, which perform well under harsh light interference, fail to reach top-tier detection performance.However, due to our algorithm's emphasis on the richness of semantic information in fusion results and the performance of assisting advanced visual tasks, it still maintains a leading position in detection results at high confidence levels and average confidence levels.In conclusion, our proposed fusion method can effectively assist detection tasks under harsh light interference while ensuring high-level detection results in normal background environments.In the first scenario of Figure 9, PIAFusion exhibits deviation in detection bounding boxes due to overexposure interference, while among the methods capable of removing harsh light interference, GTF and FusionGAN exhibit vehicle and pedestrian omissions due to the blurriness of target edges and textures in fusion results, and DRF generates detection box deviations.Among all fusion methods, only CBF, NSST, DID-Fuse, NestFuse, GANMcC, and our method complete the target detection task comprehensively.In the more adverse lighting conditions of the second scenario in Figure 9, NestFuse fails to detect any targets due to its inability to remove harsh light interference, and CBF, NSST, and GANMcC suffer from target omissions due to the loss of semantic information during fusion.Only DIDFuse and our method complete the target detection comprehensively.However, regrettably, DIDFuse's excellent performance under harsh light comes at the cost of sacrificing normal environment detection performance.In the second nighttime scenario of Figure 10, DIDFuse's fusion result loses almost all thermal target information, resulting in severe omissions, while our result maintains good detection performance comparable to NestFuse and PIAFusion.This indicates that our HLEA module in fusion results does not blindly remove harsh light by favoring infrared images nor does it ignore its detection performance in normal scenes.In comparison to other edge extraction algorithms, our edge feature extractor is specifically tailored for the subsequent feature fusion in the fusion network, ensuring that the extracted edge features better meet the needs of the following modules.To illustrate the effectiveness and specificity of our design, we compared the edge images decoded from the edge features obtained by the final EFEB block of our feature extractor with those from four different image edge extraction algorithms.The four selected algorithms are Laplacian edge extraction [61], Canny edge extraction [62], RCF edge extraction [63], and FDoG edge extraction [64].The comparison of edge images in two typical scenarios is presented in Figure 12.In comparison to other edge extraction algorithms, our edge feature extractor is specifically tailored for the subsequent feature fusion in the fusion network, ensuring that the extracted edge features better meet the needs of the following modules.To illustrate the effectiveness and specificity of our design, we compared the edge images decoded from the edge features obtained by the final EFEB block of our feature extractor with those from four different image edge extraction algorithms.The four selected algorithms are Laplacian edge extraction [61], Canny edge extraction [62], RCF edge extraction [63], and FDoG edge extraction [64].The comparison of edge images in two typical scenarios is presented in Figure 12.From the overall edge results, our method demonstrates significant advantages in maintaining edge coherence and noise reduction clarity, although it is not as strong in edge line pixel intensity as DRF and FDoG.This result aligns perfectly with the fusion task's requirements.In the subsequent feature fusion process, we use edge features to guide the fusion, whereas the excessive noise in DRF and FDoG could seriously interfere with the accuracy of feature fusion.Furthermore, overly thick edge lines may lead to artifacts at object edges in the fused image, thus lowering the image quality.Additionally, our edge feature extractor is designed with three EFEB modules, each extracting edge fea- From the overall edge results, our method demonstrates significant advantages in maintaining edge coherence and noise reduction clarity, although it is not as strong in edge line pixel intensity as DRF and FDoG.This result aligns perfectly with the fusion task's requirements.In the subsequent feature fusion process, we use edge features to guide the fusion, whereas the excessive noise in DRF and FDoG could seriously interfere with the accuracy of feature fusion.Furthermore, overly thick edge lines may lead to artifacts at object edges in the fused image, thus lowering the image quality.Additionally, our edge feature extractor is designed with three EFEB modules, each extracting edge features at different levels, to guide feature fusion from shallow to deep layers.This specific design makes our method more suitable for subsequent fusion tasks compared to other edge extraction algorithms.

HLEA Module Effectiveness Analysis
To more effectively demonstrate the rationale and validity of the illumination classification, illumination decomposition, and image quality estimation blocks within our HLEA Figure 13 design, we three representative scenarios under both normal and harsh light conditions for detailed presentation.These scenarios allow us to demonstrate the performance of the HLEA module in practical applications and thoroughly analyze its intermediate calculations and final outcomes under varying lighting conditions.Specifically, Figure 13 presents the detailed application of the HLEA module in these scenarios, including illumination classification results, illumination decomposition effects, image quality estimation results, intermediate weighting results, and the final weighting results.The first and second rows in the figure display the daytime and nighttime scenarios under normal conditions, respectively.In these two scenarios, the illumination classification network successfully categorizes the scenes as either daytime or nighttime.On this basis, the illumination maps generated by the illumination decomposition network have little effect on the weights calculated from the classification results.However, the values of  and  behave too extremely in this situation, so correction is required.The corrected  and  values, calculated by the image quality estimator, achieve a balanced adjustment between the infrared and visible components in the final weights while maintaining the inclination of the classification and decomposition network's weight preferences.
The third row in the figure depicts a harsh light environment scenario.In this scenario, the category probabilities calculated by the illumination classification network show bias, where the weight for the visible component is too high, and the weight for the infrared component is too low.After adjustment by the illumination map, the weight of the visible component was effectively suppressed, and the weight of the infrared component was increased.Subsequently, after correction by the image quality estimator, the final perceptual weight for the infrared component became significantly higher than the visible light component, aligning with our expected results.Combining the intermediate and final results across the three typical scenarios, our designed HLEA performs as expected under normal conditions.4.6.3.Ablation Experiments The first and second rows in the figure display the daytime and nighttime scenarios under normal conditions, respectively.In these two scenarios, the illumination classification network successfully categorizes the scenes as either daytime or nighttime.On this basis, the illumination maps generated by the illumination decomposition network have little effect on the weights calculated from the classification results.However, the values of W c vi and W c ir behave too extremely in this situation, so correction is required.The corrected W q vi and W q ir values, calculated by the image quality estimator, achieve a balanced adjustment between the infrared and visible components in the final weights while maintaining the inclination of the classification and decomposition network's weight preferences.
The third row in the figure depicts a harsh light environment scenario.In this scenario, the category probabilities calculated by the illumination classification network show bias, where the weight for the visible component is too high, and the weight for the infrared component is too low.After adjustment by the illumination map, the weight of the visible component was effectively suppressed, and the weight of the infrared component was increased.Subsequently, after correction by the image quality estimator, the final perceptual weight for the infrared component became significantly higher than the visible light component, aligning with

Ablation Experiments
In our work, we have designed multiple modules that can be categorized into two parts based on their functionalities.The HSIF and EFEB modules in edge feature extraction, along with the EGHF module in feature fusion, seek to enhance edge characteristics and surface texture details of thermal radiation targets, thereby enhancing the semantic richness of the fusion results.figuration of HLEA primarily seeks to mitigate harsh light interference and achieve anti-harsh light interference fusion for the algorithm.To demonstrate the effectiveness of these two parts of module design, we conducted targeted ablation experiments on the two parts of modules.Specifically, the model that removes the modules related to rich semantic information but includes the HLEA module is denoted as M1, and the model that includes the modules related to rich semantic information but lacks HLEA module is denoted as M2.The model that removes both parts of modules is denoted as M3, and our original model is represented as M4.The quantitative and qualitative findings from the experiments are illustrated in Figure 14 and Table 7.

Conclusions
In order to solve the impact of harsh light interference on the quality of fused images in nighttime road environments, which leads to a sharp decline in fusion algorithm performance, we propose an image fusion network that effectively copes with harsh light interference.The edge feature extraction module optimizes information entropy to retain key features.The HLEA module addresses the degradation in fusion quality caused by the conflict between low information entropy and high pixel values in image-harsh regions.And the EGHF module achieves robust feature fusion by minimizing irrelevant noise entropy and maximizing useful information entropy.The experimental outcomes reveal that compared with the existing advanced fusion algorithms, the proposed algorithm has significant advantages in the quality of fusion images and the performance of subsequent advanced visual tasks, effectively solving the influence of harsh light interference on image fusion performance in the night road environment, and provides harsh technical support for composite high-level vision tasks, such as night vehicle assisted driving.
Conflicts of Interest: Author Zhenlin Lu was employed by the company Beijing Microelectronics Technology Institute.The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.Table 7.The quantitative analysis for the average value of three fusion metrics, i.e., F1_score, mAP@50, and mAP@[50:95] on MFNet and RoadScene dataset with the ablation studies of our methods.The red and blue denote the best and second value.

Conclusions
In order to solve the impact of harsh light interference on the quality of fused images in nighttime road environments, which leads to a sharp decline in fusion algorithm performance, we propose an image fusion network that effectively copes with harsh light interference.The edge feature extraction module optimizes information entropy to retain key features.The HLEA module addresses the degradation in fusion quality caused by the conflict between low information entropy and high pixel values in image-harsh regions.And the EGHF module achieves robust feature fusion by minimizing irrelevant noise entropy and maximizing useful information entropy.The experimental outcomes reveal that compared with the existing advanced fusion algorithms, the proposed algorithm has significant advantages in the quality of fusion images and the performance of subsequent advanced visual tasks, effectively solving the influence of harsh light interference on image fusion performance in the night road environment, and provides harsh technical support for composite high-level vision tasks, such as night vehicle assisted driving.

Figure 1 .
Figure 1.The architecture of the harsh light environment-aware visible and infrared image fusion network.The network backbone is composed of four components: multi-modal feature extraction, edge feature extraction, harsh light environment awareness, and image decoding and reconstruction.

Figure 1 .
Figure 1.The architecture of the harsh light environment-aware visible and infrared image fusion network.The network backbone is composed of four components: multi-modal feature extraction, edge feature extraction, harsh light environment awareness, and image decoding and reconstruction.

Figure 2 .
Figure 2. The architecture of the high-level semantic feature interactive fusion (HSIF) module.The channel attention mechanism reallocates channel weights for edge features  and  and then constructs an interactive spatial attention mechanism to obtain more comprehensive global semantic information.

Figure 2 .
Figure 2. The architecture of the high-level semantic feature interactive fusion (HSIF) module.The channel attention mechanism reallocates channel weights for edge features E 4 and E 5 and then constructs an interactive spatial attention mechanism to obtain more comprehensive global semantic information.

Figure 3 .
Figure 3.The architecture of the edge feature enhancement block (EFEB).Detail features are fused with deep features and then enhanced with the residual Sobel operator to strengthen the edges of salient objects.In the edge feature extraction module, there are three EFEB in total, where the third EFEB takes shallow edge texture features  ∈ ℝ × × and cross-deep semantic features  as inputs, outputting feature  ∈ ℝ × × , while the first and second EFEBs take edge texture features  ∈ ℝ × × and  ∈ ℝ × × as inputs respectively, and the deep semantic features are the output features  ∈ ℝ × × and  ∈ ℝ × × from the previous EFEB module.

Figure 3 .
Figure 3.The architecture of the edge feature enhancement block (EFEB).Detail features are fused with deep features and then enhanced with the residual Sobel operator to strengthen the edges of salient objects.In the edge feature extraction module, there are three EFEB in total, where the third EFEB takes shallow edge texture features E 3 ∈ R C 3 ×H 3 ×W 3 and cross-deep semantic features E c−s as inputs, outputting feature E edge 3

Figure 4 .
Figure 4.The architecture of the harsh light environment aware (HLEA) module.The module completes the prediction of the ambient light intensity level for input infrared and visible images Figure 4.The architecture of the harsh light environment aware (HLEA) module.The module completes the prediction of the ambient light intensity level for input infrared and visible images through three parts: illuminance decomposition sub-network, illuminance classification sub-network, and image quality assessment.

Figure 5 .Figure 5 .
Figure 5.The architecture of the edge-guided hierarchical fusion (EGHF) module.In the edge guidance part, the infrared and visible features are fused according to the perceptual weights and then complete the redistribution of attention weights under the guidance of edge features.In the feature Figure 5.The architecture of the edge-guided hierarchical fusion (EGHF) module.In the edge guidance part, the infrared and visible features are fused according to the perceptual weights and then complete the redistribution of attention weights under the guidance of edge features.In the feature enhancement part, the infrared and visible features are added to the features guided by edges to achieve feature enhancement.

Algorithm 1 :
Adaptive Cooperative Training Strategy Network Input: Visible image I vi and infrared image I ir Network Output: Fused image I f used Select a visible images and labels I 1 vi , L 1 vi , . . .I a vi , L a vi ; Train illumination classification network N IC with the Adam optimizer: ∇ N IC {L class (N IC )}; Select b visible images I 1 vi , I 2 vi . . .I b vi ; Train illumination decomposition network N IS by the Adam optimizer: ∇ N ID L split (N ID ) ; for m ≤ Max iterations M do for p steps do Select c visible images I 1 vi , I 2 vi . . .I c vi ; Select c infrared images I 1 ir , I 2 ir . . .I c ir ; Update the balancing parameter λ according to λ = τ 1 + e −(m−1)

Figure 6 .
Figure 6.Qualitative results for the TNO dataset images.The images in the red boxes show texture details in the original images or magnified details of important targets.

Figure 6 .
Figure 6.Qualitative results for the TNO dataset images.The images in the red boxes show texture details in the original images or magnified details of important targets.

Figure 7 .
Figure 7. VIF metric results for fused images in normal and harsh light conditions.The first row and the second row correspond to the fusion results in harsh light and normal environments, respectively.

Figure 7 .
Figure 7. VIF metric results for fused images in normal and harsh light conditions.The first row and the second row correspond to the fusion results in harsh light and normal environments, respectively.4.3.1.Qualitative Comparison (1) Fusion Results Display Figure 8 illustrates the intuitive qualitative results of two representative image pairs in the MFNet dataset.Scene one depicts daytime with harsh light interference, while scene two portrays nighttime under normal background conditions.The red boxes in the images zoom in on important target objects and textured background areas.Entropy 2024, 26, x FOR PEER REVIEW 21 of 33

Figure 8 .
Figure 8. Qualitative results for the MFNet dataset images.The images in the red boxes show texture details in the original images or magnified details of important targets.

Figure 8 .
Figure 8. Qualitative results for the MFNet dataset images.The images in the red boxes show texture details in the original images or magnified details of important targets.

Figure 9
Figure 9 displays the intuitive qualitative outcomes of two representative image pairs sourced from the M3FD dataset.The highlighted red boxes within the figure provide a closer look at crucial target objects and textured background regions present in the images.

Figure 9 .
Figure 9. Qualitative results for the M3FD dataset images.The images in the red boxes show texture details in the original images or magnified details of important targets.

Figure 9 .
Figure 9. Qualitative results for the M3FD dataset images.The images in the red boxes show texture details in the original images or magnified details of important targets.

Entropy 2024 , 33 Figure 10 .
Figure 10.Qualitative results of target detection in harsh light and high exposure scenes for infrared images, visible images, and fusion images generated by various algorithms.Each pair of lines represents the detection results for one scene.

Figure 10 .
Figure 10.Qualitative results of target detection in harsh light and high exposure scenes for infrared images, visible images, and fusion images generated by various algorithms.Each pair of lines represents the detection results for one scene.

Figure 10 .
Figure 10.Qualitative results of target detection in harsh light and high exposure scenes for infrared images, visible images, and fusion images generated by various algorithms.Each pair of lines represents the detection results for one scene.

Figure 11 .
Figure 11.Qualitative results of target detection in normal light scenes for infrared images, visible images, and fusion images generated by various algorithms.Each pair of lines represents the detection results for one scene.

Figure 11 .
Figure 11.Qualitative results of target detection in normal light scenes for infrared images, visible images, and fusion images generated by various algorithms.Each pair of lines represents the detection results for one scene.

Figure 12 .
Figure 12. Results of edge feature detection using different algorithms.The images in the red boxes show texture details in the original images or magnified details of important targets.

Figure 12 .
Figure 12. Results of edge feature detection using different algorithms.The images in the red boxes show texture details in the original images or magnified details of important targets.

33 Figure 13 .
Figure 13.The intermediate and final results of the HLEA module under three typical scenarios in normal and harsh light environments are displayed.The "Class Results" column shows the illumination classification results, "Intermediate Weight" presents the illumination decomposition and image quality estimation results, and "Perceptual Weight" displays the final output results.

Figure 13 .
Figure 13.The intermediate and final results of the HLEA module under three typical scenarios in normal and harsh light environments are displayed.The "Class Results" column shows the illumination classification results, "Intermediate Weight" presents the illumination decomposition and image quality estimation results, and "Perceptual Weight" displays the final output results.
results.Combining the intermediate and final results across the three typical scenarios, our designed HLEA performs as expected under normal conditions.

Figure 14 .
Figure 14.Module corresponding ablation experiments.The images in the red boxes show texture details in the original images or magnified details of important targets.

Figure 14 .
Figure 14.Module corresponding ablation experiments.The images in the red boxes show texture details in the original images or magnified details of important targets.
the features after com- the intermediate quantity obtained after global average pooling of infrared-visible features, mul(•) denotes matrix multiplication, T(•) denotes the matrix transpose operator, ϕ qa i ∈ R HW×M represents the intermediate query matrix, ϕ ak i ∈ R M×HW represents the intermediate key matrix, ϕ akv i ∈ R M×C ′ represents the inter- mediate weight matrix, v 3 (•) represents the operation of transforming matrix dimensions

Table 1 .
Quantitative results on the TNO dataset for selected comparative approaches.The results in red represent the best outcome, and those that are blue indicate suboptimal results.

Table 3 .
Quantitative results on the MFNet dataset for selected comparative approaches.The results in red represent the best outcome, and those that are blue indicate suboptimal results.

Table 4 .
Quantitative results on the M3FD dataset for selected comparative approaches.The results in red represent the best outcome, and those that are blue indicate suboptimal results.

Table 4 .
Quantitative results on the M3FD dataset for selected comparative approaches.The results in red represent the best outcome, and those that are blue indicate suboptimal results.

Table 5 .
Quantitative target detection results of infrared images, visible images, and fusion images obtained by various algorithms in scenes with harsh light backgrounds in the MFNet and RoadScene datasets.The red, blue, and green denote the best, second, and third values.

Table 6 .
Quantitative target detection results of infrared images, visible images, and fusion images obtained by various algorithms in scenes with normal backgrounds in the MFNet dataset.The red, blue, and green denote the best, second, and third values.Figures10 and 11, we provide visual detection examples in four scenarios: two in harsh light and overexposure environments and two in normal environments, demonstrating the intuitive merits of our fusion algorithm in promoting target detection.

Table 7 .
The quantitative analysis for the average value of three fusion metrics, i.e., F1_score, mAP@50, and mAP@[50:95] on MFNet and RoadScene dataset with the ablation studies of our methods.The red and blue denote the best and second value.