Article

Wave-Cross: Balancing Thermal Saliency and Visual Detail in Infrared–Visible Image Fusion

1 School of Integrated Circuits and Electronics, Beijing Institute of Technology, Beijing 100081, China
2 School of Integrated Circuits and Electronics, Beijing Institute of Technology, Zhuhai 519088, China
3 Tangshan Research Institute, Beijing Institute of Technology, Tangshan 063000, China
* Author to whom correspondence should be addressed.
Electronics 2026, 15(2), 321; https://doi.org/10.3390/electronics15020321
Submission received: 20 October 2025 / Revised: 5 January 2026 / Accepted: 8 January 2026 / Published: 11 January 2026
(This article belongs to the Section Artificial Intelligence)

Abstract

Infrared and visible image fusion (IVIF) integrates the thermal saliency of infrared images (IRs) with the structural details of visible images (VIs) to produce comprehensive scene representations. Existing methods often overemphasize one modality, leading to a loss of temperature readability or visual detail. To address this, we propose Wave-Cross, a wavelet-based fusion framework. Using the discrete wavelet transform (DWT), IR low-frequency sub-bands encode thermal distribution, while VI high-frequency sub-bands capture textural details. Cross-attention adaptively recombines these sub-bands, suppressing modality-specific noise and balancing complementary features. Additionally, we introduce a Heat-Consistency Loss, which enforces pixel-wise thermal ordering and local energy preservation in a self-supervised manner, ensuring that the fused image retains IR interpretability while enhancing VI sharpness. Experiments on the TNO, MSRS, and M3FD datasets demonstrate the effectiveness of the proposed method. Compared with state-of-the-art baselines, Wave-Cross achieves superior performance on objective metrics such as SD, AG, SCD, SF, CC, EN, NABF, and MS-SSIM, yielding clearer details and more stable thermal saliency under challenging interference conditions. These results highlight the framework’s potential for practical applications in surveillance, autonomous driving, and fault diagnosis.

1. Introduction

Due to the inherent physical limitations of imaging sensors, it is challenging to directly acquire a single image that simultaneously preserves high-resolution structural details and robust thermal information of a scene. Infrared and visible image fusion (IVIF) has therefore emerged as an effective solution to this problem. Infrared sensors are highly sensitive to thermal radiation and maintain stable imaging performance under low illumination or adverse environmental conditions [1]. However, the low spatial resolution and inherent noise characteristics of infrared sensors often lead to blurred structures and reduced visual quality. In contrast, visible-light cameras can capture high-resolution textures and fine spatial details, yet their performance degrades significantly under conditions of insufficient illumination, intense glare, or nighttime scenarios [2]. By integrating the complementary advantages of both modalities, IVIF aims to generate high-quality fused images that simultaneously retain thermal saliency and rich structural details, thereby benefiting various downstream tasks such as object detection [3], surveillance [4], remote sensing [5], and visual tracking [6]. Over the past decades, a wide range of IVIF methods have been proposed. Early approaches mainly rely on traditional fusion techniques, including multi-scale transforms [7,8], sparse representation [9], saliency-based modeling [10,11], and subspace learning [12]. Despite their merits, these handcrafted methods typically suffer from limited generalization and shallow feature representations. They depend extensively on manually engineered priors, which restrict their ability to capture complex cross-modality relationships and often lead to suboptimal fusion results in challenging conditions.
With the rapid advancement of deep learning, learning-based fusion methods have significantly improved IVIF performance. Convolutional neural network (CNN)-based architectures [13,14,15] enable end-to-end fusion by extracting hierarchical deep features. Generative adversarial network (GAN)-based approaches [16,17] promote more realistic fusion by leveraging adversarial training, whereas autoencoder fusion frameworks [18,19] learn reconstruction-driven representations that combine modality information more effectively. Although these methods achieve notable improvements, deep networks generally map features into abstract high-dimensional spaces, making interpretability difficult and complicating the design of precise, modality-aware fusion strategies. Moreover, many of these models lack explicit mechanisms to handle frequency-domain information or maintain thermal consistency, which limits their ability to fully exploit the complementary characteristics of infrared and visible images.
Early frequency-domain IVIF methods were dominated by multi-scale decomposition techniques such as the Laplacian pyramid, discrete wavelet transform (DWT), and curvelet transform. In particular, DWT explicitly separates low-frequency sub-bands that preserve global brightness/thermal energy from high-frequency sub-bands that capture edge and detail structures [20]. Despite their interpretability and resolution consistency, conventional DWT-based fusion strategies [21] rely on handcrafted weighting or selection rules across sub-bands before inverse reconstruction. Consequently, such fixed heuristics lack adaptability to scene diversity and struggle to reconcile cross-modal saliency differences.
Compounding this issue, many state-of-the-art methods tend to overemphasize either VI texture injection or IR saliency preservation. Texture-oriented strategies risk over-brightening and diminishing the intrinsic thermal contrast embedded in IR grayscale distributions, undermining applications that depend on accurate temperature differentials (e.g., fault diagnosis, early fire detection) [22]. Conversely, IR-focused methods often neglect fine visual structures, reducing interpretability and usability in human-centric perception tasks [23].
To address these problems, we propose a wavelet-based balanced fusion framework that explicitly leverages complementary modality priors:
DWT-based modal separation: IRs contribute salient low-frequency sub-bands, encoding thermal energy distribution; VIs provide high-frequency sub-bands rich in contours and textures. This decomposition enables modality-specific feature emphasis while maintaining resolution consistency.
Cross-scale feature recombination: The decomposed multi-frequency sub-bands from IR and VI branches interact through a bidirectional cross-attention mechanism. Infrared low-frequency queries guide visible high-frequency keys to inject structural textures selectively, while visible-to-infrared attention adaptively suppresses redundant or noisy details. The fused attention maps are further refined via scale-adaptive gating and channel reweighting, ensuring balanced energy distribution and preserving thermal saliency. This mechanism effectively recombines features across scales, achieving complementary detail enhancement without compromising the physical consistency of infrared information.
Heat-Consistency Loss: To enhance visible-light structural details while preserving the physical correctness of thermal radiation, we introduce a Heat-Consistency Loss that complements the commonly used reconstruction and perceptual objectives. This loss enforces infrared–visible fusion to maintain the monotonic rank ordering and local energy distribution inherent to infrared imagery, ensuring that the injected visible-domain textures reinforce structural clarity without distorting or suppressing true thermal cues.
Extensive experiments on TNO [24], MSRS [25], and M3FD [26] demonstrate that our method achieves superior performance across objective metrics (SD, AG, CC, SCD, EN, SF, NABF, MS-SSIM) and subjective evaluations. These results validate the effectiveness of our balanced design in harmonizing VI texture richness with IR saliency fidelity, advancing the interpretability and applicability of IVIF in real-world scenarios.

2. Related Work

2.1. Traditional and Wavelet-Based Fusion Methods

Traditional infrared and visible image fusion approaches generally follow a three-stage process: source transformation, coefficient-level fusion, and inverse transformation. Multi-scale transform methods—such as discrete wavelet transform, dual-tree complex wavelet transform, and Laplacian pyramid decomposition—aim to separate low-frequency structures and high-frequency textures for modality-specific fusion, where manually designed rules (e.g., coefficient selection, region energy, or local contrast maximization) are applied to combine sub-band information [27,28]. Beyond classical wavelet-based approaches, gradientlet- and contourlet-based decompositions [8] have also been proposed to jointly model approximate and residual layers, enabling more flexible fusion strategies aligned with the characteristics of each sub-band.
Sparse representation-based fusion methods [9] construct over-complete dictionaries from multimodal data and fuse sparse coefficients using predefined aggregation rules. Saliency-based techniques [10,11] compute saliency or activity maps to highlight infrared targets and visible edges, whereas subspace-based models [12,29] employ PCA or ICA to project multimodal images into low-dimensional latent spaces where intrinsic features are fused more effectively. Hybrid approaches [30] attempt to combine the strengths of multiple classical paradigms, such as saliency-guided frequency fusion or wavelet–sparse hybrid decomposition. However, these methods rely heavily on handcrafted priors and task-specific rules, limiting their generalizability in complex imaging scenarios.

2.2. Deep Learning-Based Fusion Methods

Deep neural networks have significantly advanced IVIF by enabling data-driven feature extraction and end-to-end fusion. Early CNN-based models leveraged pretrained backbones such as VGG19 or ResNet for hierarchical feature extraction [31,32], followed by handcrafted fusion strategies for integrating deep features. More advanced architectures, including DenseFuse [13], NestFuse [33], and encoder–decoder-based methods [24,34], introduced progressive feature extraction, multi-scale decomposition, and deep reconstruction modules to enhance representational power. Generative models have evolved from standard GAN-based fusion to conditional, multi-discriminator, and perceptually guided frameworks. Recent GAN variants incorporate contrastive constraints, cross-modal discriminators, and structural regularization to mitigate mode collapse and ensure balanced thermal–texture retention [16,35]. However, many of these models still depend on manually designed fusion rules and lack explicit mechanisms for cross-modal interaction.

2.3. Attention Mechanisms for Cross-Modality Learning

Attention mechanisms have become central in modern IVIF research due to their ability to adaptively select informative features across modalities. Recent works emphasize spatial-channel attention, frequency-domain attention, and transformer-based fusion, representing a significant shift from earlier CNN-only networks.
Spatial-channel attention modules have been refined into multi-head, selective, and context-adaptive units capable of dynamically adjusting modality contributions across scales [36]. Such designs enable more faithful integration of low-frequency thermal information and high-frequency structural details.
Frequency-domain attention has emerged as a powerful tool in multimodal and medical fusion applications, where attention weights are conditioned on wavelet or Fourier representations to emphasize salient frequency components [37]. These methods highlight the importance of aligning attention with physical properties of thermal imaging such as monotonic temperature–intensity relationships and local energy distribution, which purely spatial attention often fails to capture.
Transformers have further pushed IVIF toward global semantic reasoning and long-range cross-modal understanding. Recent transformer-based methods employ cross-view token exchange and multi-granularity complementary attention modules [38,39]. These designs provide a strong foundation for building fusion networks that are not only powerful in representation but also inherently modality-aware.
Compared to prior works, the proposed Wave-Cross framework introduces a frequency-domain, bidirectional cross-attention mechanism operating on wavelet-decomposed sub-bands and augments it with a thermal-consistency objective, enabling physically constrained, modality-specific, and scale-aware fusion in a unified transformer-style architecture.

3. Materials and Methods

Common cross-modal image fusion frameworks include single-step multi-layer network direct fusion frameworks and encoder–decoder frameworks. To ensure that the fused image can fully retain the features of the original infrared image, Wave-Cross adopts a multi-step algorithm structure of encoder–fusion module–decoder and proposes a wavelet transform-based cross-attention mechanism framework for image fusion. In the pre-trained encoding and decoding stages, encoders and decoders with the same structure but different parameters are designed for each modality image; in the subsequent fusion and decoding stages, the decoder used in pre-training is discarded, and an additional set of fusion-decoders is trained to efficiently coordinate the fusion and decoding stages. The key components involved in the entire framework include independent modality encoders, cross-modal multi-frequency band cross-attention modules, and additional multi-layer mixed feature decoders.

3.1. Overall Network Architecture

The overall structure of Wave-Cross is shown in Figure 1. The input images from infrared and visible-light modalities are processed to produce a fused image that retains the salient region information from both modalities.
In the encoder part, two independent modality encoders are used to fully encode information from different modalities without interfering with each other. The WA module (wavelet-based attention block) is employed to mix multi-modal features: first, the wavelet transform is used to decompose images from different modalities into sub-bands of different scales to capture global and detailed information separately, and cross-attention mechanisms are then utilized to establish associations between infrared and visible-light images at different scales. The WA module in the fusion module is dedicated to highlighting the texture details of visible-light images while preserving the structural information and edge features of infrared images. The distinction between WA-H and WA-L lies in their frequency focus: the former applies attention mechanisms to high-frequency wavelet sub-bands to extract visible-light textural details, while the latter applies attention mechanisms to low-frequency wavelet sub-bands to extract background information and brightness information from salient infrared regions. After the dual-modal features processed by the WA module are subjected to two cross-attention operations and added together, they pass through a fully connected layer, a GELU activation function, and layer normalization to obtain the final single-modal fused features. Residual structures are introduced between the encoder and decoder to retain features of the original modality images at different resolutions.

3.2. Encoder Architecture

Considering the significant differences in information between different modalities, it is necessary to design encoders with different parameters, that is, independent modality encoders, to extract features from different modalities. The architecture of the encoder is shown in Figure 2; it shares its architecture with the encoder in CrossFuse [39].
The shallow features obtained from the input image after convolution are subjected to a MaxPooling operation and then passed through a DenseBlock to retain more useful information. As the encoder deepens, the extracted deep features focus more on salient content. To enhance detail information and salient features, two residual structures are applied between the encoder and decoder.
The residual paths operate at fixed spatial scales and follow a one-to-one correspondence between encoder outputs and decoder inputs, similar in spirit to U-Net–style long skip connections. Let $E^{(k)} \in \mathbb{R}^{C_k \times H_k \times W_k}$ denote the feature map produced by the k-th encoder stage and $D^{(k)}$ represent the decoder feature at the corresponding scale. The residual connection is defined as:
$$D_{\theta}^{(k)} = D^{(k)} + \phi(E^{(k)})$$
where $\phi(\cdot)$ is a $1 \times 1$ convolution used only to align channel dimensions when necessary. The skip paths are strictly fixed and deterministic across training and inference. These connections preserve multi-scale structural cues extracted by the encoder and stabilize gradient propagation, ensuring that texture and thermal information are retained during fusion and reconstruction.
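The skip connection above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper names (`conv1x1`, `skip_add`), the tensor sizes, and the random 1 × 1 kernel are hypothetical and not taken from the authors' implementation.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution as a per-pixel channel mix.

    x: (C_in, H, W) feature map; weight: (C_out, C_in).
    """
    return np.einsum("oc,chw->ohw", weight, x)

def skip_add(decoder_feat, encoder_feat, weight=None):
    """Return D + phi(E); phi is the identity when no weight is given."""
    if weight is not None:
        encoder_feat = conv1x1(encoder_feat, weight)
    return decoder_feat + encoder_feat

rng = np.random.default_rng(0)
E = rng.standard_normal((64, 8, 8))          # encoder stage output (illustrative size)
D = rng.standard_normal((32, 8, 8))          # decoder feature at the same spatial scale
W_phi = 0.1 * rng.standard_normal((32, 64))  # hypothetical 1x1 kernel aligning 64 -> 32 channels
fused = skip_add(D, E, W_phi)                # shape (32, 8, 8)
```

Because the 1 × 1 convolution only mixes channels, the spatial size of the encoder feature is untouched, which is why each skip path has to connect stages of equal resolution.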

3.3. Fusion Module

The task of the fusion module is to add visible-light details to the infrared image without destroying the relative gray level relationship in the infrared image. To this end, the WA module employs wavelet transform to decompose the dual-modal features from the encoder into different frequency bands. The WA-H module, used for the visible-light branch, extracts the main structure and global features using multiple convolution layers in the low-frequency band. In the mid- and high-frequency bands, it employs self-attention mechanisms to extract high-frequency details in the horizontal, vertical, and diagonal directions. In contrast, the WA-L module, used for the infrared branch, operates in the opposite manner. It captures global and brightness features using self-attention mechanisms in the low-frequency band and extracts high-frequency texture details using multiple convolution layers in the mid- and high-frequency bands. Finally, in the fusion module, we design multiple cross-attention mechanisms to achieve efficient multi-frequency band cross-modal fusion and output the result to the final decoder.
Here, we will detail the most important WA module in the fusion stage, as shown in Figure 3.
Two feature branches from the encoder, representing the infrared and visible modalities, respectively, enter the WA-L and WA-H modules with different parameters. Within the WA module, we first apply LayerNorm to the encoder outputs of both modalities. We then apply a 2D discrete wavelet transform using the Haar wavelet to decompose each of the two input branches into four sub-bands. For the encoder feature maps $I^{(i)} \in \mathbb{R}^{B \times C \times H \times W}$ with $i \in \{ir, vi\}$, we use the Haar transform to decompose them into different sub-bands for subsequent feature extraction. The Haar wavelet transform filters consist of a low-pass filter L and a high-pass filter H, which can be represented as:
$$L = \frac{1}{\sqrt{2}} [1, 1]^{T}, \quad H = \frac{1}{\sqrt{2}} [1, -1]^{T}$$
After the wavelet transform, the input feature map is decomposed into four sub-bands:
$$\mathrm{DWT}(I^{(i)}) = \{ I_{LL}^{(i)}, I_{LH}^{(i)}, I_{HL}^{(i)}, I_{HH}^{(i)} \} \in \mathbb{R}^{B \times C \times \frac{H}{2} \times \frac{W}{2}}$$
where B is the batch size, C represents the number of channels, and H and W are the height and width of the feature map, respectively; N, used in the patch embedding below, denotes the number of patches. The low-frequency sub-band $I_{LL}^{(i)}$ retains the main structure and global information of the image, such as overall brightness, contours, and major textures. The horizontal high-frequency sub-band $I_{LH}^{(i)}$ preserves the detail information of the image in the horizontal direction, such as horizontal edges, horizontal textures, and abrupt changes in the horizontal direction. The vertical high-frequency sub-band $I_{HL}^{(i)}$ and the diagonal high-frequency sub-band $I_{HH}^{(i)}$ preserve the detail information of the image in the vertical and diagonal directions, respectively. DWT does not change the number of channels, and due to its biorthogonal property, no information is lost in the entire process, which meets our previously stated requirement that the fusion algorithm should fully retain infrared features.
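To make the decomposition and its losslessness concrete, the following is a minimal NumPy sketch of a single-level 2D Haar transform and its inverse. The function names are hypothetical, and the assignment of the LH and HL labels to the two mixed filters follows one common convention that may differ from the authors' code.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar DWT over the last two axes (H and W must be even)."""
    a = x[..., 0::2, 0::2]  # top-left pixel of every 2x2 block
    b = x[..., 0::2, 1::2]  # top-right
    c = x[..., 1::2, 0::2]  # bottom-left
    d = x[..., 1::2, 1::2]  # bottom-right
    # Two 1D filters scaled by 1/sqrt(2) give an overall factor of 1/2.
    ll = (a + b + c + d) / 2.0
    lh = (a - b + c - d) / 2.0  # LH/HL labelling is a convention (assumption)
    hl = (a + b - c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuilds the full-resolution array exactly."""
    shape = ll.shape[:-2] + (2 * ll.shape[-2], 2 * ll.shape[-1])
    out = np.empty(shape, dtype=ll.dtype)
    out[..., 0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    out[..., 0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    out[..., 1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    out[..., 1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return out

rng = np.random.default_rng(0)
feat = rng.standard_normal((2, 3, 8, 8))  # (B, C, H, W), illustrative size
bands = haar_dwt2(feat)                   # four arrays of shape (2, 3, 4, 4)
recon = haar_idwt2(*bands)                # matches feat up to rounding
```

The channel count is unchanged, each sub-band has half the height and width, and the round trip recovers the input, which is the property the text relies on for retaining infrared features.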
Taking the infrared branch using the WA-L module as an example, for the low-frequency sub-band $I_{LL}^{(i)}$, we perform patch embedding to transform it into $I_{LL}^{(i)} \in \mathbb{R}^{B \times C \times N_{LL} \times h \times w}$, where $N_{LL}$ is the number of patches of sub-band $I_{LL}^{(i)}$, and h = 4 and w = 4 are the height and width of each patch, respectively. The patch-embedded feature $I_{LL}^{(i)}$ is transposed and reshaped into a contiguous token sequence $I_{LL}^{(i)} \in \mathbb{R}^{B \times N_{LL} \times d}$ with d = C × h × w = 512 before entering the attention module. Then, we employ the self-attention mechanism to extract and preserve the global background features.
$$[Q_{LL}^{(i)}, K_{LL}^{(i)}, V_{LL}^{(i)}] = I_{LL}^{(i)} W_{qkv}^{(i)}$$
$$X_{LL}^{(i)} = X_{LL}^{(i)} + \mathrm{layernorm}\left(\mathrm{softmax}\left(\frac{Q_{LL}^{(i)} (K_{LL}^{(i)})^{T}}{\sqrt{d}}\right) V_{LL}^{(i)}\right)$$
$$X_{LL}^{(i)} = X_{LL}^{(i)} + \mathrm{MLP}(\mathrm{layernorm}(X_{LL}^{(i)}))$$
In these equations, $W_{qkv}^{(i)}$ is a learnable transformation matrix. $Q_{LL}^{(i)}$, $K_{LL}^{(i)}$, and $V_{LL}^{(i)}$ are the query, key, and value projections of the low-frequency sub-band of a given modality. d represents the feature dimension obtained from patch embedding. $\mathrm{layernorm}(\cdot)$ denotes the layer normalization operation. $\mathrm{MLP}(\cdot)$ is a multi-layer perceptron.
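The patch embedding and the two residual updates can be sketched as follows. This is a hedged illustration rather than the authors' code: the residual stream is assumed to start from the token sequence, the MLP is assumed to be a two-layer ReLU network, layer normalization is used without learnable scale and shift, and all sizes and weights are placeholders.

```python
import numpy as np

def patchify(x, p=4):
    """(B, C, H, W) -> (B, N, C*p*p) tokens from non-overlapping p x p patches."""
    B, C, H, W = x.shape
    x = x.reshape(B, C, H // p, p, W // p, p)
    x = x.transpose(0, 2, 4, 1, 3, 5)  # (B, H/p, W/p, C, p, p)
    return x.reshape(B, (H // p) * (W // p), C * p * p)

def layernorm(x, eps=1e-5):
    """Normalize over the last axis (no learnable affine in this sketch)."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def self_attention_block(tokens, w_qkv, w1, w2):
    """Residual self-attention followed by a residual MLP."""
    d = tokens.shape[-1]
    q, k, v = np.split(tokens @ w_qkv, 3, axis=-1)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))
    x = tokens + layernorm(attn @ v)
    hidden = np.maximum(layernorm(x) @ w1, 0.0)  # ReLU MLP is an assumption
    return x + hidden @ w2

rng = np.random.default_rng(0)
ll = rng.standard_normal((1, 8, 8, 8))  # low-frequency sub-band (B, C, H/2, W/2)
tokens = patchify(ll)                   # (1, 4, 128): N = 4 patches, d = 8 * 4 * 4
w_qkv = 0.05 * rng.standard_normal((128, 384))
w1 = 0.05 * rng.standard_normal((128, 256))
w2 = 0.05 * rng.standard_normal((256, 128))
out = self_attention_block(tokens, w_qkv, w1, w2)
```

Each token here is one flattened 4 × 4 patch across all channels, so attention relates whole regions of the low-frequency map to one another rather than individual pixels.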
The high-frequency sub-bands are combined into $I_{H}^{(i)} = \mathrm{cat}_0(I_{LH}^{(i)}, I_{HL}^{(i)}, I_{HH}^{(i)}) \in \mathbb{R}^{3B \times C \times \frac{H}{2} \times \frac{W}{2}}$, where $\mathrm{cat}_0$ refers to concatenation along the 0th dimension. First, the high-frequency details are amplified by a convolution block containing LeakyReLU.
$$X_{H}^{(i)} = \mathrm{LeakyReLU}(\mathrm{BN}(\mathrm{conv}(I_{H}^{(i)})))$$
$$X_{H}^{(i)} = \mathrm{BN}(\mathrm{conv}(X_{H}^{(i)}))$$
Specifically, $\mathrm{conv}(\cdot)$ denotes a 2D convolution with a kernel size of $3 \times 3$, and BN stands for batch normalization. After the entire WA module operation, the output is reconstructed using the inverse wavelet transform (IWT).
$$X_{output}^{(i)} = \mathrm{IWT}(X_{LL}^{(i)}, X_{H}^{(i)})$$
The output $X_{output}^{(i)} \in \mathbb{R}^{B \times C \times H \times W}$ is converted into $X_{output}^{(i)} \in \mathbb{R}^{B \times C \times N \times h \times w}$ after patch embedding, reshaped into $X_{output}^{(i)} \in \mathbb{R}^{B \times N \times d}$, and then input into the next two cross-attention mixing modules, as shown in Figure 1. Concretely, given an encoder feature map $I \in \mathbb{R}^{B \times C \times H \times W}$, the DWT produces four sub-bands $I_{LL}, I_{LH}, I_{HL}, I_{HH} \in \mathbb{R}^{B \times C \times \frac{H}{2} \times \frac{W}{2}}$. The low-frequency sub-band is patch-embedded into $F \in \mathbb{R}^{B \times N \times d}$, while the IWT reconstructs a fused feature map $F_{fused} \in \mathbb{R}^{B \times C \times H \times W}$, ensuring consistent tensor shapes throughout the WA module.
In the WA-L module, the high-frequency sub-bands (LH, HL, and HH) are refined using a two-layer convolutional block after wavelet decomposition. Specifically, the input channels are first expanded from C to 2C through a 3 × 3 convolution with stride 1 and padding 1, followed by batch normalization and a LeakyReLU activation to enhance local contrast and suppress noise. A second 3 × 3 convolution then reduces the channels from 2C back to C, also followed by batch normalization. This configuration, which directly corresponds to our PyTorch (version: 2.5.1) implementation, effectively preserves fine textures while stabilizing feature distributions. In the concrete implementation, C is set to 128.
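The two-layer refinement block can be sketched in NumPy as below. The paper's implementation is in PyTorch; this stand-in makes several labeled assumptions: batch normalization uses batch statistics without learnable affine parameters, the LeakyReLU slope is 0.01, and the channel count and random kernels are illustrative rather than the reported C = 128.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv3x3(x, w):
    """3x3 cross-correlation, stride 1, zero padding 1.

    x: (B, C_in, H, W); w: (C_out, C_in, 3, 3).
    """
    xp = np.pad(x, ((0, 0), (0, 0), (1, 1), (1, 1)))
    win = sliding_window_view(xp, (3, 3), axis=(2, 3))  # (B, C_in, H, W, 3, 3)
    return np.einsum("bchwij,ocij->bohw", win, w)

def batchnorm(x, eps=1e-5):
    """Per-channel normalization with batch statistics (no affine; assumption)."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def refine_high_freq(lh, hl, hh, w_expand, w_reduce):
    """Stack LH/HL/HH along the batch axis, refine, and split them again."""
    x = np.concatenate([lh, hl, hh], axis=0)         # (3B, C, H/2, W/2)
    x = leaky_relu(batchnorm(conv3x3(x, w_expand)))  # C -> 2C
    x = batchnorm(conv3x3(x, w_reduce))              # 2C -> C
    return np.split(x, 3, axis=0)

rng = np.random.default_rng(0)
C = 4                                                # illustrative channel count
lh, hl, hh = (rng.standard_normal((2, C, 6, 6)) for _ in range(3))
w_expand = 0.1 * rng.standard_normal((2 * C, C, 3, 3))
w_reduce = 0.1 * rng.standard_normal((C, 2 * C, 3, 3))
refined = refine_high_freq(lh, hl, hh, w_expand, w_reduce)
```

Concatenating along the batch axis means the three directional sub-bands share one set of convolution weights, and padding 1 keeps the spatial size fixed so the refined sub-bands can go straight into the inverse wavelet transform.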
For the low-frequency branch, we adopt a custom head-partitioned dot-product attention mechanism. The feature dimension dim is evenly divided into 16 partitions, and attention is computed independently within each partition by linearly projecting the input into Q, K, and V spaces and applying scaled dot-product attention with a scaling factor of $\mathrm{dim}^{-1/2}$. The outputs are aggregated through a linear projection with a dropout rate between 0 and 0.1. After attention mixing and high-frequency refinement, all four sub-bands are reconstructed via the inverse wavelet transform.
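A minimal sketch of partitioned attention as described here is given below. It follows the text in scaling by the full feature dimension rather than the per-partition size, omits dropout, and uses hypothetical names and sizes; it should be read as an interpretation, not as the authors' module.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def partitioned_attention(x, w_q, w_k, w_v, w_o, parts=16):
    """Dot-product attention computed independently in `parts` feature slices."""
    B, N, dim = x.shape
    assert dim % parts == 0, "dim must be divisible by the number of partitions"
    hd = dim // parts

    def split(t):
        # (B, N, dim) -> (B, parts, N, hd)
        return t.reshape(B, N, parts, hd).transpose(0, 2, 1, 3)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = (q @ k.transpose(0, 1, 3, 2)) * dim ** -0.5  # scale by dim, as in the text
    mixed = softmax(scores) @ v                           # (B, parts, N, hd)
    merged = mixed.transpose(0, 2, 1, 3).reshape(B, N, dim)
    return merged @ w_o                                   # output projection

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 64))  # (B, N, dim), illustrative sizes
w_q, w_k, w_v, w_o = (0.1 * rng.standard_normal((64, 64)) for _ in range(4))
y = partitioned_attention(x, w_q, w_k, w_v, w_o)
```

Each partition only sees its own slice of the projected features, so information is exchanged between partitions solely through the final linear projection.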
WA-H and WA-L share the same architecture but maintain independent parameter sets for the visible and infrared branches.
The cross-modal interaction after the WA module is implemented by two sequential cross-attention stages operating on the same patch-level tokens rather than on different wavelet sub-bands. Let $X_{IR}, X_{VI} \in \mathbb{R}^{B \times N \times d}$ denote the patch-embedded infrared and visible features after the WA module.
In the first cross-attention stage, we perform bidirectional cross-attention:
$$Y_{IR \to VI} = \mathrm{CA}_1(X_{IR}, X_{VI}) = \mathrm{layernorm}\left(\mathrm{softmax}\left(\frac{X_{IR} W_Q^{(1)} (X_{VI} W_K^{(1)})^{T}}{\sqrt{d}}\right) X_{VI} W_V^{(1)}\right)$$
$$Y_{VI \to IR} = \mathrm{CA}_2(X_{VI}, X_{IR}) = \mathrm{layernorm}\left(\mathrm{softmax}\left(\frac{X_{VI} W_Q^{(2)} (X_{IR} W_K^{(2)})^{T}}{\sqrt{d}}\right) X_{IR} W_V^{(2)}\right)$$
where $W_Q^{(i)}$, $W_K^{(i)}$, and $W_V^{(i)}$ are learnable transformation matrices of $\mathrm{CA}_i$. The first argument of $\mathrm{CA}_i$ is used as queries and the second as keys/values. The two outputs are then aggregated by element-wise addition to form a shared cross-modal pattern
$$X_{cross} = Y_{IR \to VI} + Y_{VI \to IR}$$
In the second cross-attention stage, this shared pattern is used as the query to further refine both modalities:
$$Z_{IR} = \mathrm{CA}_3(X_{cross}, X_{IR}), \quad Z_{VI} = \mathrm{CA}_4(X_{cross}, X_{VI})$$
and the final fused tokens are obtained by
$$Z_{out} = Z_{IR} + Z_{VI}$$
followed by a two-layer feed-forward network with GELU activation and layer normalization:
$$F_{out} = \mathrm{layernorm}(\mathrm{fc}(\mathrm{gelu}(\mathrm{fc}(Z_{out}))))$$
In summary of the cross-attention stage, the first stage symmetrically exchanges information between infrared and visible tokens, while the second stage uses the fused cross-modal pattern to re-attend to and jointly refine both modalities. All four cross-attention blocks operate on the same token sequence (same sub-bands after WA), and the two stages are applied sequentially.
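The two stages can be traced end to end with the NumPy sketch below. It mirrors the equations of this subsection under simplifying assumptions: single-head attention, layer normalization without learnable affine parameters, the tanh approximation of GELU, bias-free fully connected layers, and random placeholder weights.

```python
import numpy as np

def layernorm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def cross_attention(query_src, kv_src, p):
    """First argument supplies queries, second supplies keys and values."""
    d = query_src.shape[-1]
    q, k, v = query_src @ p["q"], kv_src @ p["k"], kv_src @ p["v"]
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))
    return layernorm(attn @ v)

def two_stage_fusion(x_ir, x_vi, ca, fc1, fc2):
    # Stage 1: bidirectional exchange, summed into a shared pattern.
    x_cross = cross_attention(x_ir, x_vi, ca[0]) + cross_attention(x_vi, x_ir, ca[1])
    # Stage 2: the shared pattern queries each modality again.
    z_out = cross_attention(x_cross, x_ir, ca[2]) + cross_attention(x_cross, x_vi, ca[3])
    # Two-layer feed-forward network with GELU, then layer normalization.
    return layernorm(gelu(z_out @ fc1) @ fc2)

rng = np.random.default_rng(0)
B, N, d = 2, 6, 32                    # illustrative sizes
x_ir = rng.standard_normal((B, N, d))
x_vi = rng.standard_normal((B, N, d))
ca = [{name: 0.1 * rng.standard_normal((d, d)) for name in ("q", "k", "v")}
      for _ in range(4)]              # CA1..CA4 each have their own weights
fc1 = 0.1 * rng.standard_normal((d, 2 * d))
fc2 = 0.1 * rng.standard_normal((2 * d, d))
f_out = two_stage_fusion(x_ir, x_vi, ca, fc1, fc2)
```

Both modalities keep the same token count N throughout, which is what allows the outputs of the four cross-attention blocks to be added element-wise.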

3.4. Decoder Architecture

The decoder used in the fusion stage is shown in Figure 4. After the first and fifth convolutions, residual structures are employed to mix the features from the pre-trained encoders of the two modalities with the features from the decoding stage in order to preserve the important features of the original modalities before mixing.
As shown in Figure 4, the decoder of the fusion stage receives three inputs: the fused mid-level feature $input \in \mathbb{R}^{B \times C \times H \times W}$ and the modality-specific deep features from the infrared decoder and visible decoder. These three tensors are first summed to form a tensor of shape (B, C, H, W), which is processed by a convolution layer (orange block) to refine the channel size C. A subsequent convolution layer reduces the channels to C/2, followed by an upsampling layer that doubles the spatial resolution. The output then passes through a three-stage convolutional block, which sequentially performs convolution $(C/2 \to C/2)$ → upsampling, convolution $(C/2 \to C/4)$ → upsampling, and a final convolution $(C/4 \to C_{out})$, progressively increasing spatial resolution and reducing channel dimensionality.
During reconstruction, shallow features from both encoders—infrared shallow feature and visible shallow feature—are injected through skip connections (green and purple arrows) and concatenated with the decoder input.
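The channel and resolution schedule of the fusion decoder can be checked with a small shape trace. This sketch only tracks tensor shapes as described in the text; the function name, the single-channel output (`c_out=1`), and the example input size are assumptions, and the skip-connection inputs are not modeled.

```python
def trace_fusion_decoder(c, h, w, c_out=1):
    """Return (step, channels, height, width) after each decoder step."""
    plan = [
        ("conv C->C", c, 1),
        ("conv C->C/2", c // 2, 1),
        ("upsample x2", c // 2, 2),
        ("conv C/2->C/2", c // 2, 1),
        ("upsample x2", c // 2, 2),
        ("conv C/2->C/4", c // 4, 1),
        ("upsample x2", c // 4, 2),
        ("conv C/4->C_out", c_out, 1),
    ]
    trace = [("sum of three inputs", c, h, w)]
    for step, channels, scale in plan:
        h, w = h * scale, w * scale  # only upsampling changes the resolution
        trace.append((step, channels, h, w))
    return trace

# Example: C = 128 channels on a 16 x 16 grid (illustrative input size).
for row in trace_fusion_decoder(128, 16, 16):
    print(row)
```

With three upsampling layers the spatial size grows by a factor of eight in each dimension while the channels shrink from C to C/4 before the output convolution.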
The pre-training encoder stage is shown in Figure 5, where the decoder structures used in the two modalities are the same, but the parameters are different; a new decoder is used in the fusion stage to decode the mixed features. The synchronous fusion and decoding enable the output mixed image to present the significant temperature-related features from the original infrared image, as well as the fine texture features from the visible-light image.
As shown in Figure 5, the module takes as inputs the current decoder feature $input \in \mathbb{R}^{B \times C \times H \times W}$ and the shallow encoder feature $F \in \mathbb{R}^{B \times C \times H \times W}$. As illustrated, the input feature first passes through a convolution layer, which reduces the channel dimension from C to C/2, followed by an upsampling layer that doubles the spatial resolution. The output is then processed by a three-stage convolutional block consisting of convolution $(C/2 \to C/2)$ → upsampling and convolution $(C/2 \to C/4)$ → upsampling. Finally, a convolution $(C/4 \to out)$ maps the fused feature to the reconstruction output. This sequence progressively increases spatial resolution while reducing channel dimensionality, producing features of size.

3.5. Loss Function

3.5.1. The Loss Function Used During Encoder Training

To independently encode features from different modalities and ensure thorough mixing, our method employs distinct loss functions during the pre-training of the encoder and the training of the fusion stage. The loss function used during encoder training consists of two parts and can be written as follows:
$$L_{Encoder} = \lVert I_e - I_c \rVert + \varepsilon \, \mathrm{SSIM}(I_e, I_c), \quad c \in \{ir, vi\}$$
where $I_e$ is the reconstructed image for a given modality (infrared or visible-light), $I_c$ is the corresponding infrared or visible-light input image, and $\varepsilon$ is a hyperparameter.

3.5.2. The Loss Function Used During the Training of Mixing Part

Since the fused image should contain more complementary features and less redundant information from the different modalities, a novel loss function is proposed to train our network.
The loss function $L_{all}$ consists of two parts and can be written as follows:
$$L_{all} = \alpha_1 L_{int} + \alpha_2 L_{heat}$$
$$L_{int} = \frac{1}{HW} \lVert I_{out} - \max(I_{ir}, I_{vi}) \rVert$$
Specifically, $L_{int}$ measures the main part of the mixed image, such as illumination and contour information; the Heat-Consistency Loss $L_{heat}$ measures how readable infrared temperature differences remain in the fused image, so that the fused image is visually clear while still retaining an interpretable thermal perception capability. The specific calculation method of $L_{heat}$ is described in Section 3.5.3.
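As a small worked example, the intensity term and the weighted sum can be written as below. The norm is assumed here to be the L1 norm, which the text does not state, and the heat term is passed in as a precomputed number because its definition comes later; function names and default weights are hypothetical.

```python
import numpy as np

def intensity_loss(fused, ir, vi):
    """Mean absolute gap between the fused image and the pixel-wise maximum
    of the two sources (L1 norm is an assumption)."""
    target = np.maximum(ir, vi)
    return np.abs(fused - target).mean()  # mean over H*W equals the 1/(HW) factor

def total_loss(l_int, l_heat, alpha1=1.0, alpha2=1.0):
    """Weighted sum of the intensity and heat-consistency terms."""
    return alpha1 * l_int + alpha2 * l_heat

ir = np.array([[0.2, 0.9], [0.5, 0.1]])
vi = np.array([[0.6, 0.3], [0.5, 0.4]])
target = np.maximum(ir, vi)               # [[0.6, 0.9], [0.5, 0.4]]
perfect = intensity_loss(target, ir, vi)  # 0.0 when the fused image equals the target
shifted = intensity_loss(target + 0.1, ir, vi)
```

The pixel-wise maximum pulls the fused image toward whichever modality is brighter at each location, which is why a separate term is needed to protect the infrared ordering.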

3.5.3. Heat-Consistency Loss Function

In practical infrared image acquisition, sensor noise and nonlinear responses may exist, but the grayscale intensity of pixels generally maintains an approximate monotonic relationship with temperature. Thus, the relative magnitude of pixel values can be regarded as ranking information for thermal levels. This property is crucial for tasks such as human detection or fault diagnosis. However, most current fusion algorithms, especially those dominated by visible-detail compensation, emphasize visual clarity while neglecting thermal fidelity. As a result, fused images often suffer from temperature inversion, grayscale drift, or energy leakage: the thermal ordering of pixels may be reversed, or local energy may become blurred. Such artifacts reduce the interpretability of fused images in temperature-sensitive applications.
To address the retention of thermal radiation information in infrared images, we propose a Heat-Consistency Loss function. This loss function, in a self-supervised manner, explicitly guides the model during training to preserve the temperature ranking structure, local energy distribution, and the significance of heat source regions in the infrared images. By doing so, it maintains the thermal radiation information in a physically reasonable way, ensuring that the fused image achieves visual clarity while maintaining interpretable thermal perception.
The overall loss function is composed of two complementary sub-terms—the Weighted Ranking Preservation Loss and the Local Energy Preservation Loss—which together constrain the fused image from both ordinal and energetic perspectives, and are jointly defined as follows:
$$L_{heat} = \lambda_1 L_{rank} + \lambda_2 L_{energy}$$
where λ1 and λ2 are the balance coefficients of the two sub-losses.
From a theoretical standpoint, this design is grounded in the physical interpretability of infrared imaging. In infrared images, pixel intensity correlates monotonically with object temperature; thus, preserving the ranking order of pixel values ensures that the fused output maintains consistent thermal semantics—a form of monotonic mapping constraint. However, ranking preservation alone cannot guarantee radiometric accuracy, as it may allow global brightness drift. Therefore, the local energy preservation loss complements this by enforcing energy conservation within local neighborhoods, ensuring that average intensity distributions remain consistent with the thermal domain.
Together, these two constraints establish a balance between ordinal consistency and radiometric fidelity, providing both perceptual and physical interpretability for the fusion process. To further validate the rationality of this formulation, the mathematical definitions and derivations of the two sub-loss components are presented below.
a. Weighted Ranking Preservation Loss
To maintain the relative ranking relationship of pixel grayscale values in infrared images, we constructed a pixel pair weighted ranking preservation loss:
$$L_{rank} = \frac{1}{|\rho|} \sum_{(i,j) \in \rho} \omega_{i,j} \cdot \mathrm{ReLU}\left(-\left(I_i^{IR} - I_j^{IR}\right)\left(I_i^{F} - I_j^{F}\right)\right)$$
In this term, $I_i^{IR}$ and $I_i^{F}$ represent the grayscale values of the infrared image and the fused image at pixel i, respectively. If the infrared image has a higher grayscale value at pixel i than at pixel j but the fused image has a lower value (i.e., the ordering is reversed), the loss term is greater than zero. This loss penalizes inconsistent ordering, encouraging the fused image to maintain a consistent relative temperature structure. The set ρ typically consists of randomly sampled pixel pairs or neighboring pixel pairs within the same image, ensuring low computational cost and local robustness. The saliency-guided weighting factor $\omega_{i,j}$, which highlights the importance of heat source regions, is calculated as follows:
$$\omega_{i,j} = \frac{S_i + S_j}{2}$$
where $S_i$ and $S_j$ are the saliency values at pixels i and j of the pair (i, j), respectively. The pixel saliency map highlights the important heat source regions in the infrared image. It is generated from the grayscale values of the infrared image and is defined as follows:
$$S = \sigma\left(k \cdot \left(I^{IR} - t\right)\right)$$
Here, $\sigma(\cdot)$ denotes the Sigmoid function; k is the stretching factor that controls the steepness of the response boundary (set to 10); t is the saliency threshold that separates “heat source” regions from the background (set to 0.6); and $I^{IR}$ represents the normalized grayscale values of the infrared image. The saliency-weighting mechanism ensures that a greater loss is incurred when ranking errors occur in heat source regions, guiding the model to prioritize the ranking consistency of key target regions.
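The saliency map and the weighted ranking term can be illustrated with a minimal NumPy sketch. It assumes the penalty takes the form ReLU(−(I_i^IR − I_j^IR)(I_i^F − I_j^F)), which matches the description that a reversed ordering yields a positive loss. The function names, the flat-index pair format, and the uniform random pair sampler are illustrative assumptions, not the authors' PyTorch implementation.

```python
import numpy as np

def saliency_map(ir, k=10.0, t=0.6):
    """Sigmoid saliency S = sigma(k * (I_IR - t)) for an IR image normalized to [0, 1]."""
    return 1.0 / (1.0 + np.exp(-k * (ir - t)))

def sample_pairs(num_pixels, num_pairs, seed=0):
    """Hypothetical sampler: uniformly random flat pixel-index pairs (i, j)."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, num_pixels, size=(num_pairs, 2))

def rank_loss(ir, fused, pairs, k=10.0, t=0.6):
    """Saliency-weighted ranking loss averaged over the given pixel pairs.

    ir, fused: 2D arrays with values in [0, 1].
    pairs: integer array of shape (P, 2) holding flat pixel indices.
    """
    ir_flat, fused_flat = ir.ravel(), fused.ravel()
    s = saliency_map(ir_flat, k, t)
    i, j = pairs[:, 0], pairs[:, 1]
    d_ir = ir_flat[i] - ir_flat[j]        # thermal ordering in the IR image
    d_f = fused_flat[i] - fused_flat[j]   # ordering in the fused image
    w = 0.5 * (s[i] + s[j])               # omega_ij = (S_i + S_j) / 2
    # A negative product means the ordering was reversed, so ReLU(-product) > 0.
    return float(np.mean(w * np.maximum(0.0, -d_ir * d_f)))
```

Under this form, any fused image that is a monotonically increasing function of the infrared image incurs zero ranking loss, while an intensity-inverted image is penalized most heavily on pairs inside hot regions.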
b. Local Energy Preservation Loss
To maintain the energy consistency within local regions of the image, that is, to ensure that the total amount of thermal radiation does not shift, we designed the following energy preservation term:
$$L_{energy} = \frac{1}{N} \sum_{k=1}^{N} \left(\mu_k^{IR} - \mu_k^{F}\right)^2$$
where N is the number of local windows, $\mu_k^{IR}$ represents the grayscale mean of the k-th local window in the infrared image, and $\mu_k^{F}$ denotes the grayscale mean of the corresponding window in the fused image. Essentially, this term penalizes brightness changes at the window level to prevent heat leakage, overexposure, or underexposure distortion in the fused image. It ensures that the “thermal intensity” in the local regions of the fused image is consistent with that of the infrared image, providing a reliable basis for subsequent tasks, such as local temperature rise detection, that rely on regional temperature judgments.
In our implementation, the local energy is computed using a fixed window of size w × w with w = 7. For each pixel location (i, j), we apply a sliding 7 × 7 averaging window centered at (i, j) with a stride of 1, resulting in fully overlapping neighborhoods across the entire image. Zero-padding of ⌊w/2⌋ = 3 pixels is used at the borders so that the local energy is defined for all spatial positions. In practice, this operation is implemented as a 2D convolution with a normalized box filter of size 7 × 7 (all ones divided by $w^2$) applied to both the infrared image and the fused image, and the Local Energy Preservation Loss is defined as the mean squared error between the two resulting local mean maps.
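The box-filter computation described above can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' code: the function names are assumptions, and an integral image replaces the 2D convolution while producing the same zero-padded 7 × 7 local means.

```python
import numpy as np

def local_mean(img, w=7):
    """Local mean under a w x w box filter, stride 1, zero-padding of w // 2."""
    pad = w // 2
    padded = np.pad(img.astype(np.float64), pad, mode="constant")
    # Integral image with a leading zero row/column, so each window sum
    # is obtained from four corner look-ups.
    integral = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    integral[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    h, wd = img.shape
    window_sum = (integral[w:w + h, w:w + wd] - integral[:h, w:w + wd]
                  - integral[w:w + h, :wd] + integral[:h, :wd])
    return window_sum / (w * w)

def energy_loss(ir, fused, w=7):
    """Mean squared error between the local mean maps of the IR and fused images."""
    return float(np.mean((local_mean(ir, w) - local_mean(fused, w)) ** 2))
```

Because the padding is zero-valued and the divisor is always $w^2$, local means near the border are attenuated in both images alike; for an all-ones image the corner value is 16/49 rather than 1.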

4. Results

This section presents comparative experiments that evaluate the fusion performance of the proposed method. After introducing the experimental setup, we report several ablation studies that explore the effects of different elements of the proposed fusion network. Multiple performance metrics are adopted to assess the fusion performance objectively. Our network was implemented on an NVIDIA GPU (RTX 4060, designed by NVIDIA Corporation, headquartered in Santa Clara, California, USA) and programmed using PyTorch 2.5.1.

4.1. Experiment Settings

Both training stages, the first (two autoencoders) and the second (WA and decoder), used 1084 pairs of infrared and visible-light images selected from the MSRS dataset. All images were converted to grayscale and resized to 256 × 256. The iteration number and batch size were set to 32 and 8, respectively. Training used the cosine annealing strategy, with the initial learning rate set to 0.01 and the minimum learning rate set to 0.0001.
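For reference, the standard closed-form cosine annealing schedule can be sketched as below, using the initial and minimum learning rates stated above. Treating the schedule length as 32 steps is an assumption made only for illustration; the text does not specify whether the scheduler advances per epoch or per iteration, and the function name is hypothetical.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max=0.01, lr_min=0.0001):
    """Closed-form cosine annealing from lr_max (step 0) to lr_min (step total_steps)."""
    cos_term = 1.0 + math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term

# Assumed schedule length of 32 steps (illustrative only).
schedule = [cosine_annealing_lr(s, 32) for s in range(33)]
```

The schedule starts at 0.01, passes through the midpoint value 0.00505 halfway, and ends at 0.0001.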
In the first stage, the encoder was pre-trained for 16 epochs, followed by 8 epochs of fine-tuning in the second fusion stage. During encoder training, the reconstruction loss coefficient was set to 10,000 to enhance the representational capacity of low-level features. In the mixed fusion stage, the total loss function was defined as a weighted combination of multiple components, with balance coefficients α1 = 0.5 and α2 = 1, respectively. For the Heat-Consistency Loss, both sub-loss coefficients were fixed at 1, providing equal emphasis on ranking preservation and energy consistency. In the Local Energy Preservation Loss, the neighborhood window size was set to 7 × 7, which effectively captures regional thermal variations while maintaining computational efficiency. These hyperparameter settings were determined empirically to achieve a balance between visual clarity, thermal fidelity, and training stability.
All convolutional kernels used a 3 × 3 window with stride 1 and padding 1. Batch normalization and LeakyReLU (slope = 0.1) were applied throughout the fusion network.
The test images were selected from the TNO, MSRS, and M3FD datasets, containing 42, 361, and 300 pairs of infrared and visible-light images, respectively. The 42 pairs of cross-modal images in the TNO dataset mainly cover various military scenarios, including indoor and outdoor, night, complex illumination, and adverse weather conditions. The 361 pairs of cross-modal images in the MSRS dataset are precisely registered road scene images. The 300 pairs of cross-modal images in the M3FD dataset include not only road scenes but also a large number of university campus scenes. The visible-light images in the TNO dataset are grayscale, while those in the MSRS and M3FD datasets are three-channel RGB images.
To evaluate the fusion performance of our proposed network, six state-of-the-art fusion methods are chosen: an image fusion method based on guided filtering (GIFuse); a fusion method based on masked autoencoders (MaeFuse); a CNN-based method (PIAfusion); a Swin Transformer-based fusion method (SwinFusion); a cross-attention-based network (CrossFuse); a fusion method utilizing wavelet transform and Transformer (YDTR).
To objectively assess the quality of fused images, eight commonly used metrics are employed. Each captures different aspects of information preservation, structural fidelity, and visual quality:
Standard Deviation (SD): Reflects the overall contrast of the fused image. Higher values indicate stronger brightness variation and richer visual information.
Average Gradient (AG): Measures edge sharpness and detail clarity. A larger AG suggests that the fused image preserves more fine structures.
Sum of Correlations of Differences (SCD): Quantifies the amount of complementary information from the source images that is transferred into the fused result. Higher values imply better information integration.
Entropy (EN): Evaluates the amount of information contained in the fused image. A higher entropy indicates richer detail and more effective feature preservation.
Spatial Frequency (SF): Reflects the overall activity level in the image by analyzing intensity variations in horizontal and vertical directions. Larger SF values correspond to sharper and more textured images.
Correlation Coefficient (CC): Measures the correlation between the fused image and source images. Higher CC values imply better consistency and less information loss.
Noise Amplification Based Fusion (NABF): Evaluates the degree of noise introduced during fusion. Lower values are preferred, as they indicate fewer fusion-related artifacts.
Multi-Scale Structural Similarity (MS-SSIM): Assesses structural similarity across multiple scales. Higher values suggest that the fused image maintains perceptual quality and structural fidelity.
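Four of the no-reference metrics above (SD, EN, AG, SF) can be sketched with their commonly used definitions. The exact variants used in the paper's evaluation code are not stated, so the forward-difference gradients, the 256-bin histogram for 8-bit images, and the function names are assumptions.

```python
import numpy as np

def standard_deviation(img):
    """SD: standard deviation of pixel intensities (global contrast)."""
    return float(np.std(img))

def entropy(img, bins=256):
    """EN: Shannon entropy (bits) of the gray-level histogram of an 8-bit image."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 256))
    p = hist[hist > 0] / hist.sum()
    return float(-np.sum(p * np.log2(p)))

def average_gradient(img):
    """AG: mean of sqrt((gx^2 + gy^2) / 2) using forward differences."""
    img = img.astype(np.float64)
    gx = img[:-1, 1:] - img[:-1, :-1]
    gy = img[1:, :-1] - img[:-1, :-1]
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0)))

def spatial_frequency(img):
    """SF: sqrt(RF^2 + CF^2) from horizontal and vertical first differences."""
    img = img.astype(np.float64)
    rf = np.sqrt(np.mean((img[:, 1:] - img[:, :-1]) ** 2))
    cf = np.sqrt(np.mean((img[1:, :] - img[:-1, :]) ** 2))
    return float(np.sqrt(rf ** 2 + cf ** 2))
```

A constant image scores zero on all four metrics, while an image split evenly between gray levels 0 and 255 has an entropy of exactly 1 bit.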

4.2. Ablation Study

In this section, we will analyze the influence of each key part: the number of WA blocks, the processing order of the infrared and visible-light branches within WA-L and WA-H, the influence of the WA block, and the removal of certain loss functions.

4.2.1. The Number of WA Blocks and the Processing Order of WA-L and WA-H

To conduct the ablation study on the WA module, which includes WA-L and WA-H modes, experiments are designed to explore the processing order of the infrared and visible-light branches and the optimal number of modules. Specifically, the processing sequence of the WA block hinges on the allocation of WA-L and WA-H; one is designated for the infrared branch, while the other is assigned to the visible-light branch, thereby forming the LR-HV and HR-LV configurations. Experiments with one module (L1-L1, H1-H1, L1-H1, and H1-L1), two modules (L2-L2, H2-H2, L2-H2, and H2-L2), and asymmetric module numbers (L2-H1, L1-H2, H2-L1, and H1-L2) are conducted. The notation “L1-H1” is used to denote the scenario where the number of WA-L in the infrared branch and WA-H in the visible-light branch is 1. The remaining naming conventions are analogous. The visualized results are presented in Figure 6, while the metric values are detailed in Table 1. The best values are indicated in bold, and the second-best values are denoted in italic and red.
As observed in Figure 6, the result obtained by the L2-H1 configuration contains more detail information and less artificially generated noise. However, the visual differences between these results are still small. Thus, six metrics are utilized to evaluate performance.
Table 1 compares different block quantities and processing orders, showing that the proposed network with two WA-L blocks and one WA-H block (L2-H1) obtains better metric values (SCD, SF, CC). In many high-level visual tasks, a deeper stack of modules often means better performance. However, in image fusion, which is a low-level visual task, using too many modules can gradually obscure the unique semantic features of salient regions. Therefore, it is necessary to retain a certain number of infrared feature extraction modules while limiting the number of layers in order to better extract the features of salient regions.

4.2.2. The Influence of WA Block

To evaluate the effectiveness of the WA block, the WA block in our fusion network is replaced with the original self-attention block without wavelet transform. The influence of the WA-L and WA-H modes is analyzed in the same way, and the visualization and evaluation metrics of these variants are shown in Figure 7 and Table 2. In this comparison, “SA-WAH” indicates that WA-L is replaced by the original self-attention block, “WAL-SA” indicates that WA-H is replaced by the original self-attention block, and “two SA” means that only the original self-attention block is used. WAL-WAH is our method. In Table 2, the best values are indicated in bold, and the second-best values are denoted in italic and red.

4.2.3. Analysis for Loss Function

In the second training stage, the loss function (Lall) contains two items: the pixel intensity part (Lint) and the Heat-Consistency part (Lheat). In Table 3, “w/o Lint” indicates that Lint is removed, so only the Heat-Consistency part is used to train our network, and “w/o Lheat” means that Lheat is removed, so only the pixel intensity part is used.
As detailed in Table 3, the fusion results obtained using the loss functions “w/o Lheat” and “w/o Lint” significantly lag behind those achieved with Lall. The former excessively focuses on detailed information, while the latter overly emphasizes the salient region features in infrared images. In contrast, our loss function design attends to the key information from both modalities, thereby obtaining higher quality fused images.

4.3. Fusion Result Analysis

In this section, six state-of-the-art fusion methods and eight metrics are chosen to evaluate the fusion performance of our proposed fusion network. The comparison experiments are conducted on three public fusion datasets (TNO, MSRS, and M3FD).
Compared with the fusion method based on guided filtering (GIFuse [23]), the CNN-based method (PIAFusion [24]), the masked-autoencoder-based method (MaeFuse [19]), the cross-attention-based method (CrossFuse [39]), the Transformer-based method (SwinFusion [40]), and a fusion method utilizing wavelet transform and Transformer (YDTR [41]), the fused images obtained by our method contain more detailed salient regions and clearer visuals on the M3FD, TNO, and MSRS datasets.
To assess the fused image quality objectively, eight metrics are selected. The metric values are shown in tables, the best values are denoted in bold, and the second-best values are denoted in italic and red.
In terms of visualization, since the visible images of MSRS and M3FD are in RGB space, they are converted into the YCrCb color space to better present the fusion results. “Y” indicates the luminance, and “Cr” and “Cb” denote the chrominance. To obtain the RGB fused image, the “Y” channel is replaced by the grayscale fused image generated by the fusion method, and the result is converted back to RGB.
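The color recombination step can be sketched as follows. The conversion uses the common full-range BT.601 YCrCb convention for 8-bit images; the text does not say which conversion the authors used, so the coefficients and the function names are assumptions.

```python
import numpy as np

def rgb_to_ycrcb(rgb):
    """Full-range BT.601 RGB -> YCrCb for 8-bit images (assumed convention)."""
    rgb = rgb.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cr = (r - y) * 0.713 + 128.0
    cb = (b - y) * 0.564 + 128.0
    return np.stack([y, cr, cb], axis=-1)

def ycrcb_to_rgb(ycrcb):
    """Algebraic inverse of rgb_to_ycrcb."""
    y, cr, cb = ycrcb[..., 0], ycrcb[..., 1], ycrcb[..., 2]
    r = y + (cr - 128.0) / 0.713
    b = y + (cb - 128.0) / 0.564
    g = (y - 0.299 * r - 0.114 * b) / 0.587
    return np.stack([r, g, b], axis=-1)

def colorize_fused(visible_rgb, fused_gray):
    """Replace the Y channel of the visible image with the grayscale fused image."""
    ycrcb = rgb_to_ycrcb(visible_rgb)
    ycrcb[..., 0] = fused_gray  # keep Cr/Cb from the visible image
    rgb = ycrcb_to_rgb(ycrcb)
    return np.clip(np.rint(rgb), 0, 255).astype(np.uint8)
```

As a sanity check on the design, feeding the visible image's own luminance back in as `fused_gray` reproduces the original RGB image, so all changes in the colorized output come from the fused luminance.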

4.3.1. Results on M3FD Dataset

Table 4 compares the proposed method with state-of-the-art alternatives on the M3FD dataset. Our proposed method achieves the best values in five metrics (SD, AG, CC, SCD, SF) and the second-best in three metrics (EN, NABF, MS-SSIM), indicating that it not only retains the information from the source images well but also enhances the clarity, contrast, and structural similarity while reducing noise and artifacts.
Figure 8 displays two pairs of infrared and visible images, “the person holding an umbrella” and “the trunk of the tree”, which are chosen to demonstrate the visual results generated by the existing fusion methods and the proposed method.

4.3.2. Results on TNO Dataset

Table 5 compares the proposed method with other state-of-the-art fusion methods on the TNO dataset. Our proposed method achieves the best values in five metrics (SD, AG, EN, SF, NABF), the second-best values in two metrics (SCD, MS-SSIM), and the third-best value in one metric (CC), indicating that fused images based on our method have high contrast, clarity, information richness, detail sharpness, and low noise in the military scenario.
The fusion results obtained by the proposed method and other existing fusion methods on TNO (“window” and “soldier”) are shown in Figure 9.

4.3.3. Results on MSRS Dataset

Table 6 compares the proposed method with other state-of-the-art fusion methods on the MSRS dataset. Our proposed method achieves the best values in two metrics (SCD and SF) and the second-best values in two metrics (SD and AG). Although the proposed method does not achieve the best value on every metric in the road scenes, it remains comparable to the state-of-the-art fusion methods and leads on SCD and SF, which reflect complementary-information transfer and texture richness. These observations indicate that our proposed method delivers competitive fusion performance in both visual evaluation and objective evaluation.
The fusion results obtained by the proposed method and other existing fusion methods on MSRS (“eaves” and “tire”) are shown in Figure 10.

4.3.4. Comparison of Computational Complexity

To comprehensively evaluate the computational complexity of the proposed fusion framework, we compare both the giga floating point operations (GFLOPs) and the average inference time against representative state-of-the-art IVIF methods. Table 7 summarizes the GFLOPs of all compared models measured at an input resolution of 768 × 576, together with the corresponding inference time on the same hardware platform. Specifically, all methods are executed on a unified environment equipped with an Intel Core i5-12600KF CPU (3.7 GHz), 16 GB RAM, and an NVIDIA RTX 4060 GPU to ensure a fair comparison of computational costs.
As shown in Table 7, CNN-based architectures such as PIAFusion [24] exhibit the highest computational burden, with significantly larger GFLOPs. In contrast, lightweight wavelet-based frameworks (e.g., YDTR [41]) achieve the lowest GFLOPs, reflecting their shallow feature extraction pipelines. The proposed method has the shortest inference time and maintains a moderate GFLOPs level while delivering superior fusion quality, demonstrating a better balance between complexity and performance.
Regarding inference speed, the average fusion time is computed over 42 image pairs from the TNO dataset. The results show that our method achieves an inference time of 144.16 ± 54.47 ms, representing both the lowest average latency and the smallest minimum value among all compared approaches. Such results verify that the proposed model achieves competitive computational efficiency while preserving high-quality fusion performance.
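A mean ± standard deviation latency figure of this kind can be produced with a small timing helper such as the sketch below. The helper names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator was used.

```python
import statistics
import time

def time_call_ms(fn, *args):
    """Wall-clock duration of one call to fn(*args), in milliseconds."""
    start = time.perf_counter()
    fn(*args)
    return (time.perf_counter() - start) * 1000.0

def summarize_latency(times_ms):
    """Return (mean, population standard deviation) of per-pair latencies in ms."""
    return statistics.fmean(times_ms), statistics.pstdev(times_ms)
```

When timing a GPU model, the device must be synchronized before each clock reading, because kernel launches return asynchronously and the measured time would otherwise exclude part of the computation.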

5. Conclusions

This paper introduces Wave-Cross, a wavelet–attention fusion framework for infrared–visible image fusion. By combining discrete wavelet transform with cross-attention, the WA-H and WA-L modules disentangle frequency-specific features, allowing visible images to contribute fine textures while preserving infrared thermal saliency. A novel Heat-Consistency Loss further enforces thermal order and local energy preservation, effectively avoiding temperature inversion and energy leakage.
Experiments on the TNO, MSRS, and M3FD datasets demonstrate that Wave-Cross achieves the best or second-best values on most objective metrics compared with state-of-the-art methods, together with competitive visual quality. The framework demonstrates strong robustness in interference-prone scenarios, maintaining interpretable thermal contrast while enhancing structural details, which confirms its potential for applications such as surveillance, autonomous driving, and fault diagnosis.
While Wave-Cross demonstrates strong fusion performance across diverse infrared–visible scenarios, several limitations also suggest promising future research directions. In extreme imaging conditions, such as visible images captured under severe low illumination, the high- and low-frequency sub-bands may become unstable, occasionally leading to slight edge over-enhancement or local intensity artifacts in the fused results. Future work will therefore explore more robust frequency decomposition strategies and adaptive attention stabilization mechanisms to improve reliability under such challenging sensor degradations.
Beyond addressing these limitations, several broader extensions are planned. First, we aim to develop lightweight architectural variants of Wave-Cross to enable real-time inference on embedded and edge hardware, which is crucial for mobile robotics and surveillance systems. Second, we intend to generalize the current two-modality framework toward multi-channel fusion scenarios, such as RGB–IR or RGB–T integration, which requires more flexible frequency-attention coupling across heterogeneous imaging domains. Third, incorporating task-driven supervision—such as detection, segmentation, or tracking objectives—will allow the fusion process to better align with downstream perception tasks, enabling fused representations that are not only visually enhanced but also semantically informative. These directions will further strengthen the practicality, adaptability, and generalization capability of Wave-Cross in real-world multimodal perception applications.

Author Contributions

Conceptualization, Z.Z.; methodology, J.G.; software, J.G.; validation, J.G. and S.L.; resources, X.Z.; writing—original draft preparation, J.G.; writing—review and editing, J.G. and Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Qiyuan Laboratory, grant number JCJQ-LA-001-077.

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon reasonable request. No publicly archived datasets were generated or analyzed during this study due to institutional data-use restrictions.

Acknowledgments

The authors would like to thank the Qiyuan Laboratory for providing support.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Li, H.; Wu, X.-J.; Kittler, J. RFN-Nest: An end-to-end residual fusion network for infrared and visible images. Inf. Fusion 2021, 73, 72–86.
  2. Shi, P.; Mao, F.; Zhang, R. DAE-Nest: A depth information extraction and enhancement fusion network for infrared and visible images. Opt. Commun. 2024, 560, 130441.
  3. Huang, Z.; Yang, B.; Liu, C. RDCa-Net: Residual dense channel attention symmetric network for infrared and visible image fusion. Infrared Phys. Technol. 2023, 130, 104589.
  4. Paramanandham, N.; Rajendiran, K. Infrared and visible image fusion using discrete cosine transform and swarm intelligence for surveillance applications. Infrared Phys. Technol. 2018, 88, 13–22.
  5. Wang, Z.; Ma, Y.; Zhang, Y. Review of pixel-level remote sensing image fusion based on deep learning. Inf. Fusion 2023, 90, 36–58.
  6. Yang, Z.; Huang, P.; He, D.; Cai, Z.; Yin, Z. SiamMMF: Multi-modal multi-level fusion object tracking based on Siamese networks. Mach. Vis. Appl. 2023, 34, 7.
  7. Zhang, Q.; Wang, L.; Li, H.; Ma, Z. Similarity-based multimodality image fusion with shiftable complex directional pyramid. Pattern Recognit. Lett. 2011, 32, 1544–1553.
  8. Jun, C.; Lei, C.; Wei, L.; Yang, Y. Infrared and visible image fusion via gradientlet filter and salience-combined map. Multimed. Tools Appl. 2024, 83, 57223–57241.
  9. Liu, Y.; Chen, X.; Ward, R.K.; Wang, Z.J. Image Fusion with Convolutional Sparse Representation. IEEE Signal Process. Lett. 2016, 23, 1882–1886.
  10. Zhang, X.; Ma, Y.; Fan, F.; Zhang, Y.; Huang, J. Infrared and visible image fusion via saliency analysis and local edge-preserving multi-scale decomposition. J. Opt. Soc. Am. A 2017, 34, 1400–1410.
  11. Meng, F.; Guo, B.; Song, M.; Zhang, X. Image fusion with saliency map and interest points. Neurocomputing 2016, 177, 1–8.
  12. Zheng, Y.; Essock, E.; Hansen, B.C. An advanced image fusion algorithm based on wavelet transform—Incorporation with PCA and morphological processing. In Proceedings of the SPIE International Society for Optical Engineering, San Jose, CA, USA, 18–22 January 2004; Volume 5298.
  13. Li, H.; Wu, X. DenseFuse: A Fusion Approach to Infrared and Visible Images. IEEE Trans. Image Process. 2019, 28, 2614–2623.
  14. Li, H.; Wu, X.-J.; Kittler, J. Infrared and Visible Image Fusion using a Deep Learning Framework. In Proceedings of the 2018 International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 2705–2710.
  15. Zhang, Y.; Liu, Y.; Sun, P.; Yan, H.; Zhao, X.; Zhang, L. IFCNN: A general image fusion framework based on convolutional neural network. Inf. Fusion 2020, 54, 99–118.
  16. Zhang, H.; Ma, J. SDNet: A Versatile Squeeze-and-Decomposition Network for Real-Time Image Fusion. Int. J. Comput. Vis. 2021, 129, 2761–2785.
  17. Ma, J.; Yu, W.; Liang, P.; Li, C.; Jiang, J. FusionGAN: A generative adversarial network for infrared and visible image fusion. Inf. Fusion 2019, 48, 11–26.
  18. Ma, J.; Zhang, H.; Shao, Z.; Liang, P.; Xu, H. GANMcC: A Generative Adversarial Network with Multiclassification Constraints for Infrared and Visible Image Fusion. IEEE Trans. Instrum. Meas. 2021, 70, 5005014.
  19. Li, J.; Jiang, J.; Liang, P.; Ma, J.; Nie, L. MaeFuse: Transferring Omni Features with Pretrained Masked Autoencoders for Infrared and Visible Image Fusion via Guided Training. IEEE Trans. Image Process. 2025, 34, 1340–1353.
  20. Zhao, Z.; Bai, H.; Zhang, J.; Zhang, Y.; Xu, S.; Lin, Z.; Timofte, R.; Van Gool, L. CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion. arXiv 2022, arXiv:2211.14461.
  21. Singh, S.; Singh, H.; Gehlot, A.; Kaur, J.; Gagandeep. IR and visible image fusion using DWT and bilateral filter. Microsyst. Technol. 2023, 29, 457–467.
  22. Ravi, J.; Narmadha, R. Optimized dual-tree complex wavelet transform aided multimodal image fusion with adaptive weighted average fusion strategy. Sci. Rep. 2024, 14, 30246.
  23. Wang, W.; Deng, L.-J.; Vivone, G. A general image fusion framework using multi-task semi-supervised learning. Inf. Fusion 2024, 108, 102414.
  24. Tang, L.; Yuan, J.; Zhang, H.; Jiang, X.; Ma, J. PIAFusion: A progressive infrared and visible image fusion network based on illumination aware. Inf. Fusion 2022, 83–84, 79–92.
  25. Toet, A. The TNO Multiband Image Data Collection. Data Brief. 2017, 15, 249–251.
  26. Zhao, Z.; Bai, H.; Zhang, J.; Zhang, Y.; Zhang, K.; Xu, S.; Chen, D.; Timofte, R.; Van Gool, L. Equivariant Multi-Modality Image Fusion. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 25912–25921.
  27. Liu, J.; Fan, X.; Huang, Z.; Wu, G.; Liu, R.; Zhong, W.; Luo, Z. Target-aware Dual Adversarial Learning and a Multi-scenario Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection. arXiv 2022, arXiv:2203.16220.
  28. Kumar, B.K.S. Multifocus and multispectral image fusion based on pixel significance using discrete cosine harmonic wavelet transform. Signal Image Video Process. 2013, 7, 1125–1143.
  29. Li, H.; Manjunath, B.S.; Mitra, S.K. Multisensor Image Fusion Using the Wavelet Transform. Graph. Models Image Process. 1995, 57, 235–245.
  30. Mitianoudis, N.; Stathaki, T. Pixel-based and region-based image fusion schemes using ICA bases. Inf. Fusion 2007, 8, 131–142.
  31. Song, M.; Lu, L.; Peng, Y.; Jiang, T.; Li, J. Infrared & visible images fusion based on redundant directional lifting-based wavelet and saliency detection. Infrared Phys. Technol. 2019, 101, 45–55.
  32. Liu, S.; Deng, W. Very deep convolutional neural network based image classification using small training sample size. In Proceedings of the 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), Kuala Lumpur, Malaysia, 3–6 November 2015; pp. 730–734.
  33. Li, H.; Wu, X.-J.; Durrani, T.S. Infrared and visible image fusion with ResNet and zero-phase component analysis. Infrared Phys. Technol. 2019, 102, 103039.
  34. Li, H.; Wu, X.-J.; Durrani, T. NestFuse: An Infrared and Visible Image Fusion Architecture Based on Nest Connection and Spatial/Channel Attention Models. IEEE Trans. Instrum. Meas. 2020, 69, 9645–9656.
  35. Tang, L.; Yuan, J.; Ma, J. Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network. Inf. Fusion 2022, 82, 28–42.
  36. Ma, J.; Xu, H.; Jiang, J.; Mei, X.; Zhang, X.-P. DDcGAN: A Dual-Discriminator Conditional Generative Adversarial Network for Multi-Resolution Image Fusion. IEEE Trans. Image Process. 2020, 29, 4980–4995.
  37. Wang, J.; Chen, Y.; Sun, X.; Xing, H.; Zhang, F.; Song, S.; Yu, S. Advancing infrared and visible image fusion with an enhanced multiscale encoder and attention-based networks. iScience 2024, 27, 110915.
  38. Wang, J.; Ling, Q. FDNet: Frequency Decomposition Network for Learned Image Compression. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 11241–11255.
  39. Li, H.; Wu, X.-J. CrossFuse: A novel cross attention mechanism based infrared and visible image fusion approach. Inf. Fusion 2024, 103, 102147.
  40. Ma, J.; Tang, L.; Fan, F.; Huang, J.; Mei, X.; Ma, Y. SwinFusion: Cross-domain Long-range Learning for General Image Fusion via Swin Transformer. IEEE/CAA J. Autom. Sin. 2022, 9, 1200–1217.
  41. Tang, W.; He, F.; Liu, Y. YDTR: Infrared and Visible Image Fusion via Y-Shape Dynamic Transformer. IEEE Trans. Multimed. 2023, 25, 5413–5428.
Figure 1. Overall architecture of the Wave-Cross fusion model. The connections between the encoder and decoder are two residual structures. H and L mode WA modules are used for visible-light characteristics and infrared characteristics, respectively. Each line or arrow represents a data-flow path; the yellow and blue lines in this figure, however, are specifically residual connections. GELU is employed as the activation function and is inserted between the two fully-connected layers.
Figure 2. Encoder architecture with two residual structures connected to the decoder. The encoder's DenseBlock concatenates the feature maps of all preceding layers and uses the result as the input to each subsequent layer. The layer-by-layer connections of dense blocks in the decoder are represented as colored lines.
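The dense concatenation described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names `dense_block` and `make_toy_layer`, the growth of two channels per layer, and the averaging "layer" are all hypothetical stand-ins chosen only to show how the channel list grows.

```python
from typing import Callable, List

Channel = List[float]                       # one flattened feature map
Layer = Callable[[List[Channel]], List[Channel]]

def dense_block(x: List[Channel], layers: List[Layer]) -> List[Channel]:
    """Feed each layer the concatenation of the block input and all earlier outputs."""
    features = list(x)                      # running channel-wise concatenation
    for layer in layers:
        new = layer(features)               # the layer sees every preceding feature map
        features = features + new           # concatenate along the channel axis
    return features

def make_toy_layer(growth: int) -> Layer:
    """Hypothetical stand-in for a conv + activation layer (assumption):
    emits `growth` channels, each the element-wise mean of all input channels."""
    def layer(channels: List[Channel]) -> List[Channel]:
        n = len(channels)
        mean = [sum(vals) / n for vals in zip(*channels)]
        return [list(mean) for _ in range(growth)]
    return layer

x = [[1.0, 2.0], [3.0, 4.0]]                # 2 input channels, 2 pixels each
out = dense_block(x, [make_toy_layer(2) for _ in range(3)])
print(len(out))                             # 2 input channels + 3 layers * 2 new channels
```

With an input of c0 channels and a fixed growth of g channels per layer, the i-th layer (counting from zero) receives c0 + i * g channels, which is why dense blocks are usually kept shallow.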
Figure 3. WA module architecture in H or L mode. The two modes differ in that, after the wavelet transform, the numbers of features of the two branches fed into the WA module are interchanged. DWT denotes the discrete wavelet transform and IWT denotes the inverse wavelet transform.
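As a rough illustration of the DWT and IWT operations named in the Figure 3 caption, the sketch below implements a single-level 2D transform and its inverse. The wavelet family is not stated in this part of the article, so the orthonormal Haar wavelet is an assumption here, and the function names `haar_dwt2` and `haar_iwt2` and the sub-band names are hypothetical.

```python
from typing import List, Tuple

Image = List[List[float]]

def haar_dwt2(img: Image) -> Tuple[Image, Image, Image, Image]:
    """Single-level 2D Haar DWT (assumed wavelet). Returns the low-frequency
    band and three detail bands, each half the input size per axis."""
    h, w = len(img), len(img[0])
    assert h % 2 == 0 and w % 2 == 0, "even height and width required"
    ll, d_col, d_row, d_diag = [], [], [], []
    for i in range(0, h, 2):
        r_ll, r_col, r_row, r_diag = [], [], [], []
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]          # top row of the 2x2 block
            c, d = img[i + 1][j], img[i + 1][j + 1]  # bottom row of the 2x2 block
            r_ll.append((a + b + c + d) / 2)         # local average (low frequency)
            r_col.append((a - b + c - d) / 2)        # difference across columns
            r_row.append((a + b - c - d) / 2)        # difference across rows
            r_diag.append((a - b - c + d) / 2)       # diagonal difference
        ll.append(r_ll); d_col.append(r_col)
        d_row.append(r_row); d_diag.append(r_diag)
    return ll, d_col, d_row, d_diag

def haar_iwt2(ll: Image, d_col: Image, d_row: Image, d_diag: Image) -> Image:
    """Inverse of haar_dwt2: rebuilds each 2x2 block from its four coefficients."""
    h, w = len(ll), len(ll[0])
    out = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            s, x, y, z = ll[i][j], d_col[i][j], d_row[i][j], d_diag[i][j]
            out[2 * i][2 * j] = (s + x + y + z) / 2
            out[2 * i][2 * j + 1] = (s - x + y - z) / 2
            out[2 * i + 1][2 * j] = (s + x - y - z) / 2
            out[2 * i + 1][2 * j + 1] = (s - x - y + z) / 2
    return out

img = [[1.0, 2.0, 3.0, 4.0],
       [5.0, 6.0, 7.0, 8.0],
       [9.0, 10.0, 11.0, 12.0],
       [13.0, 14.0, 15.0, 16.0]]
bands = haar_dwt2(img)
restored = haar_iwt2(*bands)
```

Because the transform is invertible, features can be processed per sub-band and then mapped back to the spatial domain without losing information from the transform itself.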
Figure 4. Decoder architecture used in the fusion stage. Residual structures are employed to mix the features from the pre-trained encoders of the two modalities with the features from the decoding stage.
Figure 5. Decoder structure used in the encoder pre-training stage. Green and purple arrows indicate the residual structure. This decoder differs from the one used in the subsequent fusion stage.
Figure 6. Results of ablation studies with different numbers of WA blocks and different processing orders of the infrared and visible-light branches within WA-L and WA-H on the TNO dataset. The yellow-framed smoke marks a region of markedly elevated temperature, whereas the blue-framed areas correspond to background structures such as walls.
Figure 7. Fusion results obtained with different ablation settings on the TNO dataset. The WAL-WAH method accurately reconstructs the salient regions in infrared images while preserving the smooth details and luminance characteristics from visible-light images.
Figure 8. Fusion results on M3FD. Our method maximizes the retention of visible-light details (tree and floor details remain clearly visible) and largely restores the temperature readability of the infrared salient regions. The person holding an umbrella inside the yellow bounding box marks a region of markedly elevated temperature, while the blue-framed areas correspond to background elements such as trees.
Figure 9. Fusion results on TNO. Our method maximizes the retention of visible-light details (clear details of the windows and the trees on the left side) and largely restores the thermal readability of key infrared salient regions. The person enclosed by the blue bounding box denotes a region of significantly elevated temperature, whereas the windows within the yellow frame represent background areas.
Figure 10. Fusion results on MSRS. Our method maximizes the retention of visible-light details (clear road details and tire details) and effectively preserves the thermal interpretability of salient infrared regions.
Table 1. Ablation study of different numbers and processing orders of WA-L and WA-H blocks on fusion performance (TNO dataset).

| Name | SD | AG | SCD | EN | SF | CC |
|---|---|---|---|---|---|---|
| L1-L1 | 43.68 | 4.69 | 1.73 | 7.16 | 12.43 | 0.479 |
| H1-H1 | 43.54 | 4.74 | 1.77 | 7.17 | 12.56 | 0.481 |
| L1-H1 | 41.64 | 4.10 | 1.72 | 7.08 | 11.02 | 0.460 |
| H1-L1 | 42.87 | 4.62 | 1.71 | 7.13 | 12.07 | 0.477 |
| L2-L2 | 46.12 | 4.82 | 1.70 | 7.23 | 12.59 | 0.468 |
| H2-H2 | 46.56 | 4.94 | 1.68 | 7.24 | 12.64 | 0.466 |
| L2-H2 | 44.38 | 4.76 | 1.69 | 7.17 | 12.80 | 0.463 |
| H2-L2 | 43.57 | 4.69 | 1.76 | 7.16 | 12.36 | 0.489 |
| H1-L2 | 44.71 | 4.91 | 1.68 | 7.16 | 12.88 | 0.462 |
| L1-H2 | 43.69 | 4.43 | 1.71 | 7.15 | 11.81 | 0.478 |
| H2-L1 | 44.98 | 4.65 | 1.73 | 7.19 | 12.19 | 0.476 |
| L2-H1 (ours) | 44.28 | 4.93 | 1.78 | 7.19 | 13.12 | 0.490 |
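For readers unfamiliar with the column headers, the sketch below computes three of the tabulated metrics (SD, EN, SF) under their commonly used definitions. The exact normalization used for the reported numbers is not given in this section, so the formulas here are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List

Image = List[List[int]]  # 8-bit grayscale image, row-major

def std_dev(img: Image) -> float:
    """SD: population standard deviation of pixel intensities (a contrast proxy)."""
    px = [v for row in img for v in row]
    mean = sum(px) / len(px)
    return math.sqrt(sum((v - mean) ** 2 for v in px) / len(px))

def entropy(img: Image) -> float:
    """EN: Shannon entropy (bits) of the gray-level histogram."""
    px = [v for row in img for v in row]
    n = len(px)
    return -sum((c / n) * math.log2(c / n) for c in Counter(px).values())

def spatial_frequency(img: Image) -> float:
    """SF: sqrt(RF^2 + CF^2), with row and column frequencies taken from
    squared first differences (normalized by the pixel count; an assumption)."""
    h, w = len(img), len(img[0])
    rf2 = sum((img[i][j] - img[i][j - 1]) ** 2
              for i in range(h) for j in range(1, w)) / (h * w)
    cf2 = sum((img[i][j] - img[i - 1][j]) ** 2
              for i in range(1, h) for j in range(w)) / (h * w)
    return math.sqrt(rf2 + cf2)

checker = [[0, 255], [255, 0]]   # tiny high-contrast test pattern
sd = std_dev(checker)
en = entropy(checker)
sf = spatial_frequency(checker)
```

All three quantities grow with contrast and fine detail, which is consistent with the tables treating higher values as better.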
Table 2. Ablation study of the impact of the WA module compared with standard self-attention on fusion performance (TNO dataset).

| Name | SD | AG | SCD | EN | SF | CC |
|---|---|---|---|---|---|---|
| SA-WAH | 45.79 | 4.83 | 1.70 | 7.22 | 12.66 | 0.463 |
| WAL-SA | 44.42 | 3.65 | 1.38 | 7.10 | 9.82 | 0.393 |
| two SA | 42.41 | 4.57 | 1.76 | 7.11 | 12.15 | 0.491 |
| WAL-WAH | 44.28 | 4.93 | 1.78 | 7.19 | 13.12 | 0.490 |
Table 3. Ablation study on the effect of different loss function settings (pixel intensity loss and Heat-Consistency Loss) on the fusion performance using the TNO dataset.

| Name | SD | AG | SCD | EN | SF | CC |
|---|---|---|---|---|---|---|
| L_all | 45.79 | 4.83 | 1.70 | 7.22 | 12.66 | 0.463 |
| w/o L_heat | 42.87 | 4.62 | 1.71 | 7.13 | 12.07 | 0.477 |
| w/o L_int | 38.58 | 2.65 | 0.83 | 6.67 | 10.68 | 0.271 |
Table 4. Quantitative comparison of fusion methods on the M3FD dataset using eight objective evaluation metrics. Except for the NABF metric (indicated with an arrow), higher values are preferable for all other indicators.

| Name | SD | AG | SCD | EN | SF | CC | NABF↓ | MS_SSIM |
|---|---|---|---|---|---|---|---|---|
| SwinFusion | 35.84 | 4.61 | 1.56 | 6.79 | 13.69 | 0.52 | 0.0164 | 0.9333 |
| GIFuse | 27.35 | 4.63 | 1.45 | 6.51 | 13.94 | 0.54 | 0.0172 | 0.9426 |
| MaeFuse | 36.10 | 3.76 | 1.75 | 7.02 | 9.53 | 0.57 | 0.0142 | 0.9477 |
| PIAFusion | 36.56 | 5.06 | 1.50 | 6.84 | 15.10 | 0.49 | 0.0192 | 0.9182 |
| CrossFuse | 30.93 | 3.88 | 0.92 | 6.59 | 11.62 | 0.42 | 0.0274 | 0.8714 |
| YDTR | 27.99 | 3.30 | 1.51 | 6.55 | 10.10 | 0.55 | 0.0271 | 0.9104 |
| Ours | 37.62 | 5.28 | 1.80 | 6.99 | 15.97 | 0.58 | 0.0149 | 0.9447 |
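Because NABF is the only lower-is-better column, picking the best method per metric needs a direction flag. The sketch below applies that rule to the M3FD values copied from Table 4; the names `TABLE4`, `LOWER_IS_BETTER`, and `best_per_metric` are hypothetical helpers, not part of the article.

```python
from typing import Dict, Tuple

METRICS = ("SD", "AG", "SCD", "EN", "SF", "CC", "NABF", "MS_SSIM")
LOWER_IS_BETTER = {"NABF"}   # every other metric: higher is better

# Values copied from Table 4 (M3FD dataset), in METRICS order.
TABLE4: Dict[str, Tuple[float, ...]] = {
    "SwinFusion": (35.84, 4.61, 1.56, 6.79, 13.69, 0.52, 0.0164, 0.9333),
    "GIFuse":     (27.35, 4.63, 1.45, 6.51, 13.94, 0.54, 0.0172, 0.9426),
    "MaeFuse":    (36.10, 3.76, 1.75, 7.02, 9.53, 0.57, 0.0142, 0.9477),
    "PIAFusion":  (36.56, 5.06, 1.50, 6.84, 15.10, 0.49, 0.0192, 0.9182),
    "CrossFuse":  (30.93, 3.88, 0.92, 6.59, 11.62, 0.42, 0.0274, 0.8714),
    "YDTR":       (27.99, 3.30, 1.51, 6.55, 10.10, 0.55, 0.0271, 0.9104),
    "Ours":       (37.62, 5.28, 1.80, 6.99, 15.97, 0.58, 0.0149, 0.9447),
}

def best_per_metric(table: Dict[str, Tuple[float, ...]]) -> Dict[str, str]:
    """Return the best-scoring method name for each metric column."""
    best = {}
    for k, metric in enumerate(METRICS):
        pick = min if metric in LOWER_IS_BETTER else max
        best[metric] = pick(table, key=lambda name: table[name][k])
    return best

winners = best_per_metric(TABLE4)
for metric in METRICS:
    print(f"{metric:8s} {winners[metric]}")
```

On these numbers the proposed method leads on SD, AG, SCD, SF, and CC, while MaeFuse leads on EN, NABF, and MS_SSIM.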
Table 5. Quantitative comparison of fusion methods on the TNO dataset using eight objective evaluation metrics. Except for the NABF metric (indicated with an arrow), higher values are preferable for all other indicators.

| Name | SD | AG | SCD | EN | SF | CC | NABF↓ | MS_SSIM |
|---|---|---|---|---|---|---|---|---|
| SwinFusion | 39.45 | 4.21 | 1.71 | 6.89 | 10.72 | 0.474 | 0.0358 | 0.8960 |
| GIFuse | 28.23 | 3.86 | 1.56 | 6.53 | 10.52 | 0.488 | 0.0449 | 0.8999 |
| MaeFuse | 35.30 | 3.68 | 1.82 | 6.91 | 8.18 | 0.524 | 0.0359 | 0.9381 |
| PIAFusion | 37.14 | 3.83 | 1.60 | 6.81 | 9.62 | 0.450 | 0.0581 | 0.8758 |
| CrossFuse | 39.97 | 3.76 | 1.34 | 6.92 | 9.96 | 0.402 | 0.0717 | 0.8172 |
| YDTR | 28.03 | 2.77 | 1.56 | 6.43 | 7.62 | 0.494 | 0.0605 | 0.8613 |
| Ours | 44.28 | 4.93 | 1.78 | 7.19 | 13.12 | 0.490 | 0.0337 | 0.9075 |
Table 6. Quantitative comparison of fusion methods on the MSRS dataset using eight objective evaluation metrics. Except for the NABF metric (indicated with an arrow), higher values are preferable for all other indicators.

| Name | SD | AG | SCD | EN | SF | CC | NABF↓ | MS_SSIM |
|---|---|---|---|---|---|---|---|---|
| SwinFusion | 43.00 | 3.57 | 1.69 | 6.62 | 11.09 | 0.598 | 0.0112 | 0.9692 |
| GIFuse | 32.49 | 3.31 | 1.38 | 6.31 | 10.41 | 0.611 | 0.0112 | 0.9584 |
| MaeFuse | 38.30 | 3.46 | 1.71 | 6.58 | 9.57 | 0.646 | 0.0083 | 0.9680 |
| PIAFusion | 45.34 | 3.97 | 1.70 | 6.64 | 12.12 | 0.600 | 0.0095 | 0.9704 |
| CrossFuse | 36.50 | 3.03 | 1.06 | 6.51 | 9.67 | 0.544 | 0.0243 | 0.9308 |
| YDTR | 25.37 | 2.20 | 1.13 | 5.65 | 7.40 | 0.631 | 0.0267 | 0.8872 |
| Ours | 43.30 | 3.78 | 1.75 | 6.53 | 12.22 | 0.607 | 0.0156 | 0.9611 |
Table 7. Quantitative comparison of computational complexity of fusion methods.

| Name | Runtime (ms)↓ | GFLOPs↓ |
|---|---|---|
| SwinFusion | 2097.25 ± 987.86 | 440.700 |
| GIFuse | 378.04 ± 158.40 | 254.453 |
| MaeFuse | 397.00 ± 6.15 | 1004.585 |
| PIAFusion | 3888.72 ± 1858.82 | 618.989 |
| CrossFuse | 511.35 ± 183.57 | 194.279 |
| YDTR | 187.24 ± 61.35 | 54.992 |
| Ours | 144.16 ± 54.47 | 319.447 |

MDPI and ACS Style

Zhou, Z.; Gu, J.; Li, S.; Shi, Y.; Zhou, X. Wave-Cross: Balancing Thermal Saliency and Visual Detail in Infrared–Visible Image Fusion. Electronics 2026, 15, 321. https://doi.org/10.3390/electronics15020321

