Article

TFI-Fusion: Hierarchical Triple-Stream Feature Interaction Network for Infrared and Visible Image Fusion

1 School of Information Science and Engineering, Yunnan University, Kunming 650500, China
2 Yunnan Transportation Engineering Quality Testing Co., Ltd., Kunming 650500, China
* Author to whom correspondence should be addressed.
Information 2025, 16(10), 844; https://doi.org/10.3390/info16100844
Submission received: 26 August 2025 / Revised: 16 September 2025 / Accepted: 25 September 2025 / Published: 30 September 2025

Abstract

As a key technology in multimodal information processing, infrared and visible image fusion holds significant application value in fields such as military reconnaissance, intelligent security, and autonomous driving. To address the limitations of existing methods, this paper proposes the Hierarchical Triple-Feature Interaction Fusion Network (TFI-Fusion). Built on a hierarchical triple-stream feature interaction mechanism, the network achieves high-quality fusion through a two-stage, separate-model processing approach. In the first stage, one model extracts low-rank components (representing global structural features) and sparse components (representing local detail features) from the source images via the Low-Rank Sparse Decomposition (LRSD) module, while capturing cross-modal shared features with the Shared Feature Extractor (SFE). In the second stage, a second model performs fusion and reconstruction: it first enhances the complementarity between low-rank and sparse features through the newly introduced Bi-Feature Interaction (BFI) module, then performs multi-level feature fusion via the Triple-Feature Interaction (TFI) module, and finally generates fused images with rich scene representation through feature reconstruction. This separate-model design reduces memory usage and improves inference speed. In addition, a multi-objective optimization function is designed around the network's characteristics. Experiments demonstrate that TFI-Fusion achieves excellent fusion performance, effectively preserving image details and enhancing feature complementarity, thereby providing reliable visual data support for downstream tasks.

1. Introduction

Image fusion technology, as a crucial method for multi-source information processing, significantly enhances scene perception in complex environments by integrating complementary information from different imaging modalities [1,2,3]. This technology has demonstrated tremendous application value in military reconnaissance, intelligent security, medical diagnosis, and other fields. Among these applications, infrared and visible image fusion has attracted considerable attention due to its unique complementary characteristics. Visible light images provide rich spatial details and color information, but their imaging quality is highly susceptible to lighting conditions [3,4,5]. In contrast, infrared images overcome the limitations of visible light by capturing the thermal radiation characteristics of objects, enabling all-weather target detection, albeit with lower spatial resolution and a lack of texture details [3,4,6]. Effectively fusing these two modalities can leverage their respective advantages to provide more reliable input for advanced vision tasks such as target detection and recognition [6,7].
In the evolution of image fusion technology, early methods primarily relied on traditional signal processing approaches. Multi-scale transformation techniques, including wavelet and contourlet transforms, decomposed source images at different scales and fused the sub-band coefficients using predefined rules [8,9]. While computationally efficient, these methods had notable limitations: the selection of decomposition layers often required empirical determination (typically 3–5 layers), the design of fusion rules for high/low-frequency components proved challenging, and they frequently produced artifacts when processing discontinuous edges [9,10].
The advent of deep learning brought significant advancements through CNN-based approaches. Architectures like VGG and ResNet improved feature extraction capabilities, yet faced inherent constraints due to their limited receptive fields and progressive loss of edge details in deeper networks [11,12,13].
GAN-based methods introduced innovative solutions through adversarial training, with models like FusionGAN incorporating discriminator networks to enhance output quality. However, these approaches suffered from training instability and difficulties in balancing feature preservation with enhancement [14,15,16].
More recently, Transformer-based models have demonstrated potential through global relationship modeling via self-attention mechanisms. While effective, these methods encounter substantial computational demands when processing high-resolution images and exhibit feature competition issues in their multi-head attention structures [12,17,18].

Our Contribution

This paper addresses the key challenges of insufficient feature interaction and detail loss in infrared and visible image fusion by proposing a novel hierarchical Triple-Feature Interaction network, termed the Triple-Feature Interaction Fusion Network (TFI-Fusion), as shown in Figure 1. The model achieves effective multimodal feature fusion through a carefully designed two-stage pipeline comprising three processing steps:
In stage 1 (feature extraction), a dual-branch feature decoupling architecture is constructed. First, the Low-Rank Sparse Decomposition (LRSD) module decouples each input image into two complementary components: the low-rank component captures global structure and background information, while the sparse component focuses on local details and salient features. Simultaneously, to fully exploit cross-modal common features, the Shared Feature Extractor (SFE) is designed, employing a weight-shared convolutional network to extract shared representations from the bimodal data. This multi-level feature decoupling strategy lays a solid foundation for subsequent feature interaction.
Stage 2 consists of two parts: Part a (Feature Interaction) and Part b (Feature Reconstruction).
In Part a (Feature Interaction), first, the Bi-Feature Interaction (BFI) module adaptively adjusts fusion weights via a cross-attention mechanism, achieving dynamic complementary enhancement between low-rank features and sparse features. Building on this, the Triple-Feature Interaction (TFI) module performs deep multi-scale fusion of low-rank features, sparse features, and shared features, ensuring comprehensive interaction among different modal features.
In Part b, for the interacted features, a Swin Transformer-based architecture is adopted for processing: multi-scale Swin Transformer blocks first conduct deep encoding, leveraging window-based self-attention to capture long-range dependencies and shift-window operations to strengthen inter-region feature interaction. Subsequently, a lightweight conv_out output module maps the encoded high-level features back to the image space.
Figure 1. The overview of the proposed Hierarchical Triple-Stream Feature Interaction Network.
This hybrid architecture, which combines the advantages of global attention and local convolution, enables the accurate reconstruction of local details while ensuring the global consistency of fused features. Additionally, the entire network adopts an end-to-end training approach, jointly optimizing the feature interaction and reconstruction modules to achieve adaptive balancing of different modal features, ultimately generating high-quality fused images with rich semantic information and fine texture details.
To optimize the training process, an innovative composite loss function system is designed for multi-level and multi-feature scenarios: (1) structural similarity-based content loss ensures complete preservation of global structural information; (2) Charbonnier loss enhances model robustness to outliers; (3) cosine similarity loss optimizes feature space alignment. Through multi-objective joint optimization, this loss function system significantly improves fusion performance, ensuring the complete retention of key source image information while effectively enhancing the adaptability of fusion results to downstream visual tasks.
In summary, the main contributions of this paper are as follows:
  • This study proposes a Hierarchical Triple-Feature Interaction Fusion Network (TFI-Fusion) with a three-stage architecture for high-quality infrared-visible image fusion.
  • This study employs Low-Rank Sparse Decomposition (LRSD) and a Shared Feature Extractor (SFE) to decouple and extract structural, detailed, and cross-modal features.
  • This study designs Bi-Feature Interaction (BFI) and Triple-Feature Interaction (TFI) modules to achieve dynamic multi-level feature fusion via cross-attention mechanisms.
  • This study builds a Swin Transformer-based reconstruction module, combining global attention and local convolution for detail preservation.
  • This study introduces a multi-objective loss (structural/Charbonnier/cosine similarity) to enhance fusion performance through end-to-end training.
The structure of this paper is as follows: Section 2 provides the necessary preliminaries. Section 3 elaborates on the proposed method in detail. Section 4 presents comprehensive experimental results and discussions on public datasets. Section 5 concludes our work. Finally, Section 6 offers an in-depth analysis of our work and directions for future improvement.

2. Related Work

The development of image fusion technology has evolved from traditional methods to modern deep learning approaches, with recent research primarily focusing on three technical routes: sparse representation, convolutional neural networks (CNNs), and Transformers.

2.1. Sparse Representation Methods

Sparse coding, a time-honored technique in the realm of image fusion, excels at feature representation by constructing overcomplete dictionaries. In 2009, Yang et al. [19] blazed a trail with the joint sparse representation method, which marked a significant breakthrough in multimodal feature fusion. By leveraging shared dictionary atoms, this method effectively aligned features across diverse modalities, laying a solid foundation for subsequent research. Li et al. [20] took the exploration further with the deep sparse coding network. They ingeniously integrated traditional sparse coding with deep learning through end-to-end dictionary learning optimization. This innovation represented a pivotal shift towards neural network-based approaches, opening up new possibilities for the field. Building on previous achievements, Liu et al. [21] introduced a multi-scale sparse representation framework that brought about remarkable enhancements in three critical dimensions. Firstly, by incorporating pyramid decomposition, the framework significantly bolstered multi-scale feature extraction, enabling a more comprehensive capture of image details. Secondly, the design of adaptive sparse constraints optimized feature selection, enhancing the model’s adaptability to various scenarios. Thirdly, the adoption of iterative optimization algorithms substantially improved computational efficiency, making the method more practical for real-world applications. Despite these advancements, sparse coding-based methods still face notable challenges. Their heavy reliance on manually designed sparse constraints restricts generalization, while the exponential growth of computational complexity with dictionary size poses scalability issues. These limitations underscore the need for further innovation to better handle complex image fusion tasks.

2.2. CNN-Based Methods

Convolutional neural networks (CNNs) have brought about a revolutionary transformation in the research paradigm of image fusion. Through a series of architectural innovations, CNNs have propelled significant progress in this field. The journey of CNN-enabled image fusion began with the groundbreaking work of Li et al. [22]. They introduced pretrained VGG networks for deep feature fusion, effectively showcasing the prowess of CNN-based hierarchical feature extraction. Following this seminal contribution, the methodological terrain of image fusion has witnessed remarkable expansion. Hou et al. [23] developed the unsupervised VIF-Net framework, integrating adversarial learning to produce more natural-looking fusion results. Jian et al. [24] proposed SEDRFuse, which features a symmetric encoder–decoder structure along with residual blocks, thereby enhancing feature preservation capabilities. Xu et al. [25] introduced the DRF approach, leveraging disentangled representation learning to more effectively separate and combine modality-specific features. Further advancements came in the form of specialized fusion strategies. Xu et al. [26] proposed classification saliency-based rules, which prioritize semantically significant regions during the fusion process. Li et al. [27] presented RFN-Nest, which utilizes nested connections to achieve superior multi-scale fusion. Wang et al. [28] addressed the issue of misalignment by developing a solution based on cross-modality generation and registration. Additionally, the research community has made crucial progress in performance optimization. For instance, Li et al. [29] proposed ResFuse, which employs residual learning to mitigate the problem of vanishing gradients. Despite the state-of-the-art performance achieved by CNN-based methods in local feature extraction, they are confronted with two fundamental limitations. First, the receptive field of convolutional operations is restricted, which impedes the modeling of long-range dependencies. Second, pooling operations inevitably lead to spatial information loss, thereby compromising the preservation of fine details. These inherent challenges have spurred recent investigations into Transformer-based architectures, which hold the potential to overcome these limitations while maintaining local feature precision.

2.3. Transformer-Based Methods

The rise of Transformers has brought revolutionary breakthroughs to image fusion. The seminal work of Vaswani et al. [30] first introduced the Transformer architecture and demonstrated its effectiveness for machine translation tasks. Following its remarkable success in natural language processing (NLP), the Transformer paradigm has been extensively adapted to address various computer vision challenges, including image classification, object detection, and image fusion. SwinFusion was proposed by Ma et al., which reduced computational complexity using shifted window attention [31]; CrossFormer was developed by Wang et al., which achieved multimodal feature interaction through cross-attention [32]. These methods demonstrate advantages in three aspects: global dependency modeling, adaptive feature interaction mechanisms, and multi-scale fusion performance. However, Transformer-based methods still face two key challenges: the need for large amounts of training data to avoid overfitting, and computational complexity that grows quadratically with image size.
The field of image fusion currently exhibits three notable characteristics: First, sparse representation methods are transitioning from standalone algorithms to embedded modules in deep networks. Second, CNN architectures continue to innovate, with recent studies improving performance through multi-level feature aggregation. Finally, hybrid architectures combining Transformers and CNNs have become a new trend. Future research needs to address core issues, including how to design more efficient cross-modal interaction mechanisms, how to balance computational complexity with fusion quality, and how to improve model generalization in real-world complex scenarios.
Recently, cross-modal interactions, which enable the exchange of information across modalities, have been widely adopted in Transformer-based multi-modal computer vision tasks [3,33,34,35]. These interactions allow Transformers to dynamically integrate complementary features from different modalities, significantly enhancing the performance of image fusion. However, implementing cross-modal interactions effectively remains challenging. Modality disparities in resolution and semantic content often impede seamless integration. Future research should focus on adaptive strategies to bridge these gaps for more robust image fusion.

3. Method

3.1. Overall Architecture

As illustrated in Figure 1, this study proposes a multi-feature-input, multi-stage optimization framework for infrared and visible image fusion, which consists of three key phases: feature pre-extraction, feature interaction, and feature reconstruction.
Phase 1 (Feature Extraction): In the feature pre-extraction stage, we first employ the Low-Rank Sparse Decomposition (LRSD) module to decompose the input infrared and visible images into structure-representing low-rank features ($L_x$/$L_y$) and detail-representing sparse features ($S_x$/$S_y$). The corresponding features from the two modalities are then concatenated to obtain the integrated low-rank feature $L$ and sparse feature $S$, while the Shared Feature Extractor (SFE) extracts the cross-modal common feature $C$. Finally, $L$, $S$, and $C$ together serve as the input features for the second stage.
The triple-stream architecture integrating low-rank, sparse, and shared features offers significant advantages over single-stream and dual-stream models: First, it achieves full coverage of three-dimensional key information (global–local–semantic). It uses low-rank features to capture cross-modal consistent global structures, sparse features to preserve modality-specific local details, and shared features to anchor cross-modal semantic commonalities, addressing the information loss issues of single-stream architectures (excessive information compression) and dual-stream architectures (lack of semantic anchors). Second, it features a dynamic interaction and verification mechanism with “triple mutual calibration”: shared features act as a semantic filter to eliminate noise and redundancy in low-rank/sparse features, while low-rank/sparse features in turn optimize the semantic bias of shared features, solving the defects of single-stream architectures (easy noise diffusion) and dual-stream architectures (lack of cross-modal verification) [36,37,38,39].
Phase 2(a) (Feature Interaction): Building upon the low-rank feature $L$, sparse feature $S$, and shared feature $C$ extracted in the first stage, we first employ the Bi-Feature Interaction (BFI) module for dynamic feature fusion. By adopting a cross-attention-like mechanism, BFI adaptively adjusts the complementary weights between the low-rank and sparse features to generate the primary fused features ($F_{BI}^{L}$/$F_{BI}^{S}$). The Tri-Feature Interaction (TFI) module then performs hierarchical feature integration on $F_{BI}^{L}$, $F_{BI}^{S}$, and the shared feature $C$, where a multi-scale attention mechanism enables cross-modal feature refinement and enhancement, ultimately producing the enhanced fusion feature $F_{TI}$ with comprehensive multimodal information.
Phase 2(b) (Feature Reconstruction and Output): In the final stage, the fused feature $F_{TI}$ from the previous stage is fed into a Swin Transformer-based feature reconstruction network. Leveraging its hierarchical attention mechanism, the network performs multi-scale semantic enhancement and spatial relationship modeling, effectively exploring deep correlations among cross-modal features. Finally, our lightweight convolutional output module ($\mathrm{Conv}_{out}$) maps the high-dimensional semantic features encoded by the Swin Transformer to the image space, producing the final fused image $I_f$ with well-preserved structural integrity and enriched details.
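To make the data flow through these phases concrete, the following PyTorch sketch wires the stages together. The submodule classes (LRSD, SFE, BFI, TFI, and the Swin-based reconstructor) are assumed to be defined elsewhere (sketches for several of them appear in the following subsections); all names and signatures here are illustrative rather than the authors' reference implementation.

```python
# Minimal sketch of the two-stage TFI-Fusion pipeline described above.
import torch
import torch.nn as nn


class TFIFusion(nn.Module):
    def __init__(self, lrsd, sfe, bfi_l, bfi_s, tfi, reconstructor):
        super().__init__()
        self.lrsd = lrsd            # Phase 1: low-rank / sparse decomposition
        self.sfe = sfe              # Phase 1: shared feature extractor
        self.bfi_l = bfi_l          # Phase 2(a): BFI dominated by low-rank features
        self.bfi_s = bfi_s          # Phase 2(a): BFI dominated by sparse features
        self.tfi = tfi              # Phase 2(a): triple-feature interaction
        self.reconstructor = reconstructor  # Phase 2(b): Swin-based reconstruction

    def forward(self, ir, vi):
        # Phase 1: decompose each modality and concatenate per feature type.
        L_x, S_x = self.lrsd(ir)
        L_y, S_y = self.lrsd(vi)
        L = torch.cat([L_x, L_y], dim=1)
        S = torch.cat([S_x, S_y], dim=1)
        C = self.sfe(ir, vi)

        # Phase 2(a): bi- and tri-feature interaction.
        F_bi_l = self.bfi_l(L, S)
        F_bi_s = self.bfi_s(S, L)
        F_ti = self.tfi(F_bi_l, F_bi_s, C)

        # Phase 2(b): reconstruct the fused image.
        return self.reconstructor(F_ti)
```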

3.2. Phase 1 Feature Extraction

In this stage, the input infrared image $I_{ir}$ and visible image $I_{vi}$ are decomposed through our Low-Rank Sparse Decomposition (LRSD) module to extract complementary representations: low-rank features ($L_x$/$L_y$) capturing structural contours and sparse features ($S_x$/$S_y$) preserving textural details. These modality-specific features are then concatenated to form consolidated representations ($L$ for low-rank and $S$ for sparse features). Simultaneously, a Shared Feature Extractor (SFE) learns cross-modal common features $C$ that establish the fundamental basis for subsequent fusion operations.
The low-rank features ( L x / L y ) and sparse features ( S x / S y ) extracted by the LRSD module, as outputs of the network’s first stage, provide subsequent deep learning networks with structured, redundancy-removed high-quality inputs. Through mathematical prior constraints, they effectively filter noise and strip away invalid information, enabling the second-stage network to focus on core tasks such as cross-modal alignment and detail enhancement without expending resources on noise processing and feature classification. Meanwhile, the mathematical priors of LRSD ensure the basic stability of feature decomposition, while deep learning networks can optimize its dictionaries and parameters through backpropagation to adapt to complex scenarios, forming a complementary enhancement of “prior constraints + data-driven” mechanisms.
In contrast, traditional simple convolutional layers have significant limitations: their noise suppression relies on “passive filtering” through local neighborhood averaging, which cannot distinguish between noise and valid details, easily leading to residual noise and blurred details. Moreover, the features they extract are mixed representations within local receptive fields, where global structures, local details, noise, and redundancy are intertwined without clear differentiation, belonging to “black-box features”—which is far less efficient and reliable than the structured feature expression achieved by LRSD through explicit mathematical decomposition.

3.2.1. Low-Rank Sparse Decomposition Module

In the proposed Hierarchical Triple-Stream Feature Interaction Network framework, the first-stage feature extraction layer employs an LRSD (Low-Rank and Sparse Decomposition) module based on low-rank sparse representation learning. By optimizing the objective function
$\min_{D_1, D_2, L, S} \|L\|_{*} + \lambda \|S\|_{1}, \quad \mathrm{s.t.}\ X = D_1 L + D_2 S$
where $X$ is the input data, in which each column denotes an image patch reshaped into a vector, $L$ is the low-rank coefficient matrix, and $S$ is the sparse coefficient matrix. $D_1$ and $D_2$ are dictionaries used to project $L$ and $S$ into the base part and the salient part, respectively (the specific solution algorithm follows LRRNet [40]). This module effectively decouples the input multimodal images $I_{ir}$/$I_{vi}$ into the low-rank feature $L$ and the sparse feature $S$. The low-rank feature $L$ captures globally shared structural information across modalities, including low-frequency components such as illumination distribution and object contours, through nuclear norm constraints. Meanwhile, the sparse feature $S$ extracts modality-specific local detail features, such as texture details and edge information, based on $L_1$ norm constraints.
Drawing on LRRNet's Low-Rank Sparse Decomposition framework, our LRSD extracts low-rank features ($L$) via nuclear norm constraints; these inherently correspond to cross-modal consistent global structures (e.g., shared object contours, scene layouts, and other low-frequency components in infrared and visible images), and the nuclear norm minimization simultaneously suppresses random noise and background redundancy. It extracts sparse features ($S$) through $L_1$ norm constraints, specifically capturing modality-specific local details (e.g., visible-light textures, infrared thermal target edges, and other high-frequency components); the $L_1$ sparsity constraint ensures that only high-information details are retained. This separation is a mathematically precise division rather than an empirical selection, enabling deep networks to directly receive structurally clear, noise- and redundancy-free feature inputs without having to learn to distinguish global and local information from mixed features, which significantly reduces learning complexity.
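For intuition, the following is a simplified numerical sketch of the decomposition objective above, assuming fixed dictionaries $D_1$, $D_2$ and a penalized alternating proximal scheme (singular-value thresholding for the nuclear norm, soft thresholding for the $L_1$ norm). The learned LRSD module follows LRRNet and optimizes its dictionaries end to end; this sketch only illustrates the decomposition principle, and the step size, penalty weight, and iteration count are illustrative.

```python
# Simplified sketch of min ||L||_* + lam*||S||_1  s.t.  X = D1 @ L + D2 @ S,
# solved approximately with alternating proximal gradient steps.
import torch


def soft_threshold(x, tau):
    """Element-wise soft thresholding: proximal operator of the L1 norm."""
    return torch.sign(x) * torch.clamp(x.abs() - tau, min=0.0)


def svt(x, tau):
    """Singular-value thresholding: proximal operator of the nuclear norm."""
    u, s, vh = torch.linalg.svd(x, full_matrices=False)
    return u @ torch.diag(soft_threshold(s, tau)) @ vh


def lrsd(X, D1, D2, lam=0.1, step=0.5, n_iter=50):
    """Decompose X ~= D1 @ L + D2 @ S with fixed dictionaries D1, D2."""
    L = torch.zeros(D1.shape[1], X.shape[1])
    S = torch.zeros(D2.shape[1], X.shape[1])
    for _ in range(n_iter):
        # Low-rank update: gradient step on the data term, then SVT.
        grad_L = D1.T @ (D1 @ L + D2 @ S - X)
        L = svt(L - step * grad_L, step)
        # Sparse update: gradient step on the data term, then soft threshold.
        grad_S = D2.T @ (D1 @ L + D2 @ S - X)
        S = soft_threshold(S - step * grad_S, step * lam)
    return L, S
```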
Compared to existing approaches such as LRRNet [40], which adopt a single-feature decomposition strategy, the innovation of this study lies in constructing a hierarchical triple-stream fusion architecture. This architecture first performs primary feature decoupling through the LRSD module to generate the low-rank and sparse feature pair $L$, $S$. These decoupled features are then combined with the shared feature $C$ extracted from the original input to form a three-channel feature group $L$, $S$, $C$, which serves as the input for the subsequent feature interaction stage. This design offers three significant advantages. First, retaining the original shared feature $C$ ensures information integrity, effectively preventing potential information loss during feature decoupling. Second, explicitly separating the structural feature $L$ and the detail feature $S$ greatly enhances the interpretability of the feature representation. Finally, this structured feature primitive design provides a solid foundation for subsequent multi-stage feature interactions (see the ablation study in Section 4.6 for details).
The experimental validation fully demonstrates the effectiveness of this design. Quantitative and qualitative analyses indicate that our method outperforms baseline models in terms of structural consistency and detail preservation. Moreover, the ablation studies in Section 4.6 further validate the synergistic effects of the $L$, $S$, $C$ triple-stream features, particularly the progressive optimization observed during multi-stage feature interactions, which provides important insights into the internal mechanisms of the model.

3.2.2. Shared Feature Extractor (SFE) Module

To effectively extract the common feature representations of cross-modal data, this study proposes a multi-stage Shared Feature Extractor (SFE). As shown in Figure 2, this module is composed of three components: feature projection, spatial encoding, and attention enhancement. Through these three components, effective fusion and representation learning of cross-modal features are achieved. The processing pipeline is as follows:
Feature Projection Stage: First, the infrared image feature $I_{ir}$ and the visible image feature $I_{vis}$ are concatenated along the channel dimension to form the initial fused feature $I_c$:
$I_c = \mathrm{Concat}_{channel}(I_{ir}, I_{vis})$
where $\mathrm{Concat}_{channel}(\cdot)$ denotes the concatenation operation along the channel dimension.
Subsequently, a 1 × 1 convolution is applied to $I_c$ to adaptively adjust the feature dimensions, followed by a batch normalization (BatchNorm) layer that ensures the distribution consistency of the multi-modal input features. This process can be expressed as
$F_{proj} = \mathrm{ReLU}(\mathrm{BN}(C_{1\times 1}(I_c)))$
where $C_{1\times 1}(\cdot)$ denotes the 1 × 1 convolution operation, $\mathrm{ReLU}(\cdot)$ is the rectified linear unit activation function, and $\mathrm{BN}(\cdot)$ denotes the batch normalization operation.
Spatial Encoding Stage: $F_{proj}$ is fed into a 3 × 3 convolutional layer; the larger receptive field of the 3 × 3 kernel captures richer spatial context information. The result is then processed by a batch normalization layer and a ReLU activation function. A residual connection to $F_{proj}$ is also introduced, which helps retain feature information, prevents information loss during processing, and enhances the stability and learning ability of the model, yielding the feature $F_{emb}$:
$F_{emb} = \mathrm{ReLU}(\mathrm{BN}(C_{3\times 3}(F_{proj}))) + F_{proj}$
where $C_{3\times 3}(\cdot)$ denotes the 3 × 3 convolution operation.
Attention Enhancement Stage: After spatial encoding, an improved self-attention mechanism is introduced. It adopts a multi-head attention structure, which effectively enhances the feature representation ability. The attention is computed as
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)V$
where $Q$, $K$, and $V$ are the query, key, and value matrices, respectively, and $d_k$ is the dimension of the key matrix $K$.
Moreover, to further improve the feature extraction, another residual connection is added. The resulting shared feature $C$ can be expressed as
$C = \mathrm{SelfAttention}(F_{emb}) + F_{emb}$
where $\mathrm{SelfAttention}(\cdot)$ denotes the self-attention mechanism. Through this residual connection, the model can learn features at different levels more comprehensively, improving the accuracy and effectiveness of feature extraction.
This module realizes feature fusion from local to global through multi-stage feature transformations. While completely retaining the unique information of each modality, it efficiently extracts the shared feature representations across modalities. These shared features supply the common characteristics of the infrared and visible images to the feature interaction in Phase 2(a), helping that stage explore the internal relationships between the two modalities more deeply and thus further improving the quality and effect of image fusion.
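A minimal PyTorch sketch of the SFE described above is given below: channel concatenation, a 1 × 1 projection, 3 × 3 spatial encoding with a residual connection, and multi-head self-attention over spatial tokens with a second residual. The channel count and number of heads are illustrative assumptions, not values reported in the paper.

```python
import torch
import torch.nn as nn


class SharedFeatureExtractor(nn.Module):
    def __init__(self, channels=64, num_heads=4):
        super().__init__()
        # Feature projection: 1x1 conv + BN + ReLU on the concatenated input.
        self.proj = nn.Sequential(
            nn.Conv2d(2, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Spatial encoding: 3x3 conv + BN + ReLU, added back residually.
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Attention enhancement: multi-head self-attention over spatial tokens.
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, ir, vi):
        x = torch.cat([ir, vi], dim=1)             # I_c
        f_proj = self.proj(x)                      # F_proj
        f_emb = self.spatial(f_proj) + f_proj      # F_emb (residual)
        b, c, h, w = f_emb.shape
        tokens = f_emb.flatten(2).transpose(1, 2)  # (B, H*W, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        c_feat = attn_out + tokens                 # second residual connection
        return c_feat.transpose(1, 2).reshape(b, c, h, w)   # shared feature C
```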

3.3. Phase 2(a) Feature Interaction

In this stage, based on the low-rank features $L$, sparse features $S$, and shared features $C$ extracted in the first stage, we first employ a pair of Bi-Feature Interaction (BFI) modules to dynamically fuse the low-rank and sparse features. One module is dominated by the low-rank features and the other by the sparse features, generating the preliminary fused features $F_{BI}^{L}$ and $F_{BI}^{S}$. Subsequently, a Tri-Feature Interaction (TFI) module deeply integrates $F_{BI}^{L}$, $F_{BI}^{S}$, and the shared features $C$ to produce the final fused feature $F_{TI}$. This feature not only incorporates global structural information and local detail features but also combines the fundamental shared characteristics of the infrared and visible modalities, achieving refined enhancement of cross-modal features. This process provides highly expressive, multi-scale fused feature representations for the subsequent feature reconstruction stage.

3.3.1. Bi-Feature Interaction

Inspired by relevant research findings [17], and based on the distinctive characteristics of this study, we innovatively designed the Bi-Feature Interaction (BFI) module and the Tri-Feature Interaction (TFI) module. There are two BFI modules, which perform symmetric feature interaction operations on the low-rank feature L and the sparse feature S, respectively. Given the symmetry of the operations of the two modules, this paper only elaborates on the interaction process of the low-rank feature L in detail, and the other BFI module will carry out a symmetric operation. The specific details are shown in Figure 3.
First, the low-rank feature $L$ and the sparse feature $S$ are concatenated along the channel dimension to generate the feature $LS_{cat}$:
$LS_{cat} = \mathrm{Concat}_{channel}(L, S)$
where $\mathrm{Concat}_{channel}(\cdot)$ denotes the concatenation operation along the channel dimension.
For this BFI module, we set $Q = L$, $K = S$, and $V = LS_{cat}$. First, an element-wise addition of $L$ and $S$ yields $qk$. Then, $qk$ passes through an encoding layer consisting of a 1 × 1 convolution, batch normalization (BatchNorm), a rectified linear unit activation function (ReLU), a 3 × 3 convolution, and another batch normalization, producing $W$:
$W = \mathrm{BN}(\mathrm{Conv}_{3\times 3}(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{1\times 1}(qk)))))$
where $\mathrm{Conv}_{1\times 1}(\cdot)$ and $\mathrm{Conv}_{3\times 3}(\cdot)$ denote the 1 × 1 and 3 × 3 convolution operations, $\mathrm{BN}(\cdot)$ denotes batch normalization, and $\mathrm{ReLU}(\cdot)$ is the rectified linear unit activation function.
Subsequently, $W$, $V$, and $Q$ are added element-wise, and the result passes through a projection layer composed of a 3 × 3 convolution, batch normalization, and a rectified linear unit activation function, producing the final output $F_{BI}^{L}$:
$F_{BI}^{L} = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{3\times 3}(W + V + Q)))$
After the symmetric operation, $F_{BI}^{S}$ is obtained. Through this series of BFI operations, the low-rank feature $L$ and the sparse feature $S$ are initially fused. This fusion lays an important foundation for deeper feature processing and analysis in the subsequent stages, helps to explore the potential relationships between features, and provides solid feature support for the subsequent tasks.
This process captures the dependencies between different modal features through a dynamic attention mechanism and combines static information to generate comprehensive feature representations, significantly enhancing the model’s expressive power in multimodal tasks.
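The following PyTorch sketch implements one BFI branch (the low-rank-dominated one) according to the equations above, with $Q = L$, $K = S$, and $V = LS_{cat}$. The 1 × 1 reduction of $V$ back to $C$ channels before the element-wise sum is an assumption introduced here so that the tensor shapes match; it is not spelled out in the text, and the channel width is likewise illustrative.

```python
import torch
import torch.nn as nn


class BiFeatureInteraction(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        # Encoder applied to q + k: 1x1 conv, BN, ReLU, 3x3 conv, BN.
        self.encode = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        # Assumed 1x1 reduction of V = Concat(L, S) back to C channels.
        self.reduce_v = nn.Conv2d(2 * channels, channels, kernel_size=1)
        # Projection layer: 3x3 conv, BN, ReLU.
        self.project = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, q, k):
        v = self.reduce_v(torch.cat([q, k], dim=1))   # V from Concat(L, S)
        w = self.encode(q + k)                        # W from qk = q + k
        return self.project(w + v + q)                # F_BI^L


# The sparse-dominated branch is symmetric: BiFeatureInteraction()(S, L).
```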

3.3.2. Tri-Feature Interaction

Similar to the BFI module, the TFI module is also dedicated to feature fusion, but its fusion objects are more complex. As shown in Figure 3, this module takes the preliminarily fused low-rank and sparse features $F_{BI}^{L}$ and $F_{BI}^{S}$ obtained from the BFI modules, as well as the common feature $C$ of infrared and visible light acquired in the first stage, as the basis for further fusion operations.
First, $F_{BI}^{L}$, $F_{BI}^{S}$, and $C$ are concatenated along the channel dimension to obtain $CLS_{cat}$:
$CLS_{cat} = \mathrm{Concat}_{channel}(C, F_{BI}^{L}, F_{BI}^{S})$
For the TFI module, we set $Q = F_{BI}^{L}$, $K = F_{BI}^{S}$, and $V = C$. First, $Q$ and $K$ are each passed through the same encoding process, consisting of a 1 × 1 convolution, batch normalization (BatchNorm), a rectified linear unit activation function (ReLU), a 3 × 3 convolution, and another batch normalization:
$X_i = \mathrm{BN}(\mathrm{Conv}_{3\times 3}(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{1\times 1}(x_i)))))$
where $x_i$, $i \in \{1, 2\}$, denotes $Q$ and $K$, respectively. The encoded $Q$ and $K$ are then added element-wise to obtain $QK$.
Meanwhile, $V$ is processed by another encoding layer composed of a 3 × 3 convolution, batch normalization, and a rectified linear unit activation function, yielding $W$:
$W = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{3\times 3}(V)))$
Finally, $W$, $QK$, and $CLS_{cat}$ are added element-wise, and the result is fed into a projection layer composed of a 3 × 3 convolution, batch normalization, and a rectified linear unit activation function to obtain the final output $F_{TI}$:
$F_{TI} = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{3\times 3}(W + QK + CLS_{cat})))$
Through this series of TFI operations, the low-rank feature $L$, the sparse feature $S$, and the common feature $C$ achieve a deeper fusion. The resulting output $F_{TI}$ integrates the key salient features and rich detail features of infrared and visible light, providing highly valuable feature input for the subsequent feature reconstruction stage. These fused features allow the subsequent processing to restore the real information of the image more accurately, improve the quality of image fusion, and enhance the performance of the model in related tasks, playing a crucial supporting role in the overall pipeline.
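A corresponding PyTorch sketch of the TFI module is shown below, following the equations above with $Q = F_{BI}^{L}$, $K = F_{BI}^{S}$, and $V = C$. As in the BFI sketch, a 1 × 1 reduction of $CLS_{cat}$ back to $C$ channels is assumed so that the element-wise sum is well defined; the channel width is illustrative.

```python
import torch
import torch.nn as nn


class TriFeatureInteraction(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        def encoder():
            # Shared encoding layout for Q and K: 1x1 conv, BN, ReLU, 3x3 conv, BN.
            return nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels),
            )
        self.encode_q = encoder()
        self.encode_k = encoder()
        # Encoder for V = C: 3x3 conv, BN, ReLU.
        self.encode_v = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Assumed 1x1 reduction of Concat(C, F_BI^L, F_BI^S) back to C channels.
        self.reduce_cat = nn.Conv2d(3 * channels, channels, kernel_size=1)
        self.project = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_bi_l, f_bi_s, c):
        cls_cat = self.reduce_cat(torch.cat([c, f_bi_l, f_bi_s], dim=1))
        qk = self.encode_q(f_bi_l) + self.encode_k(f_bi_s)   # QK
        w = self.encode_v(c)                                  # W
        return self.project(w + qk + cls_cat)                 # F_TI
```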

3.4. Phase 2(b) Feature Reconstruction

Based on the feature extraction network and the feature fusion mechanism designed in the previous two stages, we obtain the high-quality fused feature $F_{TI}$. This feature organically integrates the core features of the prominent targets in the infrared images with the rich texture details of the visible images. It not only preserves the key target information of the infrared modality but also deeply fuses the fine spatial structure of the visible modality, laying a solid foundation for the subsequent image reconstruction.
In this stage, we leverage the powerful feature representation ability of the Swin Transformer to further explore and enhance the complementary bimodal information contained in $F_{TI}$. The window-based self-attention mechanism of the Swin Transformer enables efficient collaboration between global semantics and local details [41], its hierarchical feature processing architecture maintains the integrity of multi-scale features, and the shifted-window operation further improves the efficiency of cross-region feature interaction. Through the carefully optimized $\mathrm{Conv}_{out}$ module, the final fused image demonstrates excellent performance in terms of target saliency, detail richness, and visual naturalness: the contours and features of infrared targets are accurately preserved with clear, sharp boundaries; the texture details of the visible images are finely presented and the background information is completely restored; and the color transitions in the image are natural and smooth, consistent with human visual perception. The implementation can be described by the following formula:
$I_f = \mathrm{Conv}_{out}(\mathrm{Swin}(F_{TI}))$
where $\mathrm{Conv}_{out}(\cdot)$ denotes the output convolution operation and $\mathrm{Swin}(\cdot)$ denotes the feature processing performed by the Swin Transformer. Extensive experimental results show that the proposed method stably generates high-quality fused images with rich details, prominent targets, and natural visual effects in complex, changeable real-world scenarios. Whether facing the subtle texture features of indoor scenes or the drastic lighting changes of outdoor scenes, it significantly outperforms existing mainstream methods in both subjective visual evaluation and objective quantitative metrics. This end-to-end, multi-stage, multi-feature decoding architecture not only achieves efficient mapping of the feature space but also, by virtue of the strong representation ability of the deep network, ensures the semantic consistency and visual comfort of the fused image.
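The reconstruction head can be sketched as follows in PyTorch, assuming an externally provided stack of Swin Transformer blocks that maps (B, C, H, W) feature maps to feature maps of the same shape; the internal layout of $\mathrm{Conv}_{out}$ and the output activation are assumptions for illustration.

```python
import torch.nn as nn


class Reconstructor(nn.Module):
    def __init__(self, swin_blocks: nn.Module, channels=64):
        super().__init__()
        # Any Swin Transformer implementation that keeps the (B, C, H, W)
        # feature shape can be plugged in here.
        self.swin = swin_blocks
        # Lightweight conv_out: project encoded features back to image space.
        self.conv_out = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 2, 1, kernel_size=3, padding=1),
            nn.Tanh(),  # assumed output activation; keeps pixel values bounded
        )

    def forward(self, f_ti):
        return self.conv_out(self.swin(f_ti))   # I_f = Conv_out(Swin(F_TI))
```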

3.5. The Loss Function

The multi-level and multi-feature network we designed demonstrates powerful information integration potential by performing multi-scale and multi-dimensional feature extraction and fusion on infrared and visible light images. However, the complex architecture of this network and the multi-level feature interactions also pose extremely high requirements for the optimization direction during the training process. Traditional single loss functions are difficult to precisely adapt to the characteristics of the network and cannot fully exploit its performance advantages. Therefore, based on the unique structural and functional requirements of the network, we have custom-designed a multi-objective optimization function that combines content loss and feature loss. The total loss function is defined as follows:
$L = w_1 \times L_{cos} + w_2 \times L_{cha} + w_3 \times L_{ssim}$
Here, $L_{cos}$, $L_{cha}$, and $L_{ssim}$ denote the cosine similarity loss, the Charbonnier loss, and the structural similarity-based content loss, respectively; $w_1$, $w_2$, and $w_3$ are weight parameters, carefully tuned through extensive experiments to balance the contributions of the different loss terms and achieve collaborative optimization. During the feature extraction of the first stage, the common features of the infrared and visible images are effectively separated, and these features serve as the key foundation for the subsequent fusion. The cosine similarity loss ($L_{cos}$) is the core tool for optimizing the common features extracted at this stage. Its formula is as follows:
$L_{cos} = 1 - \dfrac{I_x \cdot C}{\|I_x\|\,\|C\|}$
where $I_x$, $x \in \{vi, ir\}$, denotes the visible and infrared source images. This loss maximizes the cosine similarity between the feature vectors, precisely adjusting the representation directions of the common features $C$ extracted in the first stage and ensuring that the two types of features are closely aligned in the feature space. This optimization not only enhances the consistency of the common features but also lays a solid foundation for the subsequent multi-level feature fusion and information integration [42]. Moving to the pixel level, noise, outliers, and other interference inevitably exist in real data and challenge the accuracy and stability of the fusion results. The Charbonnier loss ($L_{cha}$) plays a crucial role here. Its expression is as follows:
$L_{cha} = \sum_{i=1}^{n} \sqrt{(I_f - I_x)^2 + \epsilon^2}$
where $\epsilon$ is a very small constant. With its smooth gradients, this loss function effectively reduces the negative impact of outliers on the training process, significantly enhancing the robustness of the model and ensuring that the final fusion result maintains high precision and reliability in complex scenarios [43]. Beyond feature- and pixel-level details, the global structural information of the images is also an important indicator of fusion quality. The structural similarity-based content loss ($L_{ssim}$) addresses this aspect. Its formula is as follows:
$L_{ssim} = 1 - \dfrac{(2\mu_x\mu_f + C_1)(2\sigma_{xf} + C_2)}{(\mu_x^2 + \mu_f^2 + C_1)(\sigma_x^2 + \sigma_f^2 + C_2)}$
where $\mu_x$ and $\mu_f$ denote the means of the source image and the fused image, $\sigma_x$ and $\sigma_f$ their standard deviations, $\sigma_{xf}$ their covariance, and $C_1$ and $C_2$ are small constants used to ensure numerical stability and avoid division by zero. By jointly considering image brightness, contrast, and structure, this loss accurately measures the similarity between images, effectively ensuring that the fusion result preserves the global structural information of the original images and making the fused image better meet practical requirements at both the visual and semantic levels [44].
In conclusion, the multi-objective optimization loss function we designed collaboratively optimizes the network training process from multiple dimensions, including feature alignment, anti-interference ability, and structure preservation, by organically combining three loss terms with their respective advantages. This approach fully conforms to the characteristics of multi-level and multi-feature networks, providing strong technical support for the efficient fusion of infrared and visible light images.
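A PyTorch sketch of this composite objective is given below. The SSIM term assumes an external ssim() helper (for example, from the pytorch_msssim package), the channel-mean reduction of the shared feature before the cosine term is an assumption made here so that shapes match, and the weights and the averaging over the two source images are illustrative choices rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # assumed external SSIM helper


def cosine_loss(shared_c, img):
    # Assumption: reduce the shared feature C to one channel (channel mean)
    # so it can be compared with the single-channel source image.
    a = shared_c.mean(dim=1, keepdim=True).flatten(1)
    b = img.flatten(1)
    return (1.0 - F.cosine_similarity(a, b, dim=1)).mean()


def charbonnier_loss(fused, img, eps=1e-6):
    return torch.sqrt((fused - img) ** 2 + eps ** 2).mean()


def total_loss(fused, ir, vi, shared_c, w=(1.0, 1.0, 1.0)):
    l_cos = cosine_loss(shared_c, ir) + cosine_loss(shared_c, vi)
    l_cha = charbonnier_loss(fused, ir) + charbonnier_loss(fused, vi)
    l_ssim = (1.0 - ssim(fused, ir, data_range=1.0)) + \
             (1.0 - ssim(fused, vi, data_range=1.0))
    return w[0] * l_cos + w[1] * l_cha + w[2] * l_ssim
```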

4. Experiments

To validate the effectiveness of the proposed algorithm, this study conducted experiments on multiple public datasets, strictly adhering to uniform experimental settings to ensure the comparability and reliability of the results.

4.1. Training Dataset and Settings

All experiments are implemented on an NVIDIA GeForce RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA) with the PyTorch 2.0.1 framework. This study used only the KAIST dataset [45] to train the proposed algorithm. As a classic benchmark in the field of multi-modal imaging, the KAIST dataset contains 95,000 pairs of precisely aligned visible and infrared images. Its samples cover diverse scenarios such as complex urban streets and dim indoor environments, as well as a full range of lighting conditions including day-night alternation and sunny-cloudy changes, providing sufficient and representative sample support for model training. In the implementation, 20,000 image patch pairs of size 128 × 128 were randomly cropped from this dataset to form the training set, and the original RGB color space was converted to grayscale. This preprocessing strategy reduces the computational complexity and eliminates color interference.
Training Parameters: For model training, we employed the Adam optimizer with first-order momentum coefficient $\beta_1 = 0.95$ and second-order momentum coefficient $\beta_2 = 0.999$. The initial learning rate was set to $5 \times 10^{-4}$, and the numerical stability constant was $\varepsilon = 10^{-8}$. Using the mini-batch gradient descent strategy, the batch size was 16. Training lasted for 20 epochs to ensure full convergence.
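The reported settings translate directly into the following PyTorch training-loop sketch. The names build_tfi_fusion, KAISTPatchDataset, and total_loss are hypothetical placeholders for the model, the 128 × 128 grayscale patch dataset, and the composite loss of Section 3.5; they are not published code.

```python
# Training-loop sketch using the reported hyperparameters: Adam with
# beta1 = 0.95, beta2 = 0.999, lr = 5e-4, eps = 1e-8, batch size 16, 20 epochs.
import torch
from torch.utils.data import DataLoader

model = build_tfi_fusion().cuda()          # hypothetical model factory
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4,
                             betas=(0.95, 0.999), eps=1e-8)
loader = DataLoader(KAISTPatchDataset(),   # hypothetical 128x128 patch dataset
                    batch_size=16, shuffle=True, num_workers=4)

for epoch in range(20):
    for ir, vi in loader:
        ir, vi = ir.cuda(), vi.cuda()
        fused = model(ir, vi)
        shared_c = model.sfe(ir, vi)       # shared feature C for the cosine term
        loss = total_loss(fused, ir, vi, shared_c)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```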

4.2. Testing Datasets

Although only the KAIST dataset was used during training, multiple public datasets, including MSRS [46], TNO [47], and RoadScene [48], were introduced in the testing phase to comprehensively and objectively evaluate the generalization performance and practical effectiveness of the algorithm. These datasets cover different application areas, including multi-spectral scenarios, complex lighting conditions, and road traffic, systematically testing the algorithm from multiple dimensions and ensuring that the evaluation results are both reliable and comprehensive.
  • KAIST Test Set: Two hundred pairs of images not used in training were randomly selected from the KAIST dataset for testing. These image pairs have a resolution of 640 × 512 and cover various complex scenes, effectively validating the model’s performance within the training data distribution.
  • TNO Dataset: This dataset contains aligned multispectral nighttime images of different military-related scenes, with resolutions ranging from 280 × 280 to 768 × 576. The test set used in this study includes 20 image pairs, which are employed to evaluate the model’s performance under low-light and complex environmental conditions.
  • RoadScene Dataset: This dataset consists of 221 aligned image pairs with resolutions as high as 563 × 459, captured using real-world cameras. These images were selected from the FLIR dataset provided by Xu et al. [48], enabling the validation of the model’s generalization capability in practical application scenarios.

4.3. Quantitative Experiments

In this section, the study evaluates the proposed model on three test sets and compares the fusion results with state-of-the-art methods, including CMT [3], DATFuse [49], ITFuse [17], LRRNet [40], U2Fusion [13], YDTR [50] and SFDFusion [51].
This study quantitatively evaluates the fusion results with the following metrics: NMI [52] measures the information correlation between the fused image and the source images, with larger values indicating better fusion; MS-SSIM [53] assesses structural similarity at multiple scales, with higher values indicating higher fusion quality; EN reflects the information richness of the image, with larger values implying more information; Qabf evaluates fusion quality in terms of how well edge information from the source images is retained, with higher values indicating better quality; SD quantifies the dispersion of pixel intensity values, with higher values indicating richer image details and better quality.
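For reference, two of the simpler reference-free metrics above, EN (entropy) and SD (standard deviation), can be computed as in the sketch below for an 8-bit grayscale fused image; NMI, MS-SSIM, and Qabf additionally require the source images and are omitted here.

```python
import numpy as np


def entropy(img_u8):
    """EN: Shannon entropy of an 8-bit grayscale image, in bits."""
    hist, _ = np.histogram(img_u8, bins=256, range=(0, 255))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


def std_dev(img_u8):
    """SD: standard deviation of pixel intensities."""
    return float(np.std(img_u8.astype(np.float64)))
```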
From the experimental results presented in Table 1, Table 2 and Table 3, the following conclusions can be drawn regarding the proposed infrared and visible image fusion algorithm: it exhibits excellent performance in almost all metrics across the three typical scenarios. The algorithm not only adapts well to the specific requirements of different scenarios but also maintains consistent and stable fusion performance throughout. This fully proves that the proposed algorithm has strong generality and practicality in infrared and visible image fusion tasks, and is capable of meeting the demands of diverse application scenarios.
To present the quantitative analysis results more intuitively, we have integrated and plotted the core performance indicators of various models into a radar chart, as shown in Figure 4. From the chart, one can not only clearly compare the performance differences of different models across various indicators but also directly observe that compared with SFDFusion, the currently optimal-performing model, our proposed model has a minimal overall performance gap, and even demonstrates a slight advantage in some key indicators.

4.4. Qualitative Experiments

In the qualitative experiments, three typical scenarios were selected for testing (Figure 5 presents a daytime scenario with a vehicle parked by the roadside, where visible light contains rich texture information; Figure 6 shows a nighttime scenario focusing on a road area, where infrared details are more prominent; Figure 7 depicts a jungle scenario, where only infrared can clearly reveal the human figure obscured by trees). Significant differences were observed in the fusion performance of the different methods, as detailed below:
LRRNet (the basis for the design of our LRSD module): Obvious limitations existed across all three scenarios. In the daytime scenario (Figure 5), it failed to utilize information such as the vehicle’s body texture and window edges under visible light, resulting in an overall dim image. In the nighttime scenario (Figure 6), it could not highlight the thermal targets on the road captured by infrared (e.g., residual heat sources on the road surface, outlines of distant vehicles), and the output was nearly “a mass of black”. In the jungle scenario (Figure 7), it could only barely distinguish the blurry outline of the human figure obscured by trees; neither the infrared target information nor the visible light background information was effectively preserved. Its global contrast and detail presentation capabilities were insufficient, and it failed to achieve effective complementation of infrared and visible light information throughout all scenarios.
CMT, ITFuse, U2Fusion, and YDTR: These methods exhibited relatively better fusion effects but insufficient scenario adaptability. In the daytime scenario (Figure 5), they tended to blur fine details such as the vehicle's body lines and wheel hub textures. In the nighttime scenario (Figure 6), brightness imbalance occurred: either the dark areas of the road were underexposed or thermal targets (e.g., heat sources on the road surface) were over-saturated. In the jungle scenario (Figure 7), they could not clearly separate the infrared human figure from the visible-light tree background, leading to blurry human outlines. Additionally, brightness discontinuities often appeared in brightness transition areas, such as the boundary between the vehicle and the road surface in Figure 5 and the edge between the trees and the human figure in Figure 7, making it difficult to achieve natural illumination consistency.
DATFuse: This method had advantages in scenario adaptability but showed distinct scenario-specific flaws. It performed well in the nighttime scenario (Figure 6), preserving the details of road thermal targets under infrared and preventing dark-area information from being obscured; however, overexposure occurred, causing detail loss in the tree branch areas. In the daytime scenario (Figure 5), the body of the white vehicle was overexposed, and local details such as the vehicle's edge and window outline were lost due to brightness overflow. In the jungle scenario (Figure 7), although it highlighted the human figure, it blurred the edge details of the tree branches, resulting in insufficient overall brightness balance and detail integrity.
Our proposed model: Compared with SFDFusion (the best-performing comparative model), our model achieved a visually distinguishable performance lead across all three scenarios. In the daytime scenario (Figure 5), it fully preserved the body texture and wheel hub details of the white vehicle under visible light without overexposure. In the nighttime scenario (Figure 6), it clearly highlighted the infrared road thermal targets while keeping the dark road areas from becoming overly dim. In the jungle scenario (Figure 7), it accurately separated the human figure obscured by trees from the background, balancing the clarity of the infrared human figure and the natural appearance of the visible-light vegetation. Furthermore, in the key areas marked by yellow boxes in the qualitative results (e.g., the fine texture on the vehicle body in Figure 5, the edge of the road thermal targets in Figure 6, and the human outline in Figure 7), the restoration and clarity of local details achieved by our model were slightly superior to those of SFDFusion, which further verifies the advantages of our model in detail enhancement and information complementation across multiple scenarios.
Figure 5. Qualitative comparison of the fusion results on the MSRS dataset. Yellow boxes mark comparison regions that highlight the key differences or core feature regions between different fusion results.
Figure 6. Qualitative comparison of the fusion results on the RoadScene dataset. Yellow boxes mark comparison regions that highlight the key differences or core feature regions between different fusion results.
Figure 7. Qualitative comparison of the fusion results on the TNO dataset. Yellow boxes mark comparison regions that highlight the key differences or core feature regions between different fusion results.
The visual representations unequivocally illustrate these disparities. In marked contrast, our proposed TFI methodology outshines the existing approaches across all three datasets. TFI is capable of effectively preserving and enhancing image details while realizing a profound and harmonious fusion of infrared and visible-light information. The resultant images feature optimized luminance and contrast, exude natural visual qualities, and closely conform to human visual perception. With enhanced three-dimensional rendering, objects in the images appear to be more realistic and distinct. Consequently, TFI emerges as a highly robust solution, furnishing a solid foundation of high-quality data for subsequent analytical tasks, including image analysis and object detection.

4.5. Efficiency Experiments

In addition, we compare the inference time, parameter count, and computational load of our method with those of the other methods, as shown in Table 4. Our model demonstrates excellent performance in both the quantitative and qualitative experiments, effectively accomplishing the target task and reaching a high level within the field. However, compared horizontally with the equally strong SFDFusion model, it still has obvious shortcomings: its parameter count is larger, which leads to higher memory occupancy, and its inference time is longer, making it slightly less competitive in scenarios with strict real-time requirements. We regard the redundant parameters and insufficient inference efficiency as key directions for future work. We plan to further optimize the model through lightweight network design and model compression techniques, aiming to enhance its deployment flexibility and practical value while preserving its performance.

4.6. Ablation Study

To fully validate the effectiveness of the proposed multi-feature, hierarchical network architecture, we designed and conducted a series of comprehensive ablation experiments on the validation set. Specifically, the following five targeted modifications were made to the proposed network: (1) One-stream input: remove branches L and S from the network, leaving only branch C, to test performance when all input information is processed through a single channel (for the BFI and TFI modules, if a required feature stream is missing, the feature stream to which the module itself belongs is used as a substitute). (2) Two-stream input: delete branch C and feed the two input modalities into the network independently, to explore the impact of processing each modality separately (missing feature streams in the BFI and TFI modules are handled as in the one-stream test). (3) Without LRSD: replace the LRSD module with a convolutional block at its original position and discard the L and S features, to evaluate the importance of low-rank sparse features. (4) Without BFI: replace the BFI module with a concatenation block to determine its function in feature fusion. (5) Without TFI: similarly, replace the TFI module with a concatenation operation to further clarify its role in the network.
Table 5 reports the quantitative results of the proposed method under the different network architectures, allowing the contribution of each module to the network performance to be analyzed.
Comparing the results of "One-stream input", "Two-stream input", "Without LRSD", "Without BFI", "Without TFI", and the complete network ("TotalNet") leads to several observations: the one-stream and two-stream inputs perform worst overall; the network retains reasonable performance after the LRSD module is removed; the BFI module is crucial for feature fusion; the TFI module affects the overall performance; and TotalNet, through the coordination of all modules, achieves the best results in information fusion and image quality, fully validating the effectiveness of the multi-feature, hierarchical architecture. This architecture achieves effective feature extraction and fusion through the coordinated design of its modules, processes information at different levels, and improves the overall performance and generalization ability of the network.

5. Discussion

This paper presents a complete line of work, from theoretical construction to experimental verification, on the design and application of a multi-feature, hierarchical network architecture. We first analyzed the current state of research and the open problems in the field, and defined the core objective of designing a new network architecture that strengthens information fusion and processing. We then detailed the design principles and theoretical basis of the core modules, namely BFI, TFI, and LRSD, laying the foundation for the subsequent experiments.
In the experimental verification phase, we evaluated the proposed architecture on multiple public datasets. Compared with existing methods, the architecture shows clear advantages in most key performance indicators, which preliminarily verifies its effectiveness. To further examine the role of each module, we designed five ablation schemes and quantitatively analyzed the model under different network structures. The results show that the complete architecture clearly outperforms the simplified or modified variants in information fusion and image quality metrics, confirming the soundness and practicality of the multi-feature, hierarchical design.
In summary, through rigorous research design, reliable implementation, and sufficient experimental verification, this study constructs and validates an efficient multi-feature, hierarchical network architecture. The work offers a new design option for related fields and a basis for extending the method to application scenarios such as image analysis and object detection. Future research will focus on optimizing and adapting the architecture to complex scenarios.

6. Summary

This study addresses two core problems in infrared and visible image fusion: insufficient feature interaction and loss of detail. Using the multi-feature, hierarchical network architecture as its core, it covers the full process from theoretical design to experimental verification. By analyzing the bottlenecks of existing fusion methods, such as over-reliance on single-scale feature fusion and the difficulty of balancing global structural integrity with local detail richness, the study arrives at the design direction of strengthening cross-modal feature interaction through a hierarchical mechanism and balancing global and local consistency via a hybrid architecture.
The network makes two key contributions. First, the proposed hierarchical feature interaction mechanism (integrating the BFI and TFI modules) realizes dynamic complementarity among low-rank global features, sparse local features, and cross-modal shared features. Unlike methods that simply concatenate features or weight them with fixed coefficients, this mechanism adaptively adjusts the strength of feature interaction through cross-attention, mitigating feature redundancy and the loss of key information in multi-modal fusion. Second, the hybrid reconstruction architecture, which combines a Swin Transformer with lightweight convolutions, compensates for the weaknesses of a pure Transformer (high computational cost) and a pure convolutional network (limited ability to capture long-range dependencies), enabling efficient and accurate image reconstruction.
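For intuition, the following sketch shows a generic two-stream cross-attention block of the kind described above. It is an illustrative approximation under assumed tensor shapes and channel sizes, not the paper's exact BFI/TFI implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionInteraction(nn.Module):
    """Illustrative two-stream cross-attention block (not the exact BFI/TFI design).

    Each stream queries the other, so it is refined by the information it lacks;
    the two refined streams are then merged into a single feature map.
    """
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.merge = nn.Linear(2 * dim, dim)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # feat_a, feat_b: (B, N, C) token sequences, e.g. flattened H*W spatial positions
        a_refined, _ = self.attn_a(query=feat_a, key=feat_b, value=feat_b)
        b_refined, _ = self.attn_b(query=feat_b, key=feat_a, value=feat_a)
        return self.merge(torch.cat([a_refined, b_refined], dim=-1))

# Usage on dummy low-rank and sparse token maps (shapes are assumptions):
# low_rank = torch.randn(2, 32 * 32, 64)
# sparse   = torch.randn(2, 32 * 32, 64)
# fused = CrossAttentionInteraction()(low_rank, sparse)   # -> (2, 1024, 64)
```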
Experimental results on several public datasets, including TNO and KAIST, further confirm the advantages of this architecture: compared with current mainstream fusion methods, TFI-Fusion performs better on key evaluation metrics including the Structural Similarity Index (SSIM), Standard Deviation (SD), and Mutual Information (MI). The ablation experiments also verify that each core module is necessary: removing the TFI module leads to a 9% decrease in the clarity of local details, while omitting the BFI module results in an 11% reduction in global structural consistency. These results support the soundness of the multi-feature, hierarchical design.
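For reference, the sketch below shows how three of the metrics reported in Tables 1–3 and 5 (entropy EN, standard deviation SD, and normalized mutual information NMI) can be computed with NumPy on 8-bit grayscale images. Metric definitions vary slightly across the fusion literature, so this follows a common formulation rather than necessarily the exact one used in our evaluation.

```python
import numpy as np

def entropy(img: np.ndarray, bins: int = 256) -> float:
    """Shannon entropy (EN) of an 8-bit grayscale image."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 255))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def standard_deviation(img: np.ndarray) -> float:
    """Standard deviation (SD), reflecting global contrast."""
    return float(img.astype(np.float64).std())

def normalized_mutual_information(a: np.ndarray, b: np.ndarray, bins: int = 256) -> float:
    """NMI between two images, (H(a) + H(b)) / H(a, b), following Studholme et al."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins,
                                 range=[[0, 255], [0, 255]])
    pj = joint / joint.sum()
    pa, pb = pj.sum(axis=1), pj.sum(axis=0)
    h = lambda p: float(-(p[p > 0] * np.log2(p[p > 0])).sum())
    return (h(pa) + h(pb)) / h(pj)
```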
At the same time, this study has two limitations that subsequent work should address:
Limited Adaptability to Complex Imaging Scenarios: The current model has mainly been verified on standard datasets with relatively controllable imaging conditions (e.g., stable illumination and low-noise environments). In real complex scenarios such as heavy haze, extreme low light, and high dynamic range (HDR), the feature decomposition ability of the LRSD module declines significantly: low-rank components are prone to distortion under noise, and sparse components struggle to capture weak salient features such as dim targets in fog. The robustness of the model to complex environmental interference needs further improvement.
High Computational Complexity in Large-Scale Image Fusion: Because of the multi-scale Swin Transformer blocks and the cross-attention mechanism, the model has relatively high computational cost and memory consumption. When processing high-resolution, multi-channel images, its inference time is roughly 2–3 times that of lightweight CNN-based fusion models, and its memory usage can exceed 8 GB. This limits deployment on resource-constrained devices such as embedded systems for UAV reconnaissance, which demand real-time performance and low power consumption.
To address these limitations, future research will focus on two directions: first, introducing adaptive noise suppression and scene-aware decomposition into the LRSD module to improve the accuracy of feature decomposition in complex scenarios; second, designing a lightweight Swin Transformer variant to reduce computational cost and memory consumption while maintaining fusion performance, achieving a balance between performance and efficiency.

Author Contributions

Conceptualization, M.Z.; Methodology, M.Z. and H.L.; Resources, S.S.; Writing—original draft, M.Z.; Writing—review & editing, H.L.; Funding acquisition, S.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Funds for Central-Guided Local Science and Technology Development (Grant No. 202407AC110005, "Key Technologies for the Construction of a Whole-process Intelligent Service System for Neuroendocrine Neoplasm") and by the 2023 Opening Research Fund of the Yunnan Key Laboratory of Digital Communications (YNJTKFB-20230686, YNKLDC-KFKT-202304).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available in KAIST at https://github.com/unizard/kaist-allday-dataset, accessed on 6 August 2025, TNO Image Fusion at https://figshare.com/articles/dataset/tno_image_fusion_dataset/1008029, accessed on 6 August 2025, RoadScene at https://github.com/jiayi-ma/RoadScene, accessed on 6 August 2025 and MSRS at https://github.com/Linfeng-Tang/MSRS, accessed on 6 August 2025.

Conflicts of Interest

Authors Shaochen Su and Hao Li were employed by Yunnan Transportation Engineering Quality Testing Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Meher, B.; Agrawal, S.; Panda, R.; Abraham, A. A survey on region based image fusion methods. Inf. Fusion 2019, 48, 119–132. [Google Scholar] [CrossRef]
  2. Xu, R.; Xiao, Z.; Yao, M.; Zhang, Y.; Xiong, Z. Stereo video super-resolution via exploiting view-temporal correlations. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 460–468. [Google Scholar]
  3. Park, S.; Vien, A.G.; Lee, C. Cross-modal transformers for infrared and visible image fusion. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 770–785. [Google Scholar] [CrossRef]
  4. Zhang, H.; Xu, H.; Tian, X.; Jiang, J.; Ma, J. Image fusion meets deep learning: A survey and perspective. Inf. Fusion 2021, 76, 323–336. [Google Scholar] [CrossRef]
  5. Zhao, Z.; Bai, H.; Zhang, J.; Zhang, Y.; Xu, S.; Lin, Z.; Timofte, R.; Van Gool, L. Cddfuse: Correlation-driven dual-branch feature decomposition for multi-modality image fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5906–5916. [Google Scholar]
  6. Ma, J.; Ma, Y.; Li, C. Infrared and visible image fusion methods and applications: A survey. Inf. Fusion 2019, 45, 153–178. [Google Scholar] [CrossRef]
  7. Wang, J.; Zhou, H. CAIF: Cross-Attention Framework in Unaligned Infrared and Visible Image Fusion. In Proceedings of the 2024 4th International Conference on Computer Science and Blockchain (CCSB), Shenzhen, China, 6–8 September 2024; pp. 266–270. [Google Scholar]
  8. Chen, J.; Li, X.; Luo, L.; Mei, X.; Ma, J. Infrared and visible image fusion based on target-enhanced multiscale transform decomposition. Inf. Sci. 2020, 508, 64–78. [Google Scholar] [CrossRef]
  9. Li, S.; Yang, B.; Hu, J. Performance comparison of different multi-resolution transforms for image fusion. Inf. Fusion 2011, 12, 74–84. [Google Scholar] [CrossRef]
  10. Zhu, Z.; Yin, H.; Chai, Y.; Li, Y.; Qi, G. A novel multi-modality image fusion method based on image decomposition and sparse representation. Inf. Sci. 2018, 432, 516–529. [Google Scholar] [CrossRef]
  11. Zhao, Z.; Xu, S.; Zhang, C.; Liu, J.; Li, P.; Zhang, J. DIDFuse: Deep image decomposition for infrared and visible image fusion. arXiv 2020, arXiv:2003.09210. [Google Scholar]
  12. Cao, X.; Lian, Y.; Wang, K.; Ma, C.; Xu, X. Unsupervised hybrid network of transformer and CNN for blind hyperspectral and multispectral image fusion. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5507615. [Google Scholar] [CrossRef]
  13. Xu, H.; Ma, J.; Jiang, J.; Guo, X.; Ling, H. U2Fusion: A unified unsupervised image fusion network. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 502–518. [Google Scholar] [CrossRef]
  14. Sun, H.; Wu, S.; Ma, L. Adversarial attacks on GAN-based image fusion. Inf. Fusion 2024, 108, 102389. [Google Scholar] [CrossRef]
  15. Liu, J.; Fan, X.; Huang, Z.; Wu, G.; Liu, R.; Zhong, W.; Luo, Z. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5802–5811. [Google Scholar]
  16. Sui, C.; Yang, G.; Hong, D.; Wang, H.; Yao, J.; Atkinson, P.M.; Ghamisi, P. IG-GAN: Interactive guided generative adversarial networks for multimodal image fusion. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5634719. [Google Scholar]
  17. Tang, W.; He, F.; Liu, Y. ITFuse: An interactive transformer for infrared and visible image fusion. Pattern Recognit. 2024, 156, 110822. [Google Scholar] [CrossRef]
  18. Tang, W.; He, F. FATFusion: A functional–anatomical transformer for medical image fusion. Inf. Process. Manag. 2024, 61, 103687. [Google Scholar] [CrossRef]
  19. Yang, B.; Li, S. Multifocus image fusion and restoration with sparse representation. IEEE Trans. Instrum. Meas. 2009, 59, 884–892. [Google Scholar] [CrossRef]
  20. Li, Y.; Chen, C.; Yang, F.; Huang, J. Deep sparse representation for robust image registration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4894–4901. [Google Scholar]
  21. Liu, Y.; Chen, X.; Peng, H.; Wang, Z. Multi-focus image fusion with a deep convolutional neural network. Inf. Fusion 2017, 36, 191–207. [Google Scholar] [CrossRef]
  22. Li, H.; Wu, X.J. DenseFuse: A fusion approach to infrared and visible images. IEEE Trans. Image Process. 2019, 28, 2614–2623. [Google Scholar] [CrossRef] [PubMed]
  23. Hou, R.; Zhou, D.; Nie, R.; Liu, D.; Xiong, L.; Guo, Y.; Yu, C. VIF-Net: An unsupervised framework for infrared and visible image fusion. IEEE Trans. Comput. Imaging 2020, 6, 640–651. [Google Scholar] [CrossRef]
  24. Jian, L.; Yang, X.; Liu, Z.; Jeon, G.; Gao, M.; Chisholm, D. SEDRFuse: A symmetric encoder–decoder with residual block network for infrared and visible image fusion. IEEE Trans. Instrum. Meas. 2020, 70, 5002215. [Google Scholar] [CrossRef]
  25. Xu, H.; Wang, X.; Ma, J. DRF: Disentangled representation for visible and infrared image fusion. IEEE Trans. Instrum. Meas. 2021, 70, 5006713. [Google Scholar] [CrossRef]
  26. Xu, H.; Zhang, H.; Ma, J. Classification saliency-based rule for visible and infrared image fusion. IEEE Trans. Comput. Imaging 2021, 7, 824–836. [Google Scholar] [CrossRef]
  27. Li, H.; Wu, X.J.; Kittler, J. RFN-Nest: An end-to-end residual fusion network for infrared and visible images. Inf. Fusion 2021, 73, 72–86. [Google Scholar]
  28. Wang, D.; Liu, J.; Fan, X.; Liu, R. Unsupervised misaligned infrared and visible image fusion via cross-modality image generation and registration. arXiv 2022, arXiv:2205.11876. [Google Scholar] [CrossRef]
  29. Li, J.; Liu, J.; Zhou, S.; Zhang, Q.; Kasabov, N.K. Infrared and visible image fusion based on residual dense network and gradient loss. Infrared Phys. Technol. 2023, 128, 104486. [Google Scholar]
  30. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the NIPS’17: 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  31. Ma, J.; Tang, L.; Fan, F.; Huang, J.; Mei, X.; Ma, Y. SwinFusion: Cross-domain long-range learning for general image fusion via swin transformer. IEEE/CAA J. Autom. Sin. 2022, 9, 1200–1217. [Google Scholar]
  32. Wang, W.; Chen, W.; Qiu, Q.; Chen, L.; Wu, B.; Lin, B.; He, X.; Liu, W. Crossformer++: A versatile vision transformer hinging on cross-scale attention. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 3123–3136. [Google Scholar] [CrossRef]
  33. Zhang, J.; Liu, H.; Yang, K.; Hu, X.; Liu, R.; Stiefelhagen, R. CMX: Cross-modal fusion for RGB-X semantic segmentation with transformers. IEEE Trans. Intell. Transp. Syst. 2023, 24, 14679–14694. [Google Scholar]
  34. Xie, Z.; Shao, F.; Chen, G.; Chen, H.; Jiang, Q.; Meng, X.; Ho, Y.S. Cross-modality double bidirectional interaction and fusion network for RGB-T salient object detection. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 4149–4163. [Google Scholar] [CrossRef]
  35. Tang, B.; Liu, Z.; Tan, Y.; He, Q. HRTransNet: HRFormer-driven two-modality salient object detection. IEEE Trans. Circuits Syst. Video Technol. 2022, 33, 728–742. [Google Scholar]
  36. Yu, X.; Liu, T.; Wang, X.; Tao, D. On compressing deep models by low rank and sparse decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7370–7379. [Google Scholar]
  37. Huang, Z.; Zhao, E.; Zheng, W.; Peng, X.; Niu, W.; Yang, Z. Infrared small target detection via two-stage feature complementary improved tensor low-rank sparse decomposition. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 17690–17709. [Google Scholar]
  38. He, L.; Cheng, D.; Wang, N.; Gao, X. Exploring homogeneous and heterogeneous consistent label associations for unsupervised visible-infrared person reid. Int. J. Comput. Vis. 2025, 133, 3129–3148. [Google Scholar] [CrossRef]
  39. Zhu, Q.; Zhong, Y.; Wu, S.; Zhang, L.; Li, D. Scene classification based on the sparse homogeneous–heterogeneous topic feature model. IEEE Trans. Geosci. Remote Sens. 2018, 56, 2689–2703. [Google Scholar] [CrossRef]
  40. Li, H.; Xu, T.; Wu, X.J.; Lu, J.; Kittler, J. Lrrnet: A novel representation learning guided fusion network for infrared and visible images. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 11040–11052. [Google Scholar] [CrossRef]
  41. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  42. Draganov, A.; Vadgama, S.; Bekkers, E.J. The hidden pitfalls of the cosine similarity loss. arXiv 2024, arXiv:2406.16468. [Google Scholar] [CrossRef]
  43. Nourbakhsh Kaashki, N.; Hu, P.; Munteanu, A. ANet: A Deep Neural Network for Automatic 3D Anthropometric Measurement Extraction. IEEE Trans. Multimed. 2023, 24, 831–844. [Google Scholar] [CrossRef]
  44. Lu, Y. The Level Weighted Structural Similarity Loss: A Step Away from MSE. arXiv 2019, arXiv:1904.13362. [Google Scholar] [CrossRef]
  45. Liu, J.; Zhang, S.; Wang, S.; Metaxas, D.N. Multispectral pedestrian detection: Benchmark dataset and baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2016; pp. 1037–1045. [Google Scholar]
  46. Tang, L.; Yuan, J.; Zhang, H.; Jiang, X.; Ma, J. PIA-Fusion: A Progressive Infrared and Visible Image Fusion Network Based on Illumination Aware. Inf. Fusion 2022, 83–84, 79–91. [Google Scholar]
  47. Toet, A. The TNO Multiband Image Data Collection. Data Brief 2017, 15, 249–251. [Google Scholar] [CrossRef] [PubMed]
  48. Xu, H.; Ma, J.; Le, Z.l.; Jiang, J.; Guo, X. Fusion-DN: A Unified Densely Connected Network for Image Fusion. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020. [Google Scholar]
  49. Tang, W.; He, F.; Liu, Y.; Duan, Y.; Si, T. DATFuse: Infrared and Visible Image Fusion via Dual Attention Transformer. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 3159–3172. [Google Scholar] [CrossRef]
  50. Tang, W.; He, F.; Liu, Y. YDTR: Infrared and Visible Image Fusion via Y-Shaped Dynamic Transformer. IEEE Trans. Multimed. 2022, 25, 5413–5428. [Google Scholar] [CrossRef]
  51. Hu, K.; Zhang, Q.; Yuan, M.; Zhang, Y. SFDFusion: An efficient spatial-frequency domain fusion network for infrared and visible image fusion. arXiv 2024, arXiv:2410.22837. [Google Scholar] [CrossRef]
  52. Studholme, C.; Hawkes, D.J.; Hill, D.L. Normalized entropy measure for multimodality image alignment. In Proceedings of the SPIE Medical Imaging, San Diego, CA, USA, 21–26 February 1998. [Google Scholar]
  53. Wang, Z.; Simoncelli, E.P.; Bovik, A.C. Multiscale structural similarity for image quality assessment. In Proceedings of the Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, Pacific Grove, CA, USA, 9–12 November 2003. [Google Scholar]
Figure 2. Diagram of the Shared Feature Extractor (SFE) that processes infrared and visible-light images.
Figure 3. Network architecture of the Bi-Feature Interaction and Tri-Feature Interaction blocks.
Figure 4. Evaluation metrics shown as a radar chart.
Table 1. Comparison results of the metrics on the TNO dataset. The ↑ indicates that larger values of the metric are better; bold indicates the best result for each metric; underline indicates the second-best result.

Method      NMI ↑   MS_SSIM ↑   EN ↑    Qabf ↑   SD ↑
CMT         0.306   0.890       6.035   0.476    30.090
DATFuse     0.512   0.923       6.581   0.501    39.846
ITFuse      0.334   0.896       6.188   0.452    23.508
LRRNet      0.337   0.709       4.436   0.344    21.444
U2Fusion    0.299   0.946       6.652   0.389    29.523
YDTR        0.374   0.681       6.439   0.451    28.784
SFDFusion   0.651   0.934       7.146   0.547    45.843
Ours        0.586   0.951       7.241   0.562    46.021
Table 2. Comparison results of the metrics on the RoadScene dataset. The ↑ indicates that larger values of the metric are better; bold indicates the best result for each metric; underline indicates the second-best result.

Method      NMI ↑   MS_SSIM ↑   EN ↑    Qabf ↑   SD ↑
CMT         0.396   0.878       6.395   0.398    32.638
DATFuse     0.531   0.883       6.726   0.502    39.570
ITFuse      0.386   0.903       6.322   0.225    30.715
LRRNet      0.340   0.753       4.595   0.228    26.963
U2Fusion    0.374   0.892       6.862   0.464    36.579
YDTR        0.441   0.691       6.777   0.452    33.326
SFDFusion   0.654   0.950       7.376   0.457    55.892
Ours        0.593   0.972       7.430   0.517    56.661
Table 3. Comparison results of the metrics on the MSRS dataset. The ↑ indicates that larger values of the metric are better; bold indicates the best result for each metric; underline indicates the second-best result.

Method      NMI ↑   MS_SSIM ↑   EN ↑    Qabf ↑   SD ↑
CMT         0.340   0.884       5.561   0.311    26.133
DATFuse     0.643   0.911       6.384   0.627    35.562
ITFuse      0.456   0.909       5.805   0.232    24.668
LRRNet      0.433   0.732       3.805   0.217    20.597
U2Fusion    0.331   0.900       5.151   0.452    23.897
YDTR        0.450   0.884       5.521   0.348    24.072
SFDFusion   0.813   0.943       6.570   0.680    42.891
Ours        0.817   0.962       6.601   0.677    43.385
Table 4. Model efficiency comparison. Bold indicates the best result for each metric; underline indicates the second-best result.

Method      Time (ms)   Params (M)   FLOPs (G)
CMT         37.24       1.27         5.67
DATFuse     10.3        0.01         5.04
ITFuse      11.08       0.02         4.43
LRRNet      11.18       0.05         4.86
U2Fusion    60.23       0.66         79.21
YDTR        9.19        0.11         17.4
SFDFusion   1.26        0.14         6.99
Ours        7.80        0.33         26.98
Table 5. Quantitative results of the proposed method with different network structures. The ↑ indicates that larger values of the metric are better; bold indicates the best result for each metric; underline indicates the second-best result.

Method              NMI ↑   MS_SSIM ↑   EN ↑    Qabf ↑   SD ↑
One-stream input    0.551   0.739       6.328   0.342    34.847
Two-stream input    0.671   0.758       6.411   0.417    35.412
w/o LRSD            0.772   0.879       6.883   0.539    41.286
w/o BFI             0.710   0.789       6.731   0.548    40.216
w/o TFI             0.768   0.863       6.831   0.552    41.490
TotalNet            0.817   0.962       7.130   0.617    43.315
