SCFusion: Spatial-Channel Cross-Frequency Guided Fusion Network for Infrared–Visible Image Object Detection

Xu, Guoxia; Sun, Yulong; Chen, Kang; Yu, Yufeng; Deng, Lizhen; Zhu, Hu

doi:10.3390/math14060969


by Guoxia Xu ¹, Yulong Sun ¹, Kang Chen ¹, Yufeng Yu ², Lizhen Deng ¹ and Hu Zhu ¹,*

¹ School of Communications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China

² Department of Statistics, Guangzhou University, Guangzhou 510006, China

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(6), 969; https://doi.org/10.3390/math14060969

Submission received: 7 January 2026 / Revised: 26 February 2026 / Accepted: 3 March 2026 / Published: 12 March 2026

(This article belongs to the Special Issue Applications of Artificial Intelligence, Machine Learning and Data Science)


Abstract

While infrared–visible image object detection exhibits advantages in complex scenarios, existing methods still suffer from issues such as static frequency-domain modeling, severe cross-modal interference, and insufficient local detail perception. To address these problems, this paper proposes an infrared and visible image object detection method based on a spatial-channel cross-frequency guided fusion network. First, we construct a frequency residual selective transformer to realize local inductive bias and a global receptive field for infrared and visible image feature extraction. Furthermore, the spatial-channel frequency fusion mechanism based on the Homogeneous Frequency Refined Block and the Heterogeneous Spatial-Channel Frequency Fusion Block is proposed to achieve modality-consistent feature fusion. Finally, the frequency reconstruction guided decoder selects high-frequency components to sharpen object boundaries. The SCFusion network was evaluated on four public benchmark datasets (VT821, VT1000, VT5000, and VI-RGBT1500), with test sets containing 411, 400, 2500, and 600 pairs of infrared–visible images, respectively. Following the standard training and testing protocol, the network achieved Mean Absolute Error (MAE) scores of 0.025 (VT821), 0.016 (VT1000), 0.023 (VT5000), and 0.020 (VI-RGBT1500).

Keywords:

infrared–visible image; object detection; spatial-channel frequency fusion

MSC:

68T07

1. Introduction

Infrared–visible image object detection has emerged as a pivotal technology in computer vision, serving as the cornerstone for safety-critical applications such as remote sensing [1], autonomous driving, and search-and-rescue missions [2,3]. In real-world environments, single-modal sensors often face insurmountable physical limitations. For instance, visible sensors are easily compromised by low illumination, severe weather (e.g., fog, haze), or lens flare, while thermal sensors, despite their robustness to lighting changes, lack textural details and color information, making them indistinguishable in thermal crossover scenarios [4]. By integrating the rich appearance details from the visible spectrum with the thermal radiation signatures from the infrared spectrum, infrared and visible systems can achieve robust perception across all-weather and all-time conditions. Despite significant advancements, constructing a high-performance infrared and visible detector remains challenging due to inherent spectral discrepancies and the complexity of feature interactions. Existing methods still face critical challenges regarding static frequency modeling, modal interference, and detail preservation.

Traditional infrared and visible image object detection relies on deep feature extraction paradigms, primarily dominated by Convolutional Neural Networks (CNNs) and Vision Transformers. Early approaches typically employed two-stream CNN architectures to extract modality-specific features, which were subsequently integrated via feature concatenation or element-wise summation strategies [5,6]. While these CNN-based methods excel at capturing local patterns, their limited receptive fields hinder the modeling of global context. To address this, transformer-based architectures have emerged as a powerful alternative, leveraging self-attention mechanisms to capture long-range dependencies across the entire image [7]. For instance, Cross-Modality Fusion Transformers [8] utilize inter-modal cross-attention to dynamically model the interactions between visible and thermal features. However, most of these dominant backbones operate exclusively in the spatial domain, often overlooking the distinct spectral characteristics inherent in the frequency domain [9].

A critical problem in current methods is their reliance on static frequency modeling. While frequency analysis excels at capturing global context [10], traditional approaches—and even learnable ones like GFNet [11]—apply the same spectral filter across the entire image. This content-agnostic strategy is fundamentally flawed for fusion tasks, where scenes are complex and non-stationary. For instance, a single fixed kernel cannot simultaneously preserve the smooth thermal background (low frequency) and sharpen the edges of a small target (high frequency). This inevitably leads to a trade-off: either blurring fine details or amplifying noise. Furthermore, while dynamic weight adaptation has revolutionized spatial convolution [12], its potential for frequency-domain fusion remains untapped, leaving a significant gap in capturing fine-grained spectral dependencies.

A more serious problem lies in the severe cross-modal interference stemming from the fundamental disparity between visible and infrared imaging mechanisms. Since these sensors capture distinct physical attributes—reflectance versus thermal emissivity—their feature distributions are inherently inconsistent. Standard fusion methods (e.g., simple concatenation or AFF [13,14]) often overlook this incompatibility, treating both modalities as equally informative at all times. This indiscriminate integration is perilous, especially in challenging scenarios [15]. For instance, when the visible stream is swamped by noise in low-light conditions, rigid fusion mechanisms act as a conduit, blindly propagating this corruption into the fused representation. Instead of leveraging the complementary thermal data, the model allows the degraded modality to pollute the final features, drowning out critical target signals and severely compromising detection reliability.

Furthermore, we confront the challenge of detail preservation. Deep networks inherently trade spatial precision for semantic abstraction, often eroding the fine edges needed for accurate bounding box regression [16]. Existing frequency-based methods often worsen this by applying blanket filters that smooth out high-frequency phase signals, the very components responsible for defining object boundaries. In challenging scenarios like partial occlusion or low contrast, this leads to blurred feature representations: the model detects the object category but fails to pinpoint its exact limits. Addressing this requires abandoning uniform spectral filtering and instead adopting a strategy that selectively enhances high-frequency details.

To address the above issues, this paper proposes a novel infrared and visible image object detection algorithm based on a spatial-channel cross-frequency guided fusion network shown in Figure 1. Our approach fundamentally shifts the modeling paradigm from static global filtering to learnable selective filter generation. Specifically, we introduce an infrared and visible frequency residual selective transformer that constructs learnable selective filter generation, enabling the network to perceive diverse frequency responses tailored to spatial features. To mitigate modal interference, we propose the Homogeneous Frequency Refined Block and Heterogeneous Spatial-Channel Frequency Fusion Block, which leverages a dual-branch mechanism to dynamically re-weight features based on their spatial and channel contexts. Finally, the Frequency Reconstruction Guided Module for the decoder is designed to select high-frequency components to sharpen object boundaries. Extensive experiments on four benchmark datasets (VT821, VT1000, VT5000, VI-RGBT1500) demonstrate that our method significantly outperforms state-of-the-art comparison methods.

The main contributions of this work are summarized as follows:

We propose the frequency residual selective transformer (FREFormer), which fundamentally shifts the frequency modeling paradigm from static global filtering to adaptive selective filter generation. By dynamically re-weighting frequency components conditioned on spatial features, the FREFormer effectively captures non-stationary visual patterns that traditional FFT-based token mixers often overlook.
We devise a dual-branch fusion strategy to resolve cross-modal conflicts. The Homogeneous Frequency Refined Block (HFRB) employs a global–local calibration mechanism to preserve intra-modal textures, while the Heterogeneous Spatial-Channel Frequency Fusion Block (HSFB) leverages decoupled amplitude and phase processing in the frequency domain to suppress modal interference and align heterogeneous features.
To address the issue of blurred boundaries in infrared–visible detection, we propose a Frequency Reconstruction Guided Module (FRGM). Integrated into the decoder, it acts as a multi-scale frequency filter to recover high-frequency edge details. This is further supervised by a novel Frequency Fidelity Loss, ensuring precise boundary regression.
Extensive evaluations on the VT821, VT1000, VT5000, and VI-RGBT1500 datasets validate the effectiveness of our proposed method. On the challenging VI-RGBT1500 dataset, specifically, we reduced the Mean Absolute Error (MAE) by 9.1% (from 0.022 to 0.020), maintaining good boundary accuracy and stability in complex scenarios. Furthermore, our method consistently maintains state-of-the-art performance, improving the E-measure by 2.3% on VT821 and the $F_{β}$ score by 1.5% on the large-scale VT5000 dataset. Compared with leading methods, our approach achieves an optimal balance between parameter efficiency and detection performance.

The remainder of this paper is organized as follows: Section 2 reviews the relevant literature. Section 3 presents the details of the proposed network architecture. Section 4 presents the experimental evaluation and analysis. Section 5 concludes this paper.

2. Related Work

2.1. CNN/Transformer-Based Infrared and Visible Image Object Detection

CNN-based methods have been extensively studied in infrared and visible image object detection due to their strong capability in local feature extraction and computational efficiency. Early works mainly focus on designing effective fusion strategies to integrate visible and infrared information. Kang et al. [17] proposed a global–local feature fusion framework based on recurrent structures to iteratively aggregate cross-modal features, which improves modal complementarity but struggles to model long-range dependencies due to the inherent locality of convolution operations. To enhance feature discrimination, Yang et al. [18] introduced a cascaded information enhancement module that refines fused features in the channel dimension. Although this approach improves inter-channel representation, it implicitly assumes similar spectral characteristics across modalities and ignores the intrinsic frequency distribution differences between visible and infrared data. Huo et al. [19] developed a context-guided stacked refinement network to progressively optimize salient target representations. However, its reliance on static convolution kernels limits adaptability to complex scenes with heterogeneous frequency components, especially in scenarios involving small targets, low contrast, or cluttered backgrounds. To capture broader contextual information, Cai et al. [20] utilized feature correlations to model long-range dependencies. However, this process often involves downsampling to reduce computational cost, inevitably sacrificing fine-grained details. Addressing the rigidity of fixed kernels, Huang et al. [21] proposed a dynamic receptive field attention module to adaptively adjust feature extraction. Yet, its effective field of view remains constrained by the local nature of convolutional operations, limiting true global modeling. 
Overall, while CNN-based approaches are effective in capturing local structures and fine details, their limited receptive field and static kernel design restrict their ability to adaptively model global context and frequency-varying characteristics across modalities, motivating the exploration of more flexible modeling paradigms.

Transformer-based methods have recently attracted increasing attention in infrared and visible image object detection due to their powerful global context modeling ability enabled by self-attention mechanisms. By establishing long-range dependencies among feature tokens, transformers can effectively enhance global semantic consistency across modalities. Liu et al. [22] proposed SwinNet, which introduces a hierarchical Swin Transformer to capture edge-aware representations in RGB-T data. Although effective, its high computational complexity and memory overhead pose challenges for real-time deployment. To explicitly model cross-modal interactions, Shen et al. [23] proposed ICAFusion, which adopts iterative cross-attention to facilitate information exchange between visible and infrared features. While this strategy improves modal interaction, it operates purely in the spatial domain and lacks fine-grained frequency-aware modulation, making it less effective in preserving local details. Xiao et al. [24] introduced GM-DETR, a generalized multispectral detection transformer that enhances fusion efficiency through attention-based mechanisms.

Despite their strong global modeling capability, most transformer-based methods suffer from three common limitations in infrared and visible scenarios: high computational cost, insufficient sensitivity to local high-frequency details, and the absence of explicit frequency-domain modeling. These issues limit their effectiveness in complex scenes where accurate target boundary localization and frequency-selective fusion are crucial. These limitations motivate the exploration of alternative fusion strategies that operate in the frequency domain, rather than relying solely on attention-based feature interaction.

2.2. Frequency-Domain Modeling for Infrared and Visible Image Detection

Frequency-domain modeling provides an alternative and complementary perspective for infrared and visible feature representation and fusion. By transforming features into the frequency domain, such methods can effectively expand the receptive field and decouple structural information across different frequency bands. Tatsunami et al. [10] leveraged FFT-based token mixing to enhance global context modeling. However, their frequency modeling strategy is global and static, lacking spatial adaptivity and the ability to selectively emphasize task-relevant frequency components. Lyu et al. [25] introduced a deep Fourier-embedded network to strengthen frequency feature representation for infrared–visible object detection, while Ref. [26] proposed a frequency-aware attention module for redundancy removal. Collectively, these frequency-driven methods have explored expansive feature extraction and complementary information mining for multi-modal object detection. Furthermore, unified spatial frequency information processing has been recently introduced into infrared–visible object detection [27].

Recently, Wu et al. [28] proposed a frequency-driven transformer for infrared and visible image object detection via frequency decomposition and aggregation. Similarly, both abundant high-frequency details and valuable low-frequency components are considered in [29]. The central motivation of these frequency-driven methods lies in selective feature extraction between single-modal and cross-modal cases. However, these existing methods often neglect the spatial-channel feature redundancy inherent in both single-modal and cross-modal scenarios. To address this, our method introduces learnable selective filter generation to complement local inductive biases and avoid high complexity in feature extraction. To enable the enhancement of modality-consistent frequency components while suppressing interference, we design the Homogeneous Frequency Refined Block (HFRB) and the Heterogeneous Spatial-Channel Frequency Fusion Block (HSFB). Our work further advances frequency-domain modeling by introducing these two modules, enabling spatially adaptive and modality-consistent fusion of cross-modal features.

3. SCFusion Method

This section presents the details of the overall architecture and key modules of the proposed infrared and visible image object detection framework, which focuses on accurate localization and boundary-aware prediction of salient objects. As illustrated in Figure 1, the proposed framework mainly consists of the following core modules: the frequency residual selective transformer (FREFormer), the Homogeneous Frequency Refined Block (HFRB), the Heterogeneous Spatial-Channel Frequency Fusion Block (HSFB), and the Frequency Reconstruction Guided Module (FRGM), along with a joint loss function for optimization.

3.1. Overall Network Architecture

First, visible and infrared images are input into two parallel weight-sharing encoders to extract hierarchical features through a series of downsampling layers and FREFormer Blocks. Within each encoder level, Homogeneous Frequency Refined Blocks (HFRBs) are employed to capture intra-modal frequency details, while Heterogeneous Spatial-Channel Frequency Fusion Blocks (HSFBs) bridge the two branches to integrate cross-modal complementary information. These fused features are then progressively fed into a four-stage decoder consisting of the Frequency Reconstruction Guided Module (FRGM). Each FRGM layer adaptively reconstructs target-related frequency components from the fused features to provide fine-grained spatial guidance, ultimately facilitating precise target detection and segmentation through multi-scale frequency-domain refinement.

3.2. FREFormer Encoder of Infrared and Visible Image

The proposed method relies heavily on the Fast Fourier Transform (FFT) and the Inverse Fast Fourier Transform (IFFT). Theoretically, point-wise multiplication in the frequency domain is equivalent to circular convolution in the spatial domain, so frequency-domain filtering inherits the translational equivariance of convolution. In practice, however, the FFT assumes that the signal is periodic, which does not hold for images: because an image is finite, its periodic extension creates discontinuities at the boundaries, and these generate artifacts that affect the analysis. To address this issue, SCFusion adopts a hybrid design that integrates frequency-domain modules with MetaFormer-style blocks [30] and alternates them with standard convolutional layers (e.g., in the downsampling and encoder stages). These convolutional layers introduce strong local inductive biases and smooth local regions of the image, which mitigates the artifacts caused by boundary discontinuities. At the same time, this design enhances the network’s ability to capture local details, further improving the precision of image processing.
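The equivalence between point-wise spectral multiplication and circular convolution can be checked numerically. The following is a minimal NumPy sketch and is not taken from the paper's implementation; the helper names `circular_conv2d` and `fft_conv2d` are hypothetical, and the random 8×8 arrays are placeholder data.

```python
import numpy as np

def circular_conv2d(x, k):
    """Direct 2D circular convolution: y[m, n] = sum_{p, q} k[p, q] * x[m - p, n - q],
    with all indices taken modulo the image size (wrap-around)."""
    H, W = x.shape
    y = np.zeros_like(x, dtype=float)
    for p in range(H):
        for q in range(W):
            # np.roll shifts x by (p, q) with wrap-around, i.e. rolled[m, n] = x[m - p, n - q]
            y += k[p, q] * np.roll(x, shift=(p, q), axis=(0, 1))
    return y

def fft_conv2d(x, k):
    """Same operation computed as a point-wise product of the two spectra."""
    return np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(k)).real

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))   # placeholder "image"
k = rng.standard_normal((8, 8))   # placeholder kernel, same size as the image

direct = circular_conv2d(x, k)
spectral = fft_conv2d(x, k)
max_abs_diff = float(np.max(np.abs(direct - spectral)))
print(max_abs_diff)
```

The wrap-around in `np.roll` is exactly the periodicity assumption discussed above: a kernel placed at the right edge of the image also reads pixels from the left edge, which is where the boundary artifacts come from.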

Based on this MetaFormer-style block structure, we designed the entire encoder module, the FREFormer Encoder. As illustrated in Figure 2, the FREFormer Encoder adopts a two-stage hierarchical structure, with each stage containing two layers: the downsampling layer and the FREFormer Block. This hybrid approach leverages the local inductive bias of convolutions in early stages and the global receptive field of FFT-based filters in later stages. The downsampling layer reduces spatial resolution while increasing channel capacity. Given input feature $X_{i}$, the downsampling layer can be expressed as the following equation:

$$Y_{i} = \mathrm{LN}(\mathrm{Conv2d}(\mathrm{LN}(X_{i}))), \tag{1}$$

where $\mathrm{LN}(\cdot)$ denotes Layer Normalization.

Following downsampling, the network retains significant frequency information, comprising both target-related details essential for detection and irrelevant background noise. The amplification or suppression of these frequency signals is critical for detection performance, creating a pressing need for an effective frequency-processing filter.

Conventional spectral filtering approaches typically learn a static, global complex filter. Once trained, this filter remains fixed and is applied indiscriminately to all input images. This “content-agnostic” paradigm assumes that a single spectral modulation strategy can generalize across all scenarios. However, real-world scenes are non-stationary; the frequency distribution of a cluttered scene differs significantly from that of a clean, simple object. A static filter lacks the flexibility to adapt to these varying semantic contents, often leading to suboptimal feature extraction where noise is amplified or edges are blurred.

To address these limitations, we propose a dynamic filtering architecture: the learnable selective filter. The learnable selective filter serves as the overarching architectural unit designed to replace the multi-head self-attention mechanism in the transformer, defining the complete pipeline for frequency-domain modulation. Embedded within the learnable selective filter is its core functional module, the learnable selective filter generation (LSFG), which acts as the brain of the filter.

The primary motivation behind LSFG is to introduce adaptability into frequency-domain processing. Unlike static approaches that use a fixed weight matrix, LSFG functions as a dynamic parameter generator, synthesizing a unique filter weight $F(Y_{i})$ tailored to each input instance $Y_{i}$. The mathematical formulation of LSFG is defined in Equation (2):

$$\begin{aligned} E(X) &= \frac{\sum_{i,j} X_{i,j}}{HW}, \\ G(E(X)) &= \sum_{i=1}^{N} \left( \frac{E(X)_{i,j}}{\sum_{i,j} E(X)_{i,j}} \right). \end{aligned} \tag{2}$$

Specifically, the module initiates by capturing global semantics through $E(X) = \frac{\sum_{i,j} X_{i,j}}{HW}$, which functions as a global average pooling operator to compress spatial information into a concise global descriptor representing the image’s overall “energy.” Building on this statistical profile, the generator $G$ synthesizes the dynamic filter weights via a normalization process, expressed as $G(E(X)) = \sum_{i=1}^{N} \left( \frac{E(X)_{i,j}}{\sum_{i,j} E(X)_{i,j}} \right)$. By explicitly conditioning the filter generation on these input-specific characteristics derived from $E(X)$, the model achieves “instance-aware” processing. This ensures that the filter evolves adaptively with the data, dynamically distinguishing informative frequency components from noise for each specific instance.

With the dynamic weights generated by LSFG, the overall learnable selective filter operates as described in Equation (3):

$$\begin{aligned} F(Y_{i}) &= G(\mathrm{MLP}(\mathrm{Softmax}(Y_{i}))), \\ L(Y_{i}) &= \mathcal{F}^{-1}(F(Y_{i}) \odot \mathcal{F}(Y_{i})), \end{aligned} \tag{3}$$

where $\mathcal{F}$ and $\mathcal{F}^{-1}$ denote the FFT and the Inverse FFT, as distinct from the generated filter $F$. The process involves two key steps: first, filter synthesis, where input features are processed via MLP and Softmax layers to generate instance-specific weights $F(Y_{i})$; and second, dynamic modulation, where the synthesized filter $F(Y_{i})$ modulates the input spectrum $\mathcal{F}(Y_{i})$ via element-wise multiplication, followed by feature reconstruction using the Inverse FFT ($\mathcal{F}^{-1}$).
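The two steps of filter synthesis and dynamic modulation can be sketched in NumPy as follows. This is only an illustration under stated assumptions, not the paper's implementation: the bank of `N` complex filters (`bank`), the two MLP weight matrices (`w1`, `w2`), and the function name `selective_filter` are all hypothetical, and the normalization is applied after the MLP for simplicity, which may differ from the ordering used in the paper.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def selective_filter(y, w1, w2, bank):
    """Instance-conditioned frequency filtering (illustrative assumption).

    y    : (C, H, W) real feature map
    w1   : (D, C) first MLP layer (hypothetical)
    w2   : (N, D) second MLP layer (hypothetical)
    bank : (N, C, H, W // 2 + 1) complex filter bank (hypothetical)
    """
    C, H, W = y.shape
    e = y.mean(axis=(1, 2))                    # global average pooling -> (C,)
    hidden = np.maximum(w1 @ e, 0.0)           # small MLP with ReLU -> (D,)
    coeff = softmax(w2 @ hidden)               # (N,) normalized weights that sum to 1
    filt = np.tensordot(coeff, bank, axes=1)   # instance-specific filter, (C, H, W//2+1)
    spec = np.fft.rfft2(y, axes=(1, 2))        # input spectrum
    out = np.fft.irfft2(filt * spec, s=(H, W), axes=(1, 2))  # modulate, then inverse FFT
    return out, coeff

rng = np.random.default_rng(0)
C, H, W, N, D = 4, 8, 8, 3, 6
y = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((D, C))
w2 = rng.standard_normal((N, D))
bank = np.ones((N, C, H, W // 2 + 1), dtype=complex)  # all-pass bank as a sanity check
out, coeff = selective_filter(y, w1, w2, bank)
```

With an all-pass bank the mixed filter is identically one, so the output reproduces the input; a trained bank would instead emphasize or suppress bands depending on the descriptor of each input.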

This design establishes a coherent mechanism where the spatial content (captured by $E$) explicitly governs the spectral bias $F$. By adapting to the input, the model can preserve high-frequency components to maintain sharpness in textured regions or suppress specific bands to reduce noise in smooth backgrounds. Consequently, by combining the global receptive field of the FFT with adaptive, instance-specific weights, the learnable selective filter (LSF) offers a compelling alternative to multi-head self-attention.

Finally, integrating this mechanism into the network, the overall FREFormer Block is formulated as:

$$\begin{aligned} F_{A} &= X + \mathrm{LSF}(\mathrm{LN}(X)), \\ F_{B} &= F_{A} + \mathrm{ChannelMLP}(\mathrm{LN}(F_{A})). \end{aligned} \tag{4}$$
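The pre-norm residual structure of Equation (4) can be sketched as below. This is a hedged illustration rather than the authors' code: `layer_norm` omits the learnable scale and shift, `channel_mlp` uses a ReLU as an assumed activation, and `dc_only` is a fixed stand-in mixer (it keeps only the zero-frequency term) used in place of the dynamic learnable selective filter.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize over the channel axis (axis 0) at each spatial position; no affine terms."""
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def channel_mlp(x, w1, w2):
    """Two-layer MLP applied independently at each pixel across channels (ReLU assumed)."""
    h = np.maximum(np.einsum("dc,chw->dhw", w1, x), 0.0)
    return np.einsum("cd,dhw->chw", w2, h)

def freformer_block(x, mixer, w1, w2):
    f_a = x + mixer(layer_norm(x))                     # F_A = X + LSF(LN(X))
    f_b = f_a + channel_mlp(layer_norm(f_a), w1, w2)   # F_B = F_A + ChannelMLP(LN(F_A))
    return f_b

rng = np.random.default_rng(0)
C, H, W, D = 4, 8, 8, 16
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((D, C)) * 0.1
w2 = rng.standard_normal((C, D)) * 0.1

mask = np.zeros((H, W))
mask[0, 0] = 1.0                                       # keep only the DC component

def dc_only(z):
    """Stand-in frequency mixer (NOT the paper's dynamic filter)."""
    return np.fft.ifft2(np.fft.fft2(z, axes=(1, 2)) * mask, axes=(1, 2)).real

out = freformer_block(x, dc_only, w1, w2)
```

Because both sub-layers are residual, the block reduces to the identity when the mixer and the MLP output are zero, which is a convenient sanity check when swapping in a different token mixer.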

3.3. Cross-Frequency Guided Interaction Module (CFGIM)

Building upon the frequency-enhanced representations extracted by the FREFormer, we introduce the Cross-Frequency Guided Interaction Module (CFGIM), as illustrated in Figure 3, to resolve inherent conflicts between infrared and visible modalities. Unlike traditional fusion methods that implicitly assume strict pixel-to-pixel alignment in the spatial domain, CFGIM reframes multi-modal interaction as a spectral decoupling and recombination process.

Our design is grounded in a fundamental property of Fourier analysis: phase encodes spatial structural information, while amplitude represents signal intensity. Motivated by this physical interpretability, the CFGIM employs a two-stage interaction mechanism:

(1) Homogeneous Frequency Refined Block (HFRB): This module serves as a content-aware active filter. Prior to fusion, it adaptively suppresses modality-specific noise (e.g., thermal grain or background clutter) within each branch, effectively preventing noise amplification during the interaction process.

(2) Heterogeneous Spatial-Channel Frequency Fusion Block (HSFB): This module performs the core structure–content decoupling. It explicitly aligns the structural edges of the two modalities via phase calibration—thereby addressing the spatial misalignment issue—while simultaneously fusing their salient intensities via amplitude interaction.

This divide-and-conquer strategy in the frequency domain provides a transparent mechanism for cross-modal integration, ensuring that the fused features inherit both the sharpest structures and the most salient semantics from the respective inputs.
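The claim that phase carries spatial structure while amplitude carries intensity can be illustrated with the shift theorem of the discrete Fourier transform: circularly translating an object leaves the amplitude spectrum unchanged and alters only the phase. The snippet below is a small NumPy check on a synthetic image; the square "object" and the shift amounts are arbitrary choices for illustration.

```python
import numpy as np

img = np.zeros((16, 16))
img[4:8, 5:9] = 1.0                                  # a small bright square
shifted = np.roll(img, shift=(3, 2), axis=(0, 1))    # the same square, moved

S0 = np.fft.fft2(img)
S1 = np.fft.fft2(shifted)

# Amplitude is blind to where the square is located.
amp_equal = bool(np.allclose(np.abs(S0), np.abs(S1)))
# The complex spectra differ, so the difference lies entirely in the phase.
spectra_equal = bool(np.allclose(S0, S1))

# Pairing the amplitude of the original with the phase of the shifted image
# reconstructs the shifted image: the phase decides the location.
recon = np.fft.ifft2(np.abs(S0) * np.exp(1j * np.angle(S1))).real
```

This is the property the two-stage design relies on when it keeps the phase of each modality fixed and modulates only the amplitude.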

3.3.1. Homogeneous Frequency Refined Block (HFRB)

We design a Homogeneous Frequency Refined Block (HFRB) with a dual-branch mechanism to predict two dense refined weights $T_{ir}$ and $T_{vis}$ for element-wise adjustment. For either modality, let $T_{0} \in R^{C \times H \times W}$ denote the input of this block. In this module, features are processed through two branches: a global grouped aggregation branch and a local frequency calibration branch.

The upper branch is primarily responsible for capturing global context information and performing grouped weighted fusion, a strategy similar to Split-Attention. First, the input feature $T_{0} \in R^{C \times H \times W}$ is spatially compressed via global average pooling (AvgPool) and then passed through a fully connected (FC) layer and a Sigmoid activation function to generate the global channel weight $w_{global}$. This global channel weight is used to weight the frequency-domain feature.

Furthermore, the frequency-guided mechanism aims to enhance frequency feature expression by capturing local dependencies between channels and performing the final feature calibration. This branch also uses an average pooling layer, followed by a fully connected (FC) layer and a 1D convolution (Conv1D). It can be formulated as:

$$F_{k} = \mathrm{Sigmoid}(\mathrm{Conv1D}(\mathrm{FC}(\mathrm{AvgPool}(T_{0})))). \tag{5}$$

This design effectively captures interaction information between adjacent frequency components (i.e., adjacent channels) in the frequency domain, avoiding the destruction of local frequency correlations caused by fully connected layers.

Simultaneously, the input feature $w_{global}$ is split along the channel dimension into $K$ groups. These groups are individually weighted using the generated weights via element-wise multiplication (Hadamard product). Finally, these weighted grouped features are aggregated through element-wise addition (ADD) to obtain the intermediate feature $\tilde{T}_{1}$. This process can be formulated as:

$$\tilde{T}_{1} = \sum_{k} (F_{k} \odot w_{global}^{k}). \tag{6}$$

It aims to adaptively retain salient global features from either infrared or visible modalities through a gating mechanism. Through this dual-attention mechanism (global grouped aggregation + local frequency calibration), the model effectively preserves both texture details and target features.

3.3.2. Heterogeneous Spatial-Channel Frequency Fusion Block (HSFB)

The detailed execution flow of the HSFB is summarized in Algorithm 1. Unlike standard CNN modules that operate on entangled features, the HSFB is designed to explicitly decouple the “where” (spatial structure) from the “what” (semantic intensity).

Algorithm 1 Heterogeneous Spatial-Channel Frequency Fusion Block (HSFB).

Require: Input features $\tilde{T}_{1}, \tilde{T}_{2} \in R^{C \times H \times W}$ (IR and visible);
Require: Learnable blocks $CBS(\cdot)$, $CLC(\cdot)$.
Ensure: Fused feature map $F_{out}$.

1: for $k \in \{1, 2\}$ do ▹ Parallel processing for each modality
2: $X_{k} \leftarrow CBS(\tilde{T}_{k})$ ▹ Projection to latent space
// Step 1: Spatial Decomposition
3: $S_{k} \leftarrow F_{spatial}(X_{k})$
4: $A_{k}^{sp}, P_{k}^{sp} \leftarrow |S_{k}|, \angle S_{k}$ ▹ Save $P_{k}^{sp}$ for final reconstruction
// Step 2: Channel Frequency Interaction
5: $C_{k} \leftarrow F_{channel}(A_{k}^{sp})$ ▹ FFT along channel dimension
6: $A_{k}^{ch}, P_{k}^{ch} \leftarrow |C_{k}|, \angle C_{k}$
7: $\hat{A}_{k}^{ch} \leftarrow CLC(A_{k}^{ch}); \hat{P}_{k}^{ch} \leftarrow CLC(P_{k}^{ch})$ ▹ Learnable modulation
// Step 3: Dual-Domain Reconstruction
8: $\hat{C}_{k} \leftarrow \hat{A}_{k}^{ch} \cdot e^{j \cdot \hat{P}_{k}^{ch}}$
9: $\hat{A}_{k}^{sp} \leftarrow F_{channel}^{-1}(\hat{C}_{k})$ ▹ Obtain refined intensity
// Step 4: Structure–Content Recombination
10: $\hat{S}_{k} \leftarrow \hat{A}_{k}^{sp} \cdot e^{j \cdot P_{k}^{sp}}$ ▹ Inject refined intensity into original skeleton
11: $F_{k}^{branch} \leftarrow F_{spatial}^{-1}(\hat{S}_{k})$
12: end for
13: $F_{out} \leftarrow F_{1}^{branch} + F_{2}^{branch}$ ▹ Element-wise fusion
14: return $F_{out}$

Given the frequency-refined features $\tilde{T}_{1}$ (infrared) and $\tilde{T}_{2}$ (visible) generated by the preceding HFRB module, these features are first processed by a CBS block to project them into a shared embedding space. To initiate the decoupling, we transform the features into the spatial frequency domain via 2D FFT. This step extracts two distinct physical components: the spatial phase ($P^{sp}$), which represents the “skeleton” of the image by encoding structural edges, and the spatial amplitude ($A^{sp}$), which represents the “muscle” or signal intensity. This decomposition is formulated as:

$$A_{k}^{sp}, P_{k}^{sp} = F_{spatial}(\mathrm{CBS}(\tilde{T}_{k})), \quad k \in \{1, 2\}, \tag{7}$$

where $F_{spatial}$ denotes the Spatial FFT. By isolating $P^{sp}$, we preserve the precise location of objects, effectively preventing the boundary blurring often caused by spatial misalignment.

While the spatial phase is preserved to maintain structure, the amplitude (intensity) requires cross-channel calibration to highlight salient targets. We map the spatial amplitude $A^{sp}$ into the channel-frequency domain using Channel-wise FFT (CFFT), treating the channel dimension as a temporal sequence to capture inter-channel dependencies:

$$A_{k}^{ch}, P_{k}^{ch} = F_{channel}(A_{k}^{sp}). \tag{8}$$

In this domain, the CLC module (comprising $1 \times 1$ convolutions) acts as a learnable frequency filter. It dynamically modulates the channel spectrum to suppress background noise and enhance target-related responses:

$$\hat{A}_{k}^{ch} = \mathrm{CLC}(A_{k}^{ch}), \quad \hat{P}_{k}^{ch} = \mathrm{CLC}(P_{k}^{ch}). \tag{9}$$

Structure-Preserving Reconstruction is the most critical step for addressing cross-modal interference. We first reconstruct the refined spatial amplitude using the inverse channel FFT (

F_{channel}^{- 1}

):

{\hat{A}}_{k}^{s p} = F_{channel}^{- 1} ({\hat{A}}_{k}^{c h} \cdot e^{j \cdot {\hat{P}}_{k}^{c h}}) .

(10)

Then, strictly following the logic in Algorithm 1, we perform a Structure–Content Recombination. We combine the calibrated intensity (

{\hat{A}}_{k}^{s p}

) with the original spatial phase (

P_{k}^{s p}

). This keeps the enhanced semantic features aligned with the original object boundaries, which mitigates the spatial misalignment issue:

F_{k}^{branch} = F_{spatial}^{- 1} ({\hat{A}}_{k}^{s p} \cdot e^{j \cdot P_{k}^{s p}}) .

(11)

Finally, the features from both branches are fused via element-wise addition to integrate the complementary information:

F_{o u t} = F_{1}^{branch} + F_{2}^{branch} .

(12)

Through this mechanism, the HSFB achieves a physically interpretable fusion: it “transplants” the robust, noise-free semantic intensities onto the precise structural skeletons of the source images.
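The dual-domain pipeline of Equations (7)–(12) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the CBS projection is omitted, the learnable CLC module is replaced by plain callables (`modulate_amp` and `modulate_phase` are hypothetical names), and taking the real part after each inverse transform is an assumption made so that the refined intensity stays real-valued.

```python
import numpy as np

def hsfb_branch(feat, modulate_amp=lambda a: a, modulate_phase=lambda p: p):
    """One modality branch; feat is a real array of shape (C, H, W)."""
    # Eq. (7): spatial 2D FFT -> spatial amplitude and phase
    spec = np.fft.fft2(feat, axes=(-2, -1))
    a_sp, p_sp = np.abs(spec), np.angle(spec)

    # Eq. (8): FFT of the spatial amplitude along the channel axis
    ch_spec = np.fft.fft(a_sp, axis=0)
    a_ch, p_ch = np.abs(ch_spec), np.angle(ch_spec)

    # Eq. (9): stand-in for the learnable CLC modulation (assumption)
    a_ch_hat, p_ch_hat = modulate_amp(a_ch), modulate_phase(p_ch)

    # Eq. (10): inverse channel FFT gives the refined spatial amplitude
    a_sp_hat = np.fft.ifft(a_ch_hat * np.exp(1j * p_ch_hat), axis=0).real

    # Eq. (11): recombine refined amplitude with the ORIGINAL spatial phase
    s_hat = a_sp_hat * np.exp(1j * p_sp)
    return np.fft.ifft2(s_hat, axes=(-2, -1)).real

def hsfb_fuse(t1, t2, **modulation):
    """Eq. (12): element-wise sum of the infrared and visible branches."""
    return hsfb_branch(t1, **modulation) + hsfb_branch(t2, **modulation)
```

With the identity modulation each branch reproduces its input, so the fused output reduces to the element-wise sum of the two feature maps. Any change in the output therefore comes only from how the channel spectrum is modulated, while the spatial phase is carried through untouched.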

3.4. Frequency Reconstruction Guided Module (FRGM) for Decoder

Transitioning from deep semantic features back to high-resolution detection masks presents a critical challenge: the “Information Attenuation” dilemma. Standard decoders typically rely on spatial upsampling, which tends to smooth out high-frequency details, resulting in the common “blurred boundary” problem where object edges become ambiguous. To overcome this limitation, we design the Frequency Reconstruction Guided Module (FRGM). As illustrated in the decoder architecture, the FRGM operates on the rigorous principle of Multi-resolution Analysis (MRA), fundamentally transforming the decoding process from simple interpolation into a spectral-aware reconstruction.

Specifically, the FRGM decomposes the fused features into distinct frequency bands and assigns an explicit role to each. Low-frequency bands serve as the Semantic Anchor: they guide the coarse localization of salient objects, so that the global shape and category information remain consistent during upsampling. High-frequency bands serve as the Boundary Sharpener: they are explicitly isolated and enhanced to regress sharp contours and fine-grained textures, which are typically the first details to be lost in deep networks.

This coarse-to-fine reconstruction strategy encourages the model to produce detection masks with precise edges by explicitly enforcing the recovery of the high-frequency components that sharpen target boundaries.

The FRGM module decomposes the fused feature

O_{f}

into multi-frequency bands to enhance target edge localization in the decoder. First, a binary mask

M_{b}

partitions the spectrum of

O_{f}

into four frequency bands according to different thresholds:

M_{b} (i, j) = \{\begin{matrix} 1, & f_{b - 1} \leq ∥ (i, j) ∥ < f_{b} \\ 0, & otherwise \end{matrix},

(13)

where

M_{b}

is the binary mask of the bth frequency band, and

f_{b}

is the frequency threshold. The frequency band feature is obtained by multiplying the Fourier transform of

O_{f}

with

M_{b}

and inverse Fourier transform:

F_{b} = F^{- 1} (F (O_{f}) ⊙ M_{b}) .

(14)

A convolution layer + Sigmoid activation is used to generate a modulation map

A_{b} \in R^{H \times W \times C}

for the bth frequency band:

A_{b} = Sigmoid (Conv (F_{b})) .

(15)

The final enhanced feature is:

F_{f i n a l} = \sum_{b = 1}^{4} F_{b} ⊙ A_{b} .

(16)
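Equations (13)–(16) can be illustrated with a short NumPy sketch. It is a simplified example rather than the authors' code: the threshold values are hypothetical, the bands are assumed to be half-open radial intervals so that they do not overlap, and the `Conv` layer in Equation (15) is replaced by an element-wise sigmoid gate on the band itself.

```python
import numpy as np

def radial_band_masks(height, width, thresholds):
    """Binary masks M_b splitting the 2D spectrum into len(thresholds) + 1 radial bands."""
    fy = np.fft.fftfreq(height)[:, None]   # vertical frequencies in cycles/pixel
    fx = np.fft.fftfreq(width)[None, :]    # horizontal frequencies in cycles/pixel
    radius = np.sqrt(fy ** 2 + fx ** 2)    # ||(i, j)|| in Eq. (13)
    edges = [0.0, *thresholds, np.inf]
    return [((radius >= lo) & (radius < hi)).astype(float)
            for lo, hi in zip(edges[:-1], edges[1:])]

def split_bands(feature, masks):
    """Eq. (14): F_b = IFFT(FFT(O_f) * M_b) for every band."""
    spectrum = np.fft.fft2(feature)
    return [np.fft.ifft2(spectrum * m).real for m in masks]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def frgm_enhance(feature, masks, gate=sigmoid):
    """Eqs. (15)-(16): gate each band and sum; `gate` stands in for Sigmoid(Conv(.))."""
    return sum(band * gate(band) for band in split_bands(feature, masks))
```

Because the masks partition the spectrum, the ungated bands add back up to the input feature. The modulation maps are therefore the only place where the relative weight of low and high frequencies changes.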

The FRGM is embedded into the decoder for object detection. The decoder consists of four progressive stages, with the FRGM serving as the core enhancement module at each level; each decoder block consists of a 3 × 3 convolution layer, batch normalization, and GELU activation. Let

D_{l}

denote the output feature of the lth decoder stage (

l \in \{1, 2, 3, 4\}

). To effectively leverage the multi-frequency information, the fused feature

O_{f}

from the fusion bottleneck is downsampled or upsampled to match the spatial resolution of each decoder layer, denoted as

O_{f}^{l}

. The FRGM then processes

O_{f}^{l}

to generate the frequency-enhanced guidance feature

F_{f i n a l}^{l}

.

Specifically, the lth decoder block integrates the upsampled feature from the previous layer

D_{l - 1}

and the frequency-guided feature

F_{f i n a l}^{l}

as follows:

D_{l} = GELU (BN (Conv_{3 \times 3} ([Up (D_{l - 1}); F_{f i n a l}^{l}]))) .

(17)

The FRGM acts as a multi-scale frequency filter across all four stages, ensuring that both high-frequency edge details and low-frequency semantic consistency are adaptively injected into the reconstruction process. This hierarchical integration allows the decoder to precisely localize targets by reconstructing sharp boundaries from the multi-frequency components.

3.5. Loss Function

To enhance the model’s ability to capture target edges and structure, a joint loss function is designed, including Frequency Fidelity Loss (

L_{f f}

), Dice (

L_{d i c e}

) and Cross Entropy loss (

L_{c e}

).

The edge detection map

E_{p r e d}

and ground truth

E_{g t}

(obtained by the Canny Algorithm [31]) are transformed by 2D FFT:

F_{p r e d} = F (E_{p r e d}), F_{g t} = F (E_{g t}) .

(18)

The frequency fidelity error of the bth frequency band is:

L_{f f}^{b} = \frac{1}{H W} {∥ F_{p r e d} - F_{g t} ∥}_{2}^{2} .

(19)

The multi-band Frequency Fidelity Loss is the weighted sum of the per-band errors:

L_{f f} = \sum_{b = 1}^{4} w_{b} L_{f f}^{b},

(20)

where

w_{b}

is a learnable weight coefficient. The final joint loss function is:

L_{t o t a l} = λ_{1} L_{f f} + λ_{2} L_{d i c e} + λ_{3} L_{c e},

(21)

where

λ_{1}

,

λ_{2}

and

λ_{3}

are balance coefficients (set to 0.6 and 0.4 in experiments).
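The joint loss of Equations (18)–(21) can be sketched as below. Several points are assumptions made for illustration: Equation (19) is applied per band by masking the spectral difference with a radial band mask (the helper `band_masks` and its thresholds are hypothetical), the Dice and cross-entropy terms use their common binary forms because the text does not spell them out, the band weights are passed in as fixed numbers rather than learned, and the balance coefficients are supplied by the caller.

```python
import numpy as np

def band_masks(height, width, thresholds):
    """Hypothetical radial band masks (half-open intervals, non-overlapping)."""
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)
    edges = [0.0, *thresholds, np.inf]
    return [((radius >= lo) & (radius < hi)).astype(float)
            for lo, hi in zip(edges[:-1], edges[1:])]

def frequency_fidelity_loss(e_pred, e_gt, masks, weights):
    """Weighted sum over bands of (1 / HW) * ||M_b * (F_pred - F_gt)||^2."""
    h, w = e_pred.shape
    diff = np.fft.fft2(e_pred) - np.fft.fft2(e_gt)          # Eq. (18)
    per_band = [np.sum(np.abs(diff * m) ** 2) / (h * w)     # Eq. (19), masked per band
                for m in masks]
    return float(np.dot(weights, per_band))                 # Eq. (20)

def dice_loss(pred, gt, eps=1e-6):
    inter = np.sum(pred * gt)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(gt) + eps))

def cross_entropy_loss(pred, gt, eps=1e-7):
    p = np.clip(pred, eps, 1.0 - eps)                       # avoid log(0)
    return float(-np.mean(gt * np.log(p) + (1.0 - gt) * np.log(1.0 - p)))

def total_loss(l_ff, l_dice, l_ce, lambdas):
    """Eq. (21): the caller supplies (lambda_1, lambda_2, lambda_3)."""
    lam1, lam2, lam3 = lambdas
    return lam1 * l_ff + lam2 * l_dice + lam3 * l_ce
```

With unit band weights and masks that partition the spectrum, Parseval's theorem makes the frequency term equal to the plain sum of squared pixel differences. The band weights are what let the loss emphasize edge-carrying high frequencies over that baseline.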

4. Experiments

To verify the effectiveness of the proposed algorithm, extensive experiments are conducted on four public infrared and visible image object detection datasets. This section details the experimental settings, results analysis, robustness analysis in hazy and complex scenarios and ablation studies.

4.1. Experimental Settings

All experiments are implemented in Python 3.12 under Ubuntu 18.04 LTS on an NVIDIA GeForce RTX 4060 (8 GB). The proposed method is trained for 300 epochs with a batch size of eight. The initial learning rate is set to

3 \times 10^{- 5}

(decayed by 0.1 every 100 epochs), and we apply the Adam optimizer (betas = (0.9, 0.999), weight decay =

1 \times 10^{- 4}

). Random cropping (384 × 384), horizontal flipping (p = 0.5), and rotation (±15°) are applied as data augmentations during training.

For a fair evaluation, we select four public infrared and visible object detection datasets with a split ratio of 70% (training), 20% (validation), and 10% (testing). VT821 [32] contains 821 pairs of visible–infrared images, with manual registration and partial missing infrared regions. VT1000 [33] includes 1000 pairs of high-precision aligned images with fine pixel-level annotations. VT5000 [13] has 5000 pairs of images covering complex scenarios such as occlusion and low illumination. VI-RGBT1500 [34] has 1500 pairs of images with diverse scenes (indoor/outdoor, day/night) and accurate annotations.

Nine state-of-the-art methods are used for comparison in the standard evaluation: CSRNet [19], MIDD [35], OSRNet [18], SwinNet [22], TNet [23], LSNet [24], IFFNet [25], DFENet [36], and LAFB [2].

Finally, five common metrics for object detection are used: the E-measure (Em), measuring global and local alignment [37], the S-measure (Sm), measuring structural similarity [38], the F-measure (

F_{β}

), which is the weighted harmonic mean of precision and recall [39], the weighted F-measure (

ω F_{β}

), emphasizing spatial structure [40], and the MAE, Mean Absolute Error (lower is better) [41]. In addition, we selected the M3FD [42] dataset, which includes a large number of target images captured under challenging conditions such as haze and dense fog. The dataset consists of high-resolution images with varying sizes, ranging from 800 × 600 to 1920 × 1080 pixels, ensuring a broad representation of real-world environmental conditions. These images provide valuable data for evaluating object detection performance under degraded visibility conditions.
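As a quick illustration of the training schedule stated above (initial learning rate of 3 × 10⁻⁵, multiplied by 0.1 every 100 epochs, 300 epochs in total), the step decay can be written as a small function. The function name `step_lr` and its parameter names are chosen here for illustration only and are not taken from the authors' code.

```python
def step_lr(epoch, base_lr=3e-5, gamma=0.1, step_size=100):
    """Learning rate at a given 0-based epoch under step decay."""
    return base_lr * gamma ** (epoch // step_size)

# Learning rate at the start of each of the three 100-epoch phases
schedule = {epoch: step_lr(epoch) for epoch in (0, 100, 200)}
```

Under this reading the model trains at 3 × 10⁻⁵ for epochs 0–99, 3 × 10⁻⁶ for epochs 100–199, and 3 × 10⁻⁷ for epochs 200–299.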

4.1.1. Enhanced-Alignment Measure

The E-measure comprehensively considers global statistical information at the image level and local pixel alignment to measure the overall consistency between the predicted map P and the ground truth G. Its definition is:

Q = \frac{1}{W \times H} \sum_{x = 1}^{W} \sum_{y = 1}^{H} ϕ (P (x, y), G (x, y)),

(22)

where W and H represent the width and height of the image respectively, and

ϕ (\cdot)

represents the enhanced alignment function, which aims to simultaneously evaluate local pixel consistency and overall structural alignment. P(x, y) and G(x, y) represent the values of the predicted map and the corresponding ground truth at pixel (x, y), respectively.
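A minimal NumPy sketch of Equation (22) is given below. The text does not spell out the enhanced alignment function, so the sketch assumes the commonly used form from the E-measure literature: each map is centered by its global mean, the two bias maps are compared through a normalized product, and the result is squashed into [0, 1] by a quadratic. Special cases such as an all-background ground truth and the usual sweep over binarization thresholds are left out.

```python
import numpy as np

def e_measure(pred, gt, eps=1e-8):
    """Mean enhanced alignment between a prediction and its ground truth."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    phi_p = pred - pred.mean()            # bias map of the prediction (global statistics)
    phi_g = gt - gt.mean()                # bias map of the ground truth
    # Pixel-wise alignment in [-1, 1]; eps guards against a zero denominator
    align = 2.0 * phi_p * phi_g / (phi_p ** 2 + phi_g ** 2 + eps)
    enhanced = (1.0 + align) ** 2 / 4.0   # assumed quadratic mapping into [0, 1]
    return float(enhanced.mean())         # average over all W x H pixels
```

Under this definition a prediction identical to the ground truth scores approximately 1, and a fully inverted prediction scores approximately 0.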

4.1.2. Structure Measure

The S-measure evaluates the similarity between the predicted map and the ground truth from a structural perspective, combining two aspects: object-aware and region-aware. It is defined as:

S_{α} = α \times S_{o} + (1 - α) \times S_{r},

(23)

where

S_{o}

represents the structural similarity at the object level,

S_{r}

represents the structural similarity at the region level, and

α \in [0, 1]

is a weighting coefficient, typically set to

α = 0.5

.

4.1.3. F-Measure

The F-measure is the weighted harmonic mean of precision and recall, used to comprehensively evaluate model performance:

F_{β} = \frac{(1 + β^{2}) \times Precision \times Recall}{β^{2} \times Precision + Recall} .

(24)

4.1.4. Weighted F-Measure

To overcome the deficiencies of the traditional F-measure regarding interpolation defects and spatial dependencies, the weighted F-measure was introduced. This metric incorporates pixel-level weights into the calculation, placing greater emphasis on the influence of spatial structural information and error distribution, thereby improving the robustness of the model performance evaluation.

4.1.5. Mean Absolute Error

The MAE measures the average difference between the predicted map and the ground truth at the pixel level; a smaller value indicates that the predicted map is closer to the real image, signifying better model performance. Its formula is:

M A E = \frac{1}{W \times H} \sum_{x = 1}^{W} \sum_{y = 1}^{H} | P (x, y) - G (x, y) | .

(25)
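Equations (24) and (25) translate directly into a few lines of NumPy. The sketch below is illustrative only: the binarization threshold of 0.5 and the choice to pass β² in as the argument `beta2` are assumptions, since the text does not state which values were used in the experiments.

```python
import numpy as np

def mae(pred, gt):
    """Eq. (25): mean absolute pixel-wise difference."""
    return float(np.mean(np.abs(np.asarray(pred, float) - np.asarray(gt, float))))

def f_measure(pred, gt, beta2, threshold=0.5):
    """Eq. (24): weighted harmonic mean of precision and recall after binarization."""
    pred_bin = np.asarray(pred) >= threshold
    gt_bin = np.asarray(gt) >= 0.5
    tp = np.sum(pred_bin & gt_bin)                # true-positive pixels
    precision = tp / max(np.sum(pred_bin), 1)     # guard against empty predictions
    recall = tp / max(np.sum(gt_bin), 1)          # guard against empty ground truth
    denom = beta2 * precision + recall
    if denom == 0:
        return 0.0
    return float((1.0 + beta2) * precision * recall / denom)

pred = np.array([[1.0, 0.0], [1.0, 1.0]])
gt = np.array([[1.0, 0.0], [0.0, 1.0]])
print(mae(pred, gt), f_measure(pred, gt, beta2=1.0))
```

In this toy case one of four pixels is wrong, so the MAE is 0.25. Precision is 2/3 and recall is 1, which gives an F-measure of 0.8 when β² = 1.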

4.2. Experimental Results and Analysis

4.2.1. Results on VT821 Dataset

To show the differences in algorithm performance intuitively, Figure 4 presents a visual comparison of the proposed method against mainstream RGB-T saliency detection algorithms such as CSRNet, MIDD, and OSRNet in typical multi-modal scenarios. The GT (ground truth) column is the true saliency annotation map, providing a standard for objectively assessing how accurately each algorithm restores object boundaries and identifies salient regions. Traditional models like CSRNet, MIDD, and OSRNet achieve basic object highlighting in scenes with simple structures and high contrast between the target and background. However, under complex background interference or multi-object overlapping occlusion, their limitations become apparent: salient regions appear severely fragmented, object boundaries are blurred, and key targets are sometimes missed. Newer models like SwinNet, TNet, and LSNet extract object contours better, but they still show obvious shortcomings in fine detail depiction and overall object coherence. For example, when processing objects with complex textures or irregular shapes, these models struggle to capture all the detailed features of the targets, resulting in fractures or omissions in the saliency predictions. In contrast, the proposed method demonstrates excellent detection performance in all test scenarios. For both small targets and low-contrast weak targets, the model outputs clear and complete salient region prediction maps. It accurately outlines object edges and effectively suppresses background noise, achieving fine detail expression and coherent object representation.

The quantitative evaluation results on the VT821 dataset are shown in Table 1. The comparison of key metrics between the proposed method and mainstream RGB-T saliency detection models highlights clear advantages. On the E-measure (enhanced-alignment measure), the proposed method ranks first with a value of 0.934, a 2.3% relative improvement over the runner-up model DFENet (0.913), indicating that the model achieves a better balance between global structural alignment and local detail consistency. On the S-measure (structure measure), it ties with DFENet for the highest value at 0.908, indicating that both are at the same leading level in characterizing object structural integrity; combining this with the other metrics further reflects the overall advantage of the proposed method. In the

F_{β}

and

ω F_{β}

metrics, this method achieves the best values of 0.875 and 0.869, respectively. Compared with methods like SwinNet (0.844/0.847) and IFFNet (0.849/0.848) in particular, the representation of both high-frequency details (such as small target textures) and low-frequency semantics (such as large target contours) is significantly enhanced. For the MAE (Mean Absolute Error), a core indicator of overall prediction error, this method ranks second with a value of 0.025, slightly higher than DFENet (0.023). Combined with the best or tied-best results on the other four metrics, this indicates that the model achieves more precise salient region localization while reducing background false detections. Across the five metrics, the proposed method achieves a better balance of saliency detection integrity, structural consistency, detail resolution, and error control through cross-modal feature fusion and refined frequency-domain modeling, validating its effectiveness and robustness.

4.2.2. Results on VT1000 Dataset

Figure 5 displays visual comparison results in typical scenarios of the VT1000 dataset, covering diverse saliency detection tasks such as a single person standing, animal contours, circular lifebuoys, distant weak targets, inscription structures, and box arrangements. Comparison with the ground truth saliency annotations (GT) clearly shows how the methods differ in salient region localization, boundary depiction, and background suppression. In the “person standing” scenario in the first row, methods like CSRNet, MIDD, and OSRNet can roughly outline the target area, but their edges generally suffer from blurring and expansion. The “circular lifebuoy” in the third row is a medium-sized target with a closed shape, suitable for examining the model’s edge closure. LSNet and TNet show significant fractures in the inner circle area; LAFB and DFENet preserve the overall form, but their edges are thick and exhibit fusion redundancy. The “multi-inscription structure” in the fifth row is a multi-target arrangement, which challenges saliency consistency and segmentation precision. Apart from CSRNet and MIDD, most methods predict the overall structure relatively well but suffer from partial target fusion or blurred edges. The proposed method maintains clear boundaries between all sub-targets, demonstrating good saliency separation capability, and performs best in edge details and shape recovery. Across these typical scenarios, it demonstrates superior salient region consistency, edge clarity, and target coherence, and handles the interference and compensation relationships between multi-modal data more stably.

From the quantitative results in Table 2, it is evident that early methods like CSRNet, MIDD and OSRNet lag relatively behind in multi-metric performance. Taking MIDD as an example, its MAE metric is as high as 0.042, reflecting significant defects in the fusion processing of the infrared modality, which easily leads to blurring of salient regions or false activation of the background. Although OSRNet shows slight improvements in some metrics, the fragmentation of salient regions in complex structure scenarios is obvious, resulting in lower S-measure (0.874) and

ω F_{β}

(0.841) metrics, exposing its deficiency in integrating multi-modal spatial structural information. SwinNet and TNet enhanced global modeling capabilities by introducing transformer architectures, improving the coherence of salient regions, with

F_{β}

metrics rising to 0.848 and 0.862 respectively. However, because they lack collaborative regulation mechanisms for inter-modal frequency-domain features and spatial structures, their performance stability in complex scenarios still needs improvement. In contrast, the proposed algorithm performs best across all metrics: the E-measure is 0.951, the S-measure is 0.940, and the MAE is as low as 0.016, with a 3.3% improvement over the runner-up method DFENet in the E-measure. This result indicates that, through deep collaborative modeling of inter-modal frequency and spatial features, the algorithm effectively improves salient region integrity, structural consistency, and background suppression, with overall performance surpassing current mainstream methods in cross-modal saliency detection.

4.2.3. Results on VI-RGBT1500 Dataset

Figure 6 displays the visual detection results on the VI-RGBT1500 dataset, covering multiple typical and challenging scenarios such as small object detection, complex background interference, irregularly shaped objects, occlusion interference, and low-contrast environments. Comparison with the ground truth (GT) shows the performance differences of the methods in salient region localization, boundary depiction, object integrity, and background suppression. In the small object detection scenario in the first row, methods like CSRNet, MIDD, and OSRNet detect the rough target area, but the object edges in their prediction maps are blurred, the shapes are unclear, and obvious pseudo-responses exist, revealing weak perception of small-scale targets and, in particular, poor boundary preservation. TNet and LSNet improve boundary expression, but their target contours remain inconsistent with the annotations, and their predictions lack fineness and stability. In the irregular object detection task in the fifth row, many comparison methods struggle to maintain the integrity of the overall structure for objects with complex shapes and irregular contours, often producing edge distortion, missing edges, or over-smoothing, which degrades the representation of object morphology. In areas with strong background interference or unclear saliency edges, traditional methods are especially prone to salient region overflow or loss of internal target details. In contrast, the proposed method exhibits superior visual performance in all scenarios. For fine localization of small objects, separation under complex backgrounds, and edge depiction and shape restoration alike, it generates clear, compact, and structurally complete prediction maps.

The quantitative experimental results on the VI-RGBT1500 dataset are shown in Table 3. The proposed method performs best overall across the five mainstream evaluation metrics, validating its performance in multi-modal saliency detection tasks. It achieved the highest scores of 0.926, 0.876, and 0.874 in the E-measure,

F_{β}

, and

ω F_{β}

metrics respectively, significantly outperforming existing advanced methods like DFENet and IFFNet. This indicates that the method has clear advantages in maintaining the overall structure of salient regions and in target detection accuracy. It also reached 0.912 on the S-measure, on par with the current best method, which further reflects its strong global and local structural perception and its ability to capture the spatial layout and morphological features of salient targets. Furthermore, the proposed method achieved the lowest MAE of 0.020, better than all comparison algorithms, indicating smaller errors in saliency map prediction and more refined boundary depiction. Taken together, the metrics show that the proposed method offers stronger robustness and detection precision in scenarios involving complex backgrounds, small targets, and significant modal differences, supporting its adaptability in practical applications.

4.2.4. Results on VT5000 Dataset

Figure 7 shows six sets of comparison examples from the VT5000 dataset. From the complex background interference in the second row and the extremely low light conditions in the sixth row, it is visible that methods like CSRNet and SwinNet generally suffer from background confusion and incomplete object detection. In scenarios where chair structures exist within strong interfering background textures, except for DFENet and LAFB, which perform relatively better, most methods exhibit salient region omission or edge drift phenomena. In contrast, the method in this paper demonstrates excellent structural perception capabilities in detailed areas like chair backs and legs; it not only positions edges accurately but also effectively recovers occluded areas, demonstrating stronger robustness and perception precision.

The quantitative experimental results on the VT5000 dataset are shown in Table 4. This dataset covers various challenging scenarios such as complex background interference and low light, posing high demands on the robustness and detail perception capabilities of saliency detection algorithms. From the table, it can be observed that CSRNet and MIDD have significant bottlenecks in core performance, reflecting the dual defects of such methods in target structure maintenance and edge positioning precision, especially in low-contrast scenarios where salient region blurring easily occurs.

Although improved models like TNet and LSNet have made some progress by optimizing feature extraction paths, limited by the local perception characteristics of traditional convolutional architectures, they have shortcomings in cross-modal global semantic correlation modeling, and their comprehensive performance still lags behind new-generation fusion schemes. DFENet and LAFB demonstrate stronger multi-modal feature fusion capabilities by introducing attention mechanisms and multi-scale feature interaction, approaching advanced levels in

F_{β}

and

ω F_{β}

metrics, indicating the effectiveness of dynamic feature selection mechanisms for performance improvement. However, the proposed method achieves the best results in all five metrics: the E-measure, S-measure,

F_{β}

,

ω F_{β}

, and MAE. Specifically, compared to DFENet, the E-measure improved by 0.2%, and the S-measure by 0.4%; compared to LAFB,

F_{β}

improved by 1.5%. It performs particularly well in edge preservation and regional consistency, validating its stronger perception capability and stability in complex scenarios.

4.2.5. Computational Efficiency Analysis

To comprehensively evaluate the practical applicability of the proposed method, we compare its computational complexity and inference speed with state-of-the-art methods on the VT5000 dataset. As presented in Table 5, our SCFusion achieves the best performance with an E-measure of 0.920, while maintaining a highly efficient design.

Specifically, compared to the top-performing competitors such as DFENet and LAFB, our method demonstrates a significant advantage in efficiency. DFENet, despite achieving a high E-measure of 0.918, carries a heavy computational burden (398.05 M parameters and 552.1 G FLOPs) and a low inference speed of 37 FPS. In sharp contrast, our method reduces the parameter count by approximately 11× (36.04 M) and FLOPs by roughly 95× (5.78 G), while boosting the inference speed to 277 FPS, about 7.5× that of DFENet.
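The efficiency ratios quoted above follow directly from the reported figures and can be re-derived with a few lines of Python. The dictionary keys are illustrative names; the numbers are the ones stated in the text for DFENet and SCFusion.

```python
# Figures as reported in the text (parameters in millions, FLOPs in giga)
dfenet = {"params_m": 398.05, "flops_g": 552.1, "fps": 37}
scfusion = {"params_m": 36.04, "flops_g": 5.78, "fps": 277}

param_ratio = dfenet["params_m"] / scfusion["params_m"]   # parameter reduction factor
flops_ratio = dfenet["flops_g"] / scfusion["flops_g"]     # FLOPs reduction factor
speedup = scfusion["fps"] / dfenet["fps"]                 # inference speed factor

print(f"params: {param_ratio:.2f}x, FLOPs: {flops_ratio:.2f}x, speed: {speedup:.2f}x")
```

The ratios come out at about 11.0× for parameters, 95.5× for FLOPs, and 7.5× for frames per second, in line with the rounded factors given in the paragraph.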

Furthermore, compared to the transformer-based SwinNet (22 FPS), our SCFusion achieves real-time speeds, validating the efficiency of our frequency-domain token mixing strategy over quadratic self-attention mechanisms. Although the lightweight LSNet achieves slightly higher FPS (314 FPS), it comes at the cost of detection accuracy (E-measure of 0.898). Our method strikes an optimal trade-off, delivering state-of-the-art accuracy with computational costs suitable for real-time deployment.

4.3. Robustness Analysis in Hazy and Complex Scenarios

To evaluate the model’s performance under adverse weather conditions (e.g., haze, fog) and sudden visual changes (e.g., occlusion, illumination flare), we visualized the intermediate feature maps of the proposed network. As shown in Figure 8, we selected four representative challenging scenes from the M3FD.

In scenarios with heavy fog, the visible images lose most textural details and contrast. However, the infrared modality remains unaffected. The visualization of the FREFormer output shows that the encoder successfully extracts initial global contexts. Subsequently, the HFRB module significantly suppresses the background noise (fog) by recalibrating the frequency components, effectively highlighting the target areas (e.g., cars and pedestrians) by leveraging the thermal signature. Row 3 depicts a night scene with strong glare (sudden visual change), and Row 4 shows targets occluded by smoke and trees. The FRGM in the decoder stage plays a crucial role here. By reconstructing high-frequency edge details, it sharpens the target boundaries that are otherwise blurred in the RGB image. The final feature maps demonstrate that our method maintains robust perception and precise localization even when one modality is severely corrupted.

4.4. Ablation Study

To comprehensively verify the effectiveness of the proposed algorithm, a systematic ablation study was conducted on the VT821 and VT5000 datasets, evaluating the Homogeneous Frequency Refined Block (HFRB), Heterogeneous Spatial-Channel Frequency Fusion Block (HSFB), and Frequency Reconstruction Guided Module (FRGM). The quantitative results in Table 6 and Table 7 demonstrate that the complete configuration achieves optimal performance across all metrics. Specifically, on the VT5000 dataset, the full model reached an E-measure of 0.920, an S-measure of 0.909, and the lowest

M A E

of 0.023. These results confirm that the absence of any module causes varying degrees of performance degradation, validating the rationality and complementarity of the module designs in improving target boundary restoration and suppressing background interference.

The HFRB module plays a pivotal role in adaptively extracting salient features within the frequency domain, particularly in complex scenarios. Experimental results demonstrate that removing the HFRB module (refer to Row 1 in the tables) leads to a substantial deterioration in precision metrics. For instance, on the VT821 dataset, the exclusion of the HFRB results in a decrease in the

F_{β}

score from 0.875 (full model) to 0.862, while the

M A E

significantly increases from 0.025 to 0.029. This quantitative decline confirms that without the HFRB, the model struggles to effectively distinguish high-response regions, especially when the spectral characteristics of salient objects are similar to those of the background.

The HSFB is indispensable for integrating cross-modal frequency-domain information. As shown in Row 2, the absence of this module severely compromises the model’s fusion capability, which is reflected in the notable decline of structural and enhancement metrics. On the VT821 dataset, removing the HSFB results in the lowest S-measure of 0.896 (compared to 0.908 for the full model) and a reduction in the E-measure to 0.910. Similarly, on the VT5000 dataset, the

M A E

escalates to 0.034, the highest value among all variants. These statistics underscore the necessity of the HSFB for maintaining structural integrity and ensuring effective cross-modal fusion.

The FRGM is designed to ensure inter-modal coordination and prevent saliency shift. Ablation results indicate that eliminating the FRGM (Row 3 in the tables) impairs this coordination capability. Specifically, on the VT5000 dataset, the absence of the FRGM leads to a drop in the E-measure from 0.920 to 0.914, accompanied by an increase in the

M A E

from 0.023 to 0.028. Although the performance degradation is slightly less severe compared to the removal of the primary fusion blocks, the consistent decline across both datasets validates that the FRGM plays an essential role in maintaining salient region consistency and refining boundary details.

To pinpoint the specific factors driving our performance gains, we conducted a component-wise ablation study on the internal mechanisms of LSFG and the HSFB, as detailed in Table 8.

We first investigate the learnable selective filter (LSF) within the LSFG module, which is designed to replace traditional static filters. By substituting our dynamic generator with a fixed “static filter”, we observed a sharp performance decline, with the E-measure dropping from 0.920 to 0.833. This substantial gap confirms that the improvement stems not merely from operating in the frequency domain, but specifically from the input-adaptive nature of our design. Unlike static filters, which struggle with non-stationary scenes containing varying object scales and clutter, our selective mechanism dynamically adjusts spectral weights. This allows the model to surgically sharpen boundaries or suppress noise tailored to each specific image instance.

Subsequently, we dissect the Heterogeneous Spatial-Channel Frequency Fusion Block (HSFB) to verify the contribution of its dual-branch design. Removing the Spatial Branch—which processes phase information via 2D FFT—leads to a significant drop in the S-measure (from 0.909 to 0.891). This result underscores the branch’s critical role in structural alignment, effectively correcting spatial shifts between infrared and visible modalities. Conversely, excluding the Channel Branch—which handles amplitude via CFFT—causes a notable decrease in

F_{β}

(from 0.867 to 0.831), indicating that channel-wise interaction is indispensable for feature selection and saliency enhancement. Ultimately, the full model achieves the best performance, demonstrating that the spatial and channel operations are mutually complementary: one guarantees structural precision, while the other ensures semantic strength.

5. Conclusions

In this paper, we propose a novel infrared and visible image object detection algorithm grounded in a frequency-domain multi-scale refinement framework. We introduce the FREFormer as a hierarchical encoder, which efficiently captures global spatial dependencies via an adaptive selective filter in the frequency domain. To mitigate cross-modal interference, we design the Cross-Frequency Guided Interaction Module (CFGIM), integrating the Homogeneous Frequency Refined Block (HFRB) for dual-branch channel calibration and the Heterogeneous Spatial-Channel Frequency Fusion Block (HSFB) for robust spatial-channel fusion. Furthermore, we embed the Frequency Reconstruction Guided Module (FRGM) into a four-stage decoder, enabling the progressive injection of multi-band frequency components to enhance target edge localization. Finally, a joint loss function incorporating Frequency Fidelity Loss ensures the precise reconstruction of sharp boundaries and structural integrity. Experimental results demonstrate that our method effectively coordinates complementary information from dual modalities, achieving superior performance in complex environmental conditions.

In future work, we aim to further optimize the computational efficiency of the frequency-domain modules to facilitate real-time detection scenarios and explore the integration of temporal features for handling dynamic video sequences.

Author Contributions

Conceptualization, G.X. and Y.S.; methodology, G.X. and Y.S.; software, G.X. and Y.S.; validation, Y.S. and K.C.; formal analysis, G.X. and Y.S.; investigation, Y.S. and K.C.; resources, Y.Y. and L.D.; data curation, Y.S. and Y.Y.; writing – original draft preparation, G.X. and Y.S.; writing – review and editing, G.X., Y.S. and H.Z.; visualization, Y.S.; supervision, G.X. and H.Z.; project administration, G.X. and H.Z.; funding acquisition, H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grant Numbers 62401292 and 62476140, the Postdoctoral Fellowship Program of China Postdoctoral Science Foundation under Grant Number GZC20240745, the Jiangsu Funding Program for Excellent Postdoctoral Talent under Grant Number 2024ZB682, and the Natural Science Research Start-up Foundation of Recruiting Talents of Nanjing University of Posts and Telecommunications (Grant No. NY225048).

Data Availability Statement

The data presented in this study are available in [VT821] at https://pan.baidu.com/s/1bpEaeQV (accessed on 12 November 2025), [VT1000] at https://www.kaggle.com/datasets/cindystoila/vt1000-aug (accessed on 12 November 2025), [VI-RGBT1500] at https://github.com/huanglm-me/VI-RGBT1500 (accessed on 12 November 2025), [VT5000] at https://www.kaggle.com/datasets/stoilacindy/vt5000-new (accessed on 12 November 2025), and [M3FD] at https://www.kaggle.com/datasets/nus1998/m3fd-dataset (accessed on 12 November 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Sun, H.; Liu, Q.; Wang, J.; Ren, J.; Wu, Y.; Zhao, H.; Li, H. Fusion of infrared and visible images for remote detection of low-altitude slow-speed small targets. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 2971–2983.
Wang, G.; Li, C.; Ma, Y.; Zheng, A.; Tang, J.; Luo, B. RGB-T Saliency Detection Benchmark: Dataset, Baselines, Analysis and a Novel Approach. In Proceedings of the Image and Graphics Technologies and Applications: 13th Conference on Image and Graphics Technologies and Applications, IGTA 2018, Beijing, China, 8–10 April 2018; Revised Selected Papers; Springer: Singapore, 2018; Volume 13, pp. 359–369.
Hwang, S.; Park, J.; Kim, N.; Choi, Y.; So Kweon, I. Multispectral pedestrian detection: Benchmark dataset and baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2015; pp. 1037–1045.
Qingyun, F.; Dapeng, H.; Zhaokui, W. Cross-modality fusion transformer for multispectral object detection. arXiv 2021, arXiv:2111.00273.
Wagner, J.; Fischer, V.; Herman, M.; Behnke, S. Multispectral pedestrian detection using deep fusion convolutional neural networks. In Proceedings of the ESANN, Bruges, Belgium, 27–29 April 2016; Volume 587, pp. 509–514.
Li, C.; Song, D.; Tong, R.; Tang, M. Illumination-aware faster R-CNN for robust multispectral pedestrian detection. Pattern Recognit. 2019, 85, 161–171.
Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
Lee, W.Y.; Jovanov, L.; Philips, W. Cross-modality attention and multimodal fusion transformer for pedestrian detection. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 608–623.
Xiang, X.; Zhou, G.; Niu, B.; Pan, Z.; Huang, L.; Li, W.; Wen, Z.; Qi, J.; Gao, W. Infrared-Visible Image Fusion Meets Object Detection: Towards Unified Optimization for Multimodal Perception. Remote Sens. 2025, 17, 3637.
Tatsunami, Y.; Taki, M. FFT-based Dynamic Token Mixer for Vision. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI: Washington, DC, USA, 2024; Volume 38, pp. 15328–15336.
Rao, Y.; Zhao, W.; Zhu, Z.; Zhou, J.; Lu, J. GFNet: Global filter networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10960–10973.
Lyu, H.; Sha, N.; Qin, S.; Yan, M.; Xie, Y.; Wang, R. Manifold denoising by nonlinear robust principal component analysis. Adv. Neural Inf. Process. Syst. 2019, 32, 13390–13400.
Tu, Z.; Ma, Y.; Li, Z.; Li, C.; Xu, J.; Liu, Y. RGBT Salient Object Detection: A Large-Scale Dataset and Benchmark. IEEE Trans. Multimed. 2022, 25, 4163–4176.
Dai, Y.; Gieseke, F.; Oehmcke, S.; Wu, Y.; Barnard, K. Attentional feature fusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; IEEE: New York, NY, USA, 2021; pp. 3560–3569.
Zhang, L.; Zhu, X.; Chen, X.; Yang, X.; Lei, Z.; Liu, Z. Weakly aligned cross-modal learning for multispectral pedestrian detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2019; pp. 5127–5137.
Ma, J.; Tang, L.; Fan, F.; Huang, J.; Mei, X.; Ma, Y. SwinFusion: Cross-domain long-range learning for general image fusion via swin transformer. IEEE/CAA J. Autom. Sin. 2022, 9, 1200–1217.
Huo, F.; Zhu, X.; Zhang, L.; Liu, Q.; Shu, Y. Efficient Context-Guided Stacked Refinement Network for RGB-T Salient Object Detection. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 3111–3124.
Liu, Z.; Tan, Y.; He, Q.; Xiao, Y. SwinNet: Swin Transformer Drives Edge-Aware RGB-D and RGB-T Salient Object Detection. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 4486–4497.
Tu, Z.; Li, Z.; Li, C.; Lang, Y.; Tang, J. Multi-Interactive Dual-Decoder for RGB-Thermal Salient Object Detection. IEEE Trans. Image Process. 2021, 30, 5678–5691.
Huang, Y.; Li, H.; Wang, G.; Xu, D.; Yang, J.; Yue, G.; Liu, X. DPSNET: A Dual-path Lightweight Network with Semantic-guided Cross-modal Feature Fusion for Multi-modal Object Detection. IEEE Trans. Instrum. Meas. 2025, 74, 5045817.
Cai, Z.; Ma, Y.; Huang, J.; Mei, X.; Fan, F. Correlation-guided discriminative cross-modality features network for infrared and visible image fusion. IEEE Trans. Instrum. Meas. 2023, 73, 5002718.
Cong, R.; Zhang, K.; Zhang, C.; Zheng, F.; Zhao, Y.; Huang, Q.; Kwong, S. Does Thermal Really Always Matter for RGB-T Salient Object Detection? IEEE Trans. Multimed. 2023, 25, 6971–6982.
Zhou, W.; Zhu, Y.; Lei, J.; Yang, R.; Yu, L. LSNet: Lightweight Spatial Boosting Network for Detecting Salient Objects in RGB-Thermal Images. IEEE Trans. Image Process. 2023, 32, 1329–1340.
Song, K.; Bao, Y.; Wang, H.; Huang, L.; Yan, Y. A Potential Vision-Based Measurements Technology: Information Flow Fusion Detection Method Using RGB-Thermal Infrared Images. IEEE Trans. Instrum. Meas. 2023, 72, 5004813.
Lyu, P.; Yu, X.; Wu, C.; Rajapakse, J.C. Deep Fourier-Embedded Network for Bi-Modal Salient Object Detection. arXiv 2024, arXiv:2411.18409.
Zhang, Y.; Gao, H.; Sohel, F.; Wu, F.; Muzahid, A.A.M.; Zhao, J.; Du, Z.; Zhang, L. Multimodal Stream Focusing Salient Object Detection Based on Visible–Infrared Complementary Fusion. IEEE Trans. Instrum. Meas. 2025, 74, 5046614.
Zhou, H.; Hong, W.; Zhang, Z.; Liu, X.; Wu, X.J. Lightweight Spatial-Channel-Frequency Network for RGB-Thermal Salient Object Detection. IEEE Signal Process. Lett. 2025, 32, 4009–4013.
Wu, W.; Zhang, X.; Yin, H.; Dai, S.; Zhang, H.; Zhang, Y. FreDFT: Frequency Domain Fusion Transformer for Visible-Infrared Object Detection. arXiv 2025, arXiv:2511.10046.
Li, K.; Wang, D.; Hu, Z.; Li, S.; Ni, W.; Zhao, L.; Wang, Q. Fd2-net: Frequency-driven feature decomposition network for infrared-visible object detection. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI: Washington, DC, USA, 2025; Volume 39, pp. 4797–4805.
Yu, W.; Luo, M.; Zhou, P.; Si, C.; Zhou, Y.; Wang, X.; Feng, J.; Yan, S. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2022; pp. 10819–10829.
Canny, J. A Computational Approach to Edge Detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, PAMI-8, 679–698.
Tang, J.; Fan, D.; Wang, X.; Tu, Z.; Li, C. RGBT salient object detection: Benchmark and a novel cooperative ranking approach. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 4421–4433.
Tu, Z.; Xia, T.; Li, C.; Wang, X.; Ma, Y.; Tang, J. RGB-T Image Saliency Detection via Collaborative Graph Learning. IEEE Trans. Multimed. 2019, 22, 160–173.
Song, K.; Huang, L.; Gong, A.; Yan, Y. Multiple Graph Affinity Interactive Network and a Variable Illumination Dataset for RGBT Image Salient Object Detection. IEEE Trans. Circuits Syst. Video Technol. 2022, 33, 3104–3118.
Huo, F.; Zhu, X.; Zhang, Q.; Liu, Z.; Yu, W. Real-Time One-Stream Semantic-Guided Refinement Network for RGB-Thermal Salient Object Detection. IEEE Trans. Instrum. Meas. 2022, 71, 2512512.
Wang, K.; Tu, Z.; Li, C.; Zhang, C.; Luo, B. Learning Adaptive Fusion Bank for Multi-Modal Salient Object Detection. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 7344–7358.
Fan, D.P.; Gong, C.; Cao, Y.; Ren, B.; Cheng, M.M.; Borji, A. Enhanced-Alignment Measure for Binary Foreground Map Evaluation. arXiv 2018, arXiv:1805.10421.
Fan, D.P.; Cheng, M.M.; Liu, Y.; Li, T.; Borji, A. Structure-Measure: A New Way to Evaluate Foreground Maps. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: New York, NY, USA, 2017; pp. 4548–4557.
Achanta, R.; Hemami, S.; Estrada, F.; Susstrunk, S. Frequency-Tuned Salient Region Detection. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2009; pp. 1597–1604.
Margolin, R.; Zelnik-Manor, L.; Tal, A. How to Evaluate Foreground Maps? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2014; pp. 248–255.
Perazzi, F.; Krähenbühl, P.; Pritch, Y.; Hornung, A. Saliency Filters: Contrast Based Filtering for Salient Region Detection. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2012; pp. 733–740.
Liu, J.; Fan, X.; Huang, Z.; Wu, G.; Liu, R.; Zhong, W.; Luo, Z. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2022; pp. 5802–5811.

Figure 1. Overall architecture of the proposed infrared and visible image object detection algorithm. The framework comprises four core components: the FREFormer Block for feature extraction, the Homogeneous Frequency Refined Block (HFRB) for frequency-domain enhancement, the Heterogeneous Spatial-Channel Frequency Fusion Block (HSFB) for frequency-domain selection, and the Frequency Reconstruction Guided Module (FRGM) for multi-modal feature fusion and reconstruction.

Figure 2. Structural details of the FREFormer Encoder. This architecture adopts a hierarchical design consisting of stacked FREFormer Blocks with downsampling layers for multi-stage feature extraction. The central component in the FREFormer is the learnable selective filter generator, which transforms input features into the frequency domain via 2D FFT. The processed frequency-domain features are then mapped back to the spatial domain through 2D IFFT, followed by a Channel MLP for inter-channel feature mixing.
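
The data flow described in this caption (2D FFT, frequency-domain filtering, 2D IFFT, Channel MLP) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the FREFormer Block itself: the filter here is a fixed complex array rather than the output of the learnable selective filter generator, and the residual connections, the ReLU activation, the (H, W, C) layout, and all names are hypothetical.

```python
import numpy as np

def frequency_filter_block(x, filt, w1, w2):
    """Sketch of a frequency-domain token mixer followed by a channel MLP.

    x:    (H, W, C) real-valued features
    filt: (H, W, C) complex filter (stand-in for a learned selective filter)
    w1:   (C, D) and w2: (D, C) channel-MLP weights
    """
    spec = np.fft.fft2(x, axes=(0, 1))            # spatial domain -> frequency domain
    spec = spec * filt                            # element-wise filtering of each frequency
    y = np.fft.ifft2(spec, axes=(0, 1)).real      # back to the spatial domain (imaginary part dropped)
    y = x + y                                     # residual connection (assumed)
    hidden = np.maximum(y @ w1, 0.0)              # channel mixing with ReLU (assumed)
    return y + hidden @ w2                        # second residual connection (assumed)

rng = np.random.default_rng(0)
H, W, C, D = 8, 8, 4, 16
x = rng.standard_normal((H, W, C))
filt = rng.standard_normal((H, W, C)) + 1j * rng.standard_normal((H, W, C))
w1 = 0.1 * rng.standard_normal((C, D))
w2 = 0.1 * rng.standard_normal((D, C))

out = frequency_filter_block(x, filt, w1, w2)

# With an all-ones filter and a zeroed MLP output, the block reduces to x + x.
identity_case = frequency_filter_block(x, np.ones((H, W, C)), w1, np.zeros((D, C)))
```

Because each frequency coefficient depends on every spatial position, a single element-wise multiplication in the frequency domain mixes information across the whole feature map, which is how such a filter obtains a global receptive field.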

Figure 3. Architecture of the CFGIM. The left panel illustrates the Homogeneous Frequency Refined Block (HFRB), which extracts multi-scale features via global weights $w_{global}$ and Hadamard products (HADs). The right panel displays the Heterogeneous Spatial-Channel Frequency Fusion Block (HSFB), designed to fuse frequency-domain features across spatial and channel dimensions using 2D FFT, Channel Fourier Transform (CFFT), and complex operations.

Figure 4. Qualitative comparison results on the VT821 dataset. The proposed algorithm produces clearer target edges and less background interference than the other methods.

Figure 5. Qualitative comparison results on the VT1000 dataset. The proposed algorithm maintains a complete target structure in occluded and irregular target scenes.

Figure 6. Qualitative comparison results on the VI-RGBT1500 dataset. The proposed algorithm adapts to diverse illumination scenes and maintains high detection accuracy.

Figure 7. Qualitative comparison results on the VT5000 dataset. The proposed algorithm accurately detects small and occluded targets in large-scale complex scenes.

Figure 8. Visualization of feature evolution under challenging scenarios (e.g., heavy fog, haze, strong glare, and smoke occlusion). The columns from left to right display: the Input RGB Image, the Input Infrared Image, the feature after the encoder (FREFormer), the feature after the HFRB, and the feature after the decoder (FRGM). It can be observed that even when the visible spectrum is severely degraded, the proposed modules effectively integrate infrared information to highlight salient targets and suppress background interference.

Table 1. Quantitative results on the VT821 dataset (best results in bold, ↑ denotes that higher values correspond to better performance and ↓ denotes that lower values correspond to better performance).

Algorithm	E-Measure ↑	S-Measure ↑	$F_{β}$ ↑	$ω F_{β}$ ↑	MAE ↓
CSRNet [19]	0.864	0.860	0.830	0.826	0.036
MIDD [35]	0.892	0.874	0.819	0.806	0.046
OSRNet [18]	0.869	0.866	0.822	0.801	0.042
SwinNet [22]	0.900	0.893	0.844	0.847	0.030
TNet [23]	0.895	0.879	0.832	0.842	0.031
LSNet [24]	0.879	0.874	0.825	0.813	0.034
IFFNet [25]	0.901	0.882	0.849	0.848	0.029
DFENet [36]	0.913	0.904	0.866	0.862	0.023
LAFB [2]	0.891	0.889	0.861	0.863	0.026
Ours	0.934	0.908	0.875	0.868	0.025
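
For readers unfamiliar with the metric columns, the sketch below computes MAE and a fixed-threshold F-measure for a single predicted saliency map. It is illustrative only: the threshold of 0.5, the value beta_sq = 0.3 (a common convention in salient object detection), and the toy arrays are assumptions, and the evaluation protocol behind the tables may differ, for example by using adaptive thresholds.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a saliency map and a ground-truth mask, both in [0, 1]."""
    return float(np.mean(np.abs(pred.astype(np.float64) - gt.astype(np.float64))))

def f_beta(pred, gt, threshold=0.5, beta_sq=0.3, eps=1e-8):
    """F-measure at a fixed threshold (threshold and beta_sq are assumed values)."""
    binary = pred >= threshold
    mask = gt >= 0.5
    true_pos = np.logical_and(binary, mask).sum()
    precision = true_pos / (binary.sum() + eps)
    recall = true_pos / (mask.sum() + eps)
    return float((1 + beta_sq) * precision * recall / (beta_sq * precision + recall + eps))

# Toy example: a 2x2 object in a 4x4 image, with one false-positive pixel.
gt = np.zeros((4, 4))
gt[1:3, 1:3] = 1.0
pred = gt.copy()
pred[0, 0] = 1.0

toy_mae = mae(pred, gt)      # one wrong pixel out of 16
toy_f = f_beta(pred, gt)     # precision 4/5, recall 1
```

Weighting precision more heavily than recall (beta_sq below 1) penalizes maps that light up background regions, which is why a single false-positive pixel already lowers the score noticeably in this toy case.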

Table 2. Quantitative results on the VT1000 dataset (best results in bold, ↑ denotes that higher values correspond to better performance and ↓ denotes that lower values correspond to better performance).

Algorithm	E-Measure ↑	S-Measure ↑	$F_{β}$ ↑	$ω F_{β}$ ↑	MAE ↓
CSRNet [19]	0.877	0.873	0.841	0.850	0.036
MIDD [35]	0.894	0.890	0.843	0.849	0.042
OSRNet [18]	0.885	0.874	0.851	0.841	0.037
SwinNet [22]	0.900	0.898	0.848	0.838	0.028
TNet [23]	0.903	0.895	0.862	0.860	0.028
LSNet [24]	0.896	0.892	0.854	0.852	0.031
IFFNet [25]	0.911	0.906	0.869	0.872	0.025
DFENet [36]	0.921	0.912	0.882	0.881	0.019
LAFB [2]	0.914	0.913	0.878	0.876	0.021
Ours	0.951	0.940	0.918	0.922	0.016

Table 3. Quantitative results on the VI-RGBT1500 dataset [34] (best results in bold, ↑ denotes that higher values correspond to better performance and ↓ denotes that lower values correspond to better performance).

Algorithm	E-Measure ↑	S-Measure ↑	$F_{β}$ ↑	$ω F_{β}$ ↑	MAE ↓
CSRNet [19]	0.857	0.842	0.804	0.820	0.039
MIDD [35]	0.834	0.838	0.811	0.819	0.037
OSRNet [18]	0.891	0.855	0.807	0.829	0.031
SwinNet [22]	0.873	0.824	0.821	0.832	0.038
TNet [23]	0.900	0.862	0.835	0.857	0.034
LSNet [24]	0.911	0.894	0.826	0.857	0.031
IFFNet [25]	0.909	0.896	0.843	0.863	0.028
DFENet [36]	0.922	0.912	0.877	0.870	0.022
LAFB [2]	0.917	0.911	0.864	0.862	0.025
Ours	0.926	0.912	0.876	0.874	0.020

Table 4. Quantitative results on the VT5000 dataset (best results in bold, ↑ denotes that higher values correspond to better performance and ↓ denotes that lower values correspond to better performance).

Algorithm	E-Measure ↑	S-Measure ↑	$F_{β}$ ↑	$ω F_{β}$ ↑	MAE ↓
CSRNet [19]	0.872	0.861	0.819	0.820	0.032
MIDD [35]	0.876	0.866	0.826	0.819	0.033
OSRNet [18]	0.882	0.854	0.822	0.828	0.029
SwinNet [22]	0.871	0.876	0.821	0.830	0.035
TNet [23]	0.894	0.881	0.834	0.824	0.030
LSNet [24]	0.898	0.884	0.843	0.841	0.036
IFFNet [25]	0.904	0.899	0.842	0.842	0.028
DFENet [36]	0.918	0.905	0.861	0.864	0.024
LAFB [2]	0.912	0.906	0.854	0.878	0.024
Ours	0.920	0.909	0.867	0.869	0.023

Table 5. Complexity analysis on the VT5000 dataset (best results in bold, ↑ denotes that higher values correspond to better performance and ↓ denotes that lower values correspond to better performance).

Algorithm	Backbone	Params (M) ↓	FLOPs (G) ↓	FPS ↑	E-Measure ↑
CSRNet [19]	ESPNetV2	256.74	326.04	178	0.872
MIDD [35]	VGG16	200.00	216.72	33	0.876
OSRNet [18]	ResNet50	59.67	51.27	142	0.882
SwinNet [22]	SwinTransformer	88.5	45.25	22	0.871
TNet [23]	ResNet50	256.74	326.04	178	0.894
LSNet [24]	MobileNetV2	17.41	3.04	314	0.898
IFFNet [25]	ResNet50	123.77	241.2	74	0.904
DFENet [36]	VGG16	398.05	552.1	37	0.918
LAFB [2]	ResNet50	223.45	235.66	87	0.912
Ours	FREFormer	36.04	5.78	277	0.920

Table 6. Ablation study results on the VT821 dataset (best results in bold, ↑ denotes that higher values correspond to better performance and ↓ denotes that lower values correspond to better performance).

Components			Metrics
HFRB	HSFB	FRGM	$E_{ξ}$ ↑	$S_{α}$ ↑	$F_{β}$ ↑	$ω F_{β}$ ↑	MAE ↓
×	✓	✓	0.911	0.898	0.862	0.860	0.029
✓	×	✓	0.910	0.896	0.861	0.859	0.032
✓	✓	×	0.912	0.901	0.865	0.862	0.027
✓	✓	✓	0.934	0.908	0.875	0.868	0.025

Table 7. Ablation study results on the VT5000 dataset (best results in bold, ↑ denotes that higher values correspond to better performance and ↓ denotes that lower values correspond to better performance).

Components			Metrics
HFRB	HSFB	FRGM	$E_{ξ}$ ↑	$S_{α}$ ↑	$F_{β}$ ↑	$ω F_{β}$ ↑	MAE ↓
×	✓	✓	0.912	0.897	0.862	0.862	0.030
✓	×	✓	0.911	0.895	0.861	0.860	0.034
✓	✓	×	0.914	0.903	0.865	0.862	0.028
✓	✓	✓	0.920	0.909	0.867	0.869	0.023

Table 8. Component-wise ablation study on VT5000. LSF: learnable selective filter. S-Freq: Spatial Frequency Branch. C-Freq: Channel Frequency Branch (best results in bold, ↑ denotes that higher values correspond to better performance and ↓ denotes that lower values correspond to better performance).

Components			Metrics
LSF	S-Freq	C-Freq	$E_{ξ}$ ↑	$S_{α}$ ↑	$F_{β}$ ↑	MAE ↓
×	✓	✓	0.833	0.802	0.819	0.046
✓	×	✓	0.891	0.878	0.835	0.039
✓	✓	×	0.865	0.863	0.831	0.037
✓	✓	✓	0.920	0.909	0.867	0.023
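
To make the per-component impact in Table 8 easier to read, the short script below computes each ablated variant's change relative to the full model. The values are copied from the table; the dictionary layout and the variant labels are hypothetical naming choices.

```python
# Metrics copied from Table 8 (VT5000); keys are shorthand for the table columns.
full = {"E": 0.920, "S": 0.909, "F": 0.867, "MAE": 0.023}
ablated = {
    "w/o LSF":    {"E": 0.833, "S": 0.802, "F": 0.819, "MAE": 0.046},
    "w/o S-Freq": {"E": 0.891, "S": 0.878, "F": 0.835, "MAE": 0.039},
    "w/o C-Freq": {"E": 0.865, "S": 0.863, "F": 0.831, "MAE": 0.037},
}

# Change relative to the full model: negative is worse for E, S, F; positive is worse for MAE.
deltas = {
    name: {metric: round(value - full[metric], 3) for metric, value in row.items()}
    for name, row in ablated.items()
}

for name, row in deltas.items():
    print(name, row)
```

Read this way, removing the learnable selective filter causes the largest change on every metric, and each of the two frequency branches contributes a smaller but consistent share.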


