Article

MFAFNet: A Multi-Feature Attention Fusion Network for Infrared Small Target Detection

1 Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an 710119, China
2 University of Chinese Academy of Sciences, Beijing 100190, China
3 Xi’an Key Laboratory of Spacecraft Optical Imaging and Measurement Technology, Xi’an 710119, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(17), 3070; https://doi.org/10.3390/rs17173070
Submission received: 24 July 2025 / Revised: 23 August 2025 / Accepted: 25 August 2025 / Published: 3 September 2025

Abstract

Infrared small target detection is a critical task in remote sensing applications, such as aerial reconnaissance, maritime surveillance, and early-warning systems. However, owing to the inherent characteristics of remote sensing imagery, such as complex backgrounds, low contrast, and limited spatial resolution, detecting small-scale, dim infrared targets remains highly challenging. To address these issues, we propose MFAFNet, a novel Multi-Feature Attention Fusion Network tailored for infrared remote sensing scenarios. The network comprises three key modules: a Feature Interactive Fusion Module (FIFM), a Patch Attention Block (PAB), and an Asymmetric Contextual Fusion Module (ACFM). FIFM enhances target saliency by integrating the original infrared image with two locally enhanced feature maps capturing different receptive field scales. PAB exploits global contextual relationships by computing inter-pixel correlations across multi-scale patches, thus improving detection robustness in cluttered remote scenes. ACFM further refines feature representation by combining shallow spatial details with deep semantic cues, alleviating semantic gaps across feature hierarchies. Experimental results on two public remote sensing datasets, SIRST-Aug and IRSTD-1k, demonstrate that MFAFNet achieves excellent performance, with mean IoU values of 0.7465 and 0.6701, respectively, confirming its effectiveness and generalizability in infrared remote sensing image analysis.

1. Introduction

Infrared small target detection (IRSTD) is a crucial task in the field of remote sensing, with widespread applications in early warning systems, ground surveillance, maritime monitoring, and forest fire prevention [1]. As an essential component of infrared search and tracking (IRST) systems, this task benefits from the unique advantages of infrared imaging, such as long-range perception, all-weather operability, and passive sensing, making it highly suitable for remote sensing platforms, including unmanned aerial vehicles, spaceborne satellites, and high-altitude balloons.
Despite these advantages, infrared small target detection in remote sensing imagery faces persistent challenges distinct from conventional object detection. Specifically, due to long-distance imaging, large-scale scene variability, and low signal-to-clutter ratio (SCR), the target typically occupies only a few pixels and lacks distinctive shape, color, or texture information. Furthermore, background clutter and sensor-induced noise often overwhelm the weak target signals, making reliable detection particularly difficult in complex or heterogeneous environments [2].
Current infrared small target detection methods can be broadly classified into two categories: single-frame-based (SFB) and multi-frame-based (MFB) methods. For example, Han et al. [3] combine temporal-domain saliency with the phase spectrum of the quaternion Fourier transform to achieve high-performance small target detection. Sun et al. [4] combine pixel-level detection with a trained LightGBM model to detect targets in individual frames, establishing short-strict and long-loose constraints to distinguish the target trajectory. MFB methods make use of spatial and temporal information to achieve detection, while SFB methods can detect small targets in each individual frame with minimal computational cost.
In the past decades, the majority of SFB methods have been model-driven due to the lack of public datasets, including filtering-based methods [5,6], local-contrast-based methods [7,8,9], and low-rank-based methods [10,11,12]. However, these model-driven methods rely heavily on prior assumptions. Because these prior assumptions do not hold in all complex scenes, such methods cannot be applied reliably in practical detection tasks. Furthermore, model-driven methods are typically susceptible to hyper-parameter sensitivity, which makes the use of handcrafted features and fixed hyper-parameters challenging across diverse scenarios.
With the recent emergence of open-source infrared datasets, data-driven approaches based on deep learning have gained momentum. In particular, convolutional neural networks (CNNs) have demonstrated promising capabilities in learning discriminative features directly from infrared imagery [13,14]. These methods offer improved generalization and scene adaptability. However, CNN-based architectures may suppress weak target features in deeper layers due to their tendency to prioritize prominent background patterns. Moreover, the reliance on large, high-quality training datasets poses a major constraint in infrared remote sensing, where annotated data is costly and scarce.
To address the limitations of CNNs, transformer-based models have been introduced into IRSTD, leveraging their global attention mechanisms to enhance long-range dependency modeling [15,16]. However, transformer architectures often overlook fine-grained local details and are computationally intensive—an issue particularly problematic for small targets embedded in high-resolution remote sensing imagery.
To address the challenges of detecting small infrared targets in complex remote sensing scenes, we propose MFAFNet, a Multi-Feature Attention Fusion Network that integrates the local detail modeling capability of CNNs with the global contextual reasoning of transformers. The network introduces a Multi-Feature Interactive Fusion Module to enhance target saliency through progressive fusion of multi-scale features, a Patch Attention Block to capture inter-pixel dependencies via multi-scale attention, and an improved Asymmetric Contextual Fusion Module (ACFM) to align spatial and semantic information across network layers. This unified design enables MFAFNet to effectively suppress clutter and highlight small targets, significantly improving detection performance in infrared remote sensing imagery. The main contributions of this paper are summarized as follows:
(1) We propose MFAFNet, a novel multi-feature attention fusion network designed specifically for infrared small target detection in remote sensing imagery. By integrating CNN-based local perception and transformer-inspired global reasoning within a unified architecture, MFAFNet effectively addresses the challenges of low signal-to-clutter ratio and limited spatial resolution.
(2) We design a Multi-Feature Interactive Fusion Module (FIFM) that progressively combines the original input with multi-scale local-enhanced feature maps. This fusion enhances the representation of weak targets by emphasizing their saliency and suppressing background interference.
(3) We introduce a Patch Attention Block (PAB) with cascaded channel-spatial attention, enabling the model to capture long-range dependencies and contextual relationships among patches at multiple scales. This improves target localization in cluttered and heterogeneous infrared scenes.
(4) We enhance the Asymmetric Contextual Fusion Module to integrate low-level spatial features with high-level semantic information. By aligning hierarchical feature distributions, this module reduces semantic gaps and supports more accurate target discrimination.
The remainder of this paper is organized as follows. Section 2 reviews related work in infrared small target detection. Section 3 presents the architecture and modules of the proposed MFAFNet in detail. Section 4 reports the experimental setup and comparative evaluation results. Section 5 concludes the paper and outlines future research directions.

2. Related Work

In recent years, infrared small target detection has attracted increasing attention in the remote sensing community due to its critical role in aerial surveillance, early warning, and environmental monitoring. To better understand the limitations of existing methods and motivate our proposed approach, we review representative model-driven algorithms, deep learning-based strategies, and transformer-enhanced techniques.

2.1. Traditional Model-Driven Approaches

In the field of infrared small target detection, conventional model-driven methods have long dominated due to the absence of large-scale, publicly available datasets. These approaches typically rely on handcrafted priors to exploit the local discontinuity between the target and its background. Among them, human visual system (HVS)-inspired methods have received notable attention, aiming to enhance target saliency by modeling local contrast and suppressing background clutter [17].
A representative example is the Local Contrast Measure (LCM) [7], which employs a sliding window to compute pixel-wise contrast and applies adaptive thresholding to locate potential targets. This method has inspired several extensions, such as Improved LCM (ILCM) [18], Relative LCM (RLCM) [19], Weighted Strengthened LCM (WSLCM) [20], and Tri-layer LCM (TLLCM) [21], each aiming to improve robustness across variable scenes. Wei et al. [8] introduced MPCM, a multi-scale patch-based contrast descriptor, which was further refined by incorporating the Local Energy Factor (LEF) [22]. Zhang et al. [9] proposed a combined Local Intensity and Gradient (LIG) framework to improve both target enhancement and background suppression.
While these methods are interpretable and computationally efficient, they rely heavily on strict prior assumptions, such as a uniform background, high SCR, or fixed target morphology, which do not generalize well to real-world remote sensing imagery. As a result, their performance drops significantly in complex scenes with dynamic textures, multi-scale clutter, and unknown noise patterns, as shown in Figure 1.
To overcome these limitations, we adopt a data-driven, learning-based framework that abandons fixed priors and instead learns discriminative features directly from annotated samples across diverse backgrounds and target conditions.

2.2. Data-Driven Deep Learning Approaches

With the increasing availability of annotated infrared datasets, deep learning-based methods have rapidly emerged as alternatives to traditional algorithms. These data-driven methods can be broadly categorized into detection-based and segmentation-based paradigms.
The pioneering work of Liu et al. [23] introduced a multilayer perceptual detection network for small targets in infrared imagery, marking the transition to deep learning in SIRST. More recent efforts favor segmentation-based frameworks due to their ability to generate pixel-level target masks, which are particularly suitable for small and ambiguous targets in cluttered remote sensing environments. Dai et al. [14] proposed segmentation networks with asymmetric context modules to fuse shallow and deep features. Wang et al. [13] decomposed the detection task into two subtasks, reducing missed detections and false alarms, using generative adversarial networks (GANs) to optimize the tradeoff.
Other studies have explored hybrid strategies that integrate handcrafted priors with learned features. For instance, Hou et al. [24] constructed a mapping network from handcrafted descriptors to probabilistic maps, while Yang et al. [25] designed a depth-aware fusion module to combine original and smoothed features. Chen et al. [26] employed a supervised attention module trained on a target spread map to suppress irrelevant background noise.
The integration of attention mechanisms and transformers has further advanced IRSTD performance. Liu et al. [27] introduced a transformer with a feature enhancement module to model long-range feature dependencies. Li et al. [28] proposed a cascaded channel-spatial attention module (CSAM) to selectively enhance multi-scale features. IAANet [15] adopts a coarse-to-fine pipeline where CNNs locate rough regions and transformers refine pixel-level details. Similarly, AGPCNet [16] combines convolutional encoding with non-local attention for contextual refinement.
Despite their progress, existing CNN-based methods often suffer from limited receptive fields, which restrict their ability to model long-range dependencies. Moreover, as target features become progressively weaker in deep layers, they are often overwhelmed by dominant background structures. These methods also struggle with balancing generalization and precision, particularly under low SCR and scene heterogeneity.
To mitigate these issues, we design a multi-feature fusion mechanism that explicitly preserves and enhances small target features across network layers. Our architecture leverages both local detail modeling via CNNs and global semantic reasoning through attention mechanisms to maintain target discriminability throughout the network.

2.3. Transformer-Based Methods in Vision

Inspired by the success of self-attention in NLP [29], researchers have extended transformer architectures to computer vision tasks. Vision Transformer (ViT) [30] applies non-overlapping image patches for classification, while non-local blocks [31] generalize long-range interaction modeling in feature space. Although transformers are effective in capturing global context, they typically require large datasets and introduce substantial computational overhead. Moreover, their limited capacity for modeling local structures poses challenges for detecting small, dim targets.
To mitigate these issues, hybrid models combining CNNs with transformer-based attention have been proposed [32,33]. For instance, Liu et al. [34] leverage transformer layers to explore semantic dependencies while enhancing feature discriminability through channel-aware refinement. These hybrid networks aim to balance global perception and local precision, making them increasingly attractive for infrared small target detection in remote sensing scenarios.
Although transformers offer strong global modeling capacity, they typically require large-scale training data and introduce significant computational cost. Their attention mechanisms also tend to underperform in capturing fine-grained local details, which are crucial for detecting dim and tiny targets. In remote sensing, where high-resolution imagery and computational constraints are prevalent, these limitations hinder practical deployment.
We propose a hybrid fusion network, MFAFNet, that integrates CNNs and transformer-inspired attention in a layered and lightweight manner. Instead of relying on pure self-attention, we use multi-scale patch attention and asymmetric fusion modules to selectively aggregate spatial details and semantic contexts, ensuring both efficiency and precision in infrared remote sensing imagery.

3. Methods

In this section, we present the architecture of the proposed Multi-Feature Attention Fusion Network for infrared small target detection. We begin by describing the overall framework, followed by detailed explanations of the Multi-Feature Interactive Fusion Module and the Patch Attention Block. Finally, we introduce the enhanced Asymmetric Contextual Fusion Module designed to refine semantic integration and produce accurate detection results.

3.1. Overall Architecture

The overall architecture of the proposed method is illustrated in Figure 2. It consists of three main modules: a multi-feature fusion module to extract the features, a patch attention block to obtain long-range associations, and an improved asymmetric fusion mapping network to up-sample and generate the detection results.
As shown in Figure 2, MFAFNet takes three inputs: the raw infrared image, a local intensity image, and a local gradient image. The latter two are derived via human visual system-inspired preprocessing, which enhances small target saliency. These inputs are passed through a ResNet-50 backbone, where multi-feature fusion is embedded between residual stages through the FIFM module, allowing progressive enhancement of spatial and morphological features. The resulting feature maps are fed into the PAB module, which models inter-pixel dependencies across spatially distributed patches to capture long-range contextual associations. Finally, a U-Net-style skip connection is employed to merge low-level and high-level features. These are processed by the ACFM, which addresses feature misalignment and generates a refined binary detection map through fully connected layers. The final output highlights the spatial location of infrared small targets.

3.2. Multi-Feature Interactive Fusion Module

In ResNet-50, the image is fed into a sequence of convolutional layers with down-sampling, which tends to submerge the infrared small target. Inspired by HVS-based methods, we introduce local intensity and gradient features into the neural network, which enhance the saliency of the target and improve detection performance. The intensity and gradient features are fused with the morphological features of the original image. The fusion is carried out through the redesigned multi-feature interactive fusion module.
In general, the grayscale distribution of an infrared small target is approximately centrosymmetric and radiates outward from the center, which indicates that the small target can be described by a two-dimensional (2-D) Gaussian function. Consequently, local intensity and gradient operations can effectively distinguish candidate targets. Figure 3 shows the intensity and gradient features of an infrared small target and some background. It can be observed that the majority of pixels belonging to the target are enhanced, while those belonging to clutter are suppressed. Nevertheless, the grayscale distributions of certain background clutter are similar to those of the small target, so the processed results still contain a substantial amount of background clutter. Conversely, a single local feature can also enhance background pixels, which introduces additional interference. As a result, we propose the multi-feature fusion module, which comprehensively utilizes multiple features and incorporates diverse constraints. The intensity and gradient feature calculations are defined as follows:
$$ I = \max(0, f_0 - \bar{f}) \quad (1) $$
$$ G = \begin{cases} \sum_{i=1}^{4} G_i, & \text{if } \frac{G_{min}}{G_{max}} > k \\ 0, & \text{otherwise} \end{cases} \quad (2) $$
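For clarity, the following minimal sketch shows one way to compute the two local feature maps in PyTorch. The 3 × 3 mean window, the 2-pixel step used for the directional differences, and the default value of k are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def local_intensity(img, win=3):
    # Equation (1): keep only pixels brighter than their local neighborhood mean
    mean = F.avg_pool2d(img, win, stride=1, padding=win // 2)
    return torch.clamp(img - mean, min=0)

def local_gradient(img, k=0.5, shift=2):
    # Equation (2): directional differences toward the center from four directions
    pad = F.pad(img, (shift, shift, shift, shift), mode='replicate')
    h, w = img.shape[-2:]
    up    = img - pad[..., 0:h, shift:shift + w]
    down  = img - pad[..., 2 * shift:2 * shift + h, shift:shift + w]
    left  = img - pad[..., shift:shift + h, 0:w]
    right = img - pad[..., shift:shift + h, 2 * shift:2 * shift + w]
    grads = torch.stack([up, down, left, right], dim=0).clamp(min=0)
    g_min = grads.min(dim=0).values
    g_max = grads.max(dim=0).values
    # Keep the summed gradient only where all four directions respond consistently
    mask = (g_min / (g_max + 1e-6)) > k
    return grads.sum(dim=0) * mask

# img is a (N, 1, H, W) float tensor holding the raw infrared image
```

Both maps have the same spatial size as the input, so they can be stacked with the raw image and fed to the three backbone streams described in Section 3.1.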
Our model employs a convolutional neural network (CNN), adapting ResNet-50 as the underlying architecture. The key distinction lies in the multi-feature interactive fusion module, which extracts comprehensive features pertaining to the target. The feature-fusion strategy in each layer enables the efficient fusion of the local salient features and the original morphological features, thereby enhancing the semantic content of the infrared small target in the feature map. In the current study, the fusion strategy is implemented in an additive mode: each local feature is fused with the original feature map, and the results are then superimposed.
The multi-feature interactive fusion module is employed in the feature extraction, as shown in Figure 4. The FIFM fuses the original image and the two additional local feature images at varying depths. In our method, the original, intensity, and gradient feature maps are employed as the input. Since the original feature map and the two additional local feature maps share the same sampling rate, the FIFM receives three input feature maps $O_t, I_t, G_t \in \mathbb{R}^{C \times W \times H}$ of identical size. The FIFM first fuses the original feature map with each of the two local features. The two fusion results shown in Figure 4 are then added to form the output $O'_t$ of the FIFM in layer $t$, which serves as the input $O_{t+1}$ of convolutional layer $t+1$; the other two fusion results shown in Figure 4 are employed as the inputs $I_{t+1}$ and $G_{t+1}$ of convolutional layer $t+1$. The maps $O_{t+1}, I_{t+1}, G_{t+1} \in \mathbb{R}^{C \times W \times H}$ have the same size as $O_t, I_t, G_t$.
$$ O'_t = O_t \cdot I_t + O_t \cdot G_t \quad (3) $$
$$ I'_t = O_t \cdot I_t \quad (4) $$
$$ G'_t = O_t \cdot G_t \quad (5) $$
where $O_t$, $I_t$, and $G_t$ are the original feature map and the two local feature maps in layer $t$, and $O'_t$, $I'_t$, and $G'_t$ are the corresponding fused feature maps.
The feature fusion layer injects the rich features of the small target in intensity images and gradient images into the original feature map. Concurrently, the noise and clutter can be mitigated by the disparity between the two local feature images. To further stabilize this process, we introduce a threshold parameter k when computing gradient features, which physically corresponds to the minimum detectable edge contrast in infrared imagery. Gradients below this threshold are typically caused by background noise rather than actual targets. The value of k is determined based on the gradient magnitude histogram, ensuring genuine small target edges (spanning 2 × 2–3 × 3 pixels) are retained while weak clutter is suppressed.
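A minimal PyTorch sketch of the fusion step in Equations (3)-(5) is given below. The module and variable names are ours, and the convolutions of the surrounding residual stages are omitted.

```python
import torch.nn as nn

class FIFM(nn.Module):
    """Sketch of the multi-feature interactive fusion of Equations (3)-(5)."""
    def forward(self, o_t, i_t, g_t):
        # o_t, i_t, g_t: same-sized maps from the original, intensity, and gradient streams
        i_next = o_t * i_t          # Equation (4): inject intensity saliency
        g_next = o_t * g_t          # Equation (5): inject gradient saliency
        o_next = i_next + g_next    # Equation (3): fused map for the next backbone stage
        return o_next, i_next, g_next
```

The fused map o_next replaces the original-stream input of layer t + 1, while i_next and g_next continue along the intensity and gradient streams.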

3.3. Patch Attention Block

The patch attention block is the fundamental module of the network, as shown in Figure 5. The PAB incorporates a multi-scale non-local block with a cascaded channel and spatial attention module (CSAM).
The limited receptive field of CNNs is the reason why a context module should be introduced to obtain the long-range dependencies between pixels. The non-local network can use global computation to capture the correlation between each pixel and all other pixels. As shown in Figure 6, the non-local block can be embedded into a neural network and, as a key component, effectively improves performance. Although the non-local block is able to aggregate pixels of the same type, its computational efficiency is low.
The non-local block is applied directly to the original feature map $X \in \mathbb{R}^{C \times W \times H}$, which incurs a high space complexity. Meanwhile, small targets occupy only a few pixels in the image. Consequently, the multi-scale non-local block is employed to reduce the computation: the feature map $X$ is divided into $s \times s$ patches of size $w \times h$, where $w = \frac{W}{s}$ and $h = \frac{H}{s}$. In our method, the patch factor $s$ is
$$ s = [10, 6, 5, 4] \quad (6) $$
Subsequently, each patch is fed into a non-local block to calculate the dependencies of pixels within a local range. Patches from the same feature map share weights, and the outputs are concatenated to re-form a locally associated feature map $P \in \mathbb{R}^{C \times W \times H}$. Replacing the full feature map $X$ with local patches reduces the receptive field of the network: dependencies are computed between pixels within each local patch, with the objective of aggregating pixels of the same type. Within each patch, the local feature map can distinguish the type of each pixel, reducing the influence of noise and clutter. In this way, the algorithmic complexity and computation time are markedly reduced by computing the local association in each patch. The chosen patch sizes (4, 5, 6, and 10) are not arbitrary but are designed to match the scale of infrared small targets. After downsampling, one feature pixel corresponds to roughly 8 × 8 pixels of the original image, meaning that patches of size 4–6 effectively cover targets of 2 × 2–3 × 3 pixels plus their contextual surroundings, while larger patches (e.g., 10) extend the field of view for robustness in cluttered scenes.
In each patch of the same size $w \times h$, the outputs of the non-local block are concatenated into a new associated feature map $P \in \mathbb{R}^{C \times W \times H}$. Meanwhile, the input feature map $X$ is also fed into multiple PABs with different patch scales in parallel. For each scale, the result can be formulated as $\{PAB_{s_1}, PAB_{s_2}, \dots\}$, where $s_i$ is the patch factor. The locally associated feature maps and the original feature map $X$ are then combined along the channel axis to generate the new feature map $L \in \mathbb{R}^{((n+1) \times C) \times W \times H}$, expanding the feature map in the channel dimension.
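The patch-wise non-local computation can be sketched as follows. The `PatchNonLocal` wrapper and the assumption that the feature map dimensions are divisible by the patch factor are ours; `non_local` stands for any standard non-local block [31] mapping (N, C, h, w) to (N, C, h, w).

```python
import torch.nn as nn

class PatchNonLocal(nn.Module):
    """Sketch: apply a weight-shared non-local block inside each of the s x s patches."""
    def __init__(self, non_local, s):
        super().__init__()
        self.non_local = non_local
        self.s = s

    def forward(self, x):                        # x: (N, C, W, H), W and H divisible by s
        n, c, wdim, hdim = x.shape
        s, w, h = self.s, wdim // self.s, hdim // self.s
        # Fold the s x s grid of patches into the batch dimension
        patches = (x.view(n, c, s, w, s, h)
                    .permute(0, 2, 4, 1, 3, 5)
                    .reshape(n * s * s, c, w, h))
        patches = self.non_local(patches)        # shared weights across all patches
        # Undo the folding to recover the locally associated feature map P
        return (patches.view(n, s, s, c, w, h)
                       .permute(0, 3, 1, 4, 2, 5)
                       .reshape(n, c, wdim, hdim))
```

Running this wrapper for each patch factor in s = [10, 6, 5, 4] and concatenating the outputs with the input X along the channel axis yields the (n + 1) × C-channel map L described above.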
For the concatenated feature map $L \in \mathbb{R}^{((n+1) \times C) \times W \times H}$, a cascaded channel and spatial attention module is employed to achieve adaptive feature enhancement. As illustrated in Figure 7, the concatenated feature map is transformed into the final feature map $L' \in \mathbb{R}^{C \times W \times H}$. The CSAM comprises two cascaded attention units: a one-dimensional channel attention and a two-dimensional spatial attention. The channel attention can be obtained as follows:
$$ L' = \sigma\left( \mathrm{MLP}(P_{max}(L)) + \mathrm{MLP}(P_{avg}(L)) \right) \otimes L \quad (7) $$
where $\otimes$ denotes element-wise multiplication and $\sigma$ denotes the sigmoid function. $P_{max}$ and $P_{avg}$ denote max pooling and average pooling, respectively.
In contrast to the channel attention, the spatial attention can be succinctly described as follows:
$$ L'' = \sigma\left( f^{7 \times 7}\left( \left[ P_{max}(L'), P_{avg}(L') \right] \right) \right) \otimes L' \quad (8) $$
where $f^{7 \times 7}$ represents a convolution with a 7 × 7 kernel. The pooling operations are applied along the spatial and channel dimensions to generate the weight matrices $M_C \in \mathbb{R}^{C \times 1 \times 1}$ and $M_S \in \mathbb{R}^{1 \times W \times H}$, which are the results of the channel attention and the spatial attention, respectively.
It is worth noting that, although non-local attention is typically high in complexity, the patch-wise multi-scale design used in PAB restricts computation to local regions, thereby reducing complexity from quadratic to linear with respect to feature map size and ensuring practical efficiency.
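The cascaded attention of Equations (7) and (8) can be sketched as follows. The position of the 1 × 1 channel-reduction convolution and the reduction ratio r are our assumptions, since the paper does not specify them.

```python
import torch
import torch.nn as nn

class CSAM(nn.Module):
    """Sketch of the cascaded channel-spatial attention (Equations (7) and (8))."""
    def __init__(self, in_ch, out_ch, r=16):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, out_ch, kernel_size=1)   # ((n+1)C -> C) channels
        self.mlp = nn.Sequential(nn.Conv2d(out_ch, out_ch // r, 1), nn.ReLU(),
                                 nn.Conv2d(out_ch // r, out_ch, 1))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, l):
        l = self.reduce(l)
        # Channel attention, Equation (7): weight M_C of shape (C, 1, 1)
        m_c = torch.sigmoid(self.mlp(l.amax(dim=(2, 3), keepdim=True)) +
                            self.mlp(l.mean(dim=(2, 3), keepdim=True)))
        l = m_c * l
        # Spatial attention, Equation (8): weight M_S of shape (1, W, H)
        pooled = torch.cat([l.amax(dim=1, keepdim=True),
                            l.mean(dim=1, keepdim=True)], dim=1)
        m_s = torch.sigmoid(self.spatial(pooled))
        return m_s * l
```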

3.4. Asymmetric Contextual Fusion Module

To integrate the subtle low-level features and the coarse semantic features from the higher layers, we develop a novel asymmetric contextual fusion module inspired by the ACM. As shown in Figure 8, the low-level semantics $X_l$ and the deep-level semantics $X_d$ are fed into the ACFM as input. Unlike ACM, our approach first fuses the low-level and deep-level semantics and then generates the pixel and channel attention weight matrices to guide the fusion.
$$ W_{ca} = \sigma\left( B\left( W_2\, \delta\left( B\left( W_1\, P(X_d) \right) \right) \right) \right) \quad (9) $$
$$ W_{pa} = \sigma\left( B\left( W_2\, \delta\left( B\left( W_1\, X_l \right) \right) \right) \right) \quad (10) $$
The channel-attention mechanism in Equation (9) uses global average pooling followed by 1 × 1 convolutions for dimensionality reduction, while the point-attention mechanism is defined in Equation (10). Here $P$, $B$, $\delta$, and $\sigma$ denote global average pooling, batch normalization, the rectified linear unit, and the sigmoid function, respectively. Based on the weight matrices $W_{ca}$ and $W_{pa}$, the low-level and deep-level semantics are unified by applying a separate constraint to each level. As illustrated in Equation (11), our module resolves the discrepancy between the low-level semantics $X_l$ and the deep-level semantics $X_d$ and reduces the difference caused by the separate constraints.
$$ X_{ACFM} = W_{ca} \otimes X_l + W_{pa} \otimes X_d \quad (11) $$
As shown in Figure 9, the heatmap comparison provides direct qualitative evidence of the effectiveness of the ACFM. Before fusion, the low-level spatial features exhibit strong responses along local textures and background clutter, which leads to noisy activations and potential false alarms. In contrast, the high-level semantic features focus more on the approximate target location but suffer from blurred boundaries and incomplete structural representation, reflecting a semantic gap between the two levels. After applying ACFM, the fused heatmap demonstrates a clear improvement: activations are concentrated within the true target regions, boundaries become sharper, and irrelevant background responses are significantly suppressed. This indicates that ACFM successfully aligns low-level details with high-level semantics, thereby reducing the semantic gap and enhancing the discriminative capability of the network in cluttered infrared environments.
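A compact sketch of the ACFM computation in Equations (9)-(11) is shown below; the channel-reduction ratio r and the exact widths of the 1 × 1 convolutions are illustrative assumptions.

```python
import torch.nn as nn

class ACFM(nn.Module):
    """Sketch of the asymmetric contextual fusion of Equations (9)-(11)."""
    def __init__(self, channels, r=4):
        super().__init__()
        mid = channels // r
        # Equation (9): channel weights from the globally pooled deep semantics X_d
        self.ca = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(),
                                nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels), nn.Sigmoid())
        # Equation (10): pixel weights from the shallow details X_l
        self.pa = nn.Sequential(nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(),
                                nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels), nn.Sigmoid())

    def forward(self, x_l, x_d):
        w_ca = self.ca(x_d)
        w_pa = self.pa(x_l)
        return w_ca * x_l + w_pa * x_d   # Equation (11)
```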

3.5. Loss Function

To optimize the performance of MFAFNet in detecting small infrared targets under complex backgrounds, we construct a hybrid loss function that incorporates both pixel-level accuracy and structure-aware constraints. Given the severe foreground-background imbalance and the sparse distribution of small targets, traditional binary cross-entropy loss is insufficient to guide precise learning. To address this, we integrate Soft-IoU loss and a semantic consistency loss, and introduce a feature supervision loss at intermediate layers to improve training stability and target localization.
(1) Binary Cross-Entropy (BCE) Loss
Binary Cross-Entropy is used to supervise pixel-level classification. For a predicted probability map $\hat{Y}$ and a ground truth label map $Y$, the BCE loss is defined as:
$$ L_{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ Y_i \cdot \log \hat{Y}_i + (1 - Y_i) \cdot \log\left( 1 - \hat{Y}_i \right) \right] \quad (12) $$
where N is the total number of pixels. This term ensures the pixel-wise correctness of the final binary prediction.
(2) Soft-IoU Loss
To address the foreground-background imbalance and better model region-level accuracy, we incorporate Soft-IoU loss, which directly optimizes the Intersection-over-Union between the prediction and ground truth:
$$ L_{IoU} = 1 - \frac{\sum_i \hat{Y}_i \cdot Y_i}{\sum_i \left( \hat{Y}_i + Y_i - \hat{Y}_i \cdot Y_i \right) + \epsilon} \quad (13) $$
where ϵ is a small constant to avoid division by zero. Soft-IoU encourages spatial alignment between predicted blobs and actual target regions, which is particularly critical for sparse and small objects.
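A direct PyTorch implementation of Equation (13) could look like the following sketch, assuming that `pred` already holds sigmoid probabilities in [0, 1].

```python
def soft_iou_loss(pred, target, eps=1e-6):
    # Equation (13): soft intersection over soft union, averaged over the batch
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = (pred + target - pred * target).sum(dim=(1, 2, 3))
    return (1.0 - inter / (union + eps)).mean()
```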
(3) Multi-Scale Feature Supervision Loss
To guide the intermediate feature learning in FIFM and PAB, we introduce a deep supervision strategy. Let $\hat{Y}_l$ be the auxiliary output of the intermediate feature map from layer $l$, and let $Y_l$ be the ground truth label map resized to the same resolution. The supervision loss is defined as:
$$ L_{feat} = \sum_{l=1}^{L} \alpha_l \cdot L_{BCE}\left( \hat{Y}_l, Y_l \right) \quad (14) $$
where $\alpha_l$ is the weighting factor for each layer, and $L$ is the total number of supervised layers. This term enforces target-aware feature enhancement at different depths.
(4) Semantic Alignment Loss
The Asymmetric Contextual Fusion Module aligns low-level spatial features and high-level semantic information. To reduce semantic inconsistency, we introduce a semantic alignment constraint:
$$ L_{align} = \left\| \mu\left( C_1(F_{low}) \right) - \mu\left( C_2(F_{high}) \right) \right\|_2^2 \quad (15) $$
where $F_{low}$ and $F_{high}$ are the low-level and high-level features, $C_1$ and $C_2$ are projection layers (e.g., 1 × 1 convolutions), and $\mu(\cdot)$ is a non-linear activation (e.g., ReLU). This loss enforces consistency in semantic representations across scales.
The total loss function used to train MFAFNet is a weighted combination of the above components:
$$ L_{total} = \lambda_1 L_{BCE} + \lambda_2 L_{IoU} + \lambda_3 L_{feat} + \lambda_4 L_{align} \quad (16) $$
where $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are hyperparameters that control the contribution of each loss term.
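The combination in Equation (16) can be sketched as follows. It reuses the `soft_iou_loss` sketch above, assumes the projection layers C1 and C2 of the alignment term have already been applied so that the two feature maps share a common shape, and uses the balanced unit weights reported in the ablation study as defaults.

```python
import torch.nn.functional as F

def hybrid_loss(pred, aux_preds, f_low, f_high, target,
                lambdas=(1.0, 1.0, 1.0, 1.0), alphas=None):
    """Sketch of the total training loss of Equation (16)."""
    l_bce = F.binary_cross_entropy(pred, target)                          # Equation (12)
    l_iou = soft_iou_loss(pred, target)                                   # Equation (13)
    alphas = alphas or [1.0] * len(aux_preds)
    l_feat = sum(a * F.binary_cross_entropy(                              # Equation (14)
                     p, F.interpolate(target, size=p.shape[-2:], mode='nearest'))
                 for a, p in zip(alphas, aux_preds))
    l_align = (F.relu(f_low) - F.relu(f_high)).pow(2).mean()              # Equation (15), mean-squared form
    l1, l2, l3, l4 = lambdas
    return l1 * l_bce + l2 * l_iou + l3 * l_feat + l4 * l_align
```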

4. Experiments

In this section, we experimentally assess the efficacy of MFAFNet. We first provide an overview of the implementation details. Subsequently, the experimental setup is presented, including the evaluation metrics, public datasets, and state-of-the-art methods used for comparison. In parallel, we conduct ablation studies to evaluate the contribution of each module. Finally, the proposed method is compared with state-of-the-art methods both visually and numerically.

4.1. Implementation Details and Setting

The input images used in our method vary in size; therefore, a preprocessing module is employed to uniformly resize all three input types (raw infrared, intensity, and gradient images) to a resolution of 256 × 256. These preprocessed images are then stacked into batches and fed into the backbone network for feature extraction. The feature extraction backbone is based on ResNet-50, and three parallel extraction streams, one per image modality, are constructed using identical architectures. Each stream begins with a convolutional layer with a kernel size of 7, followed by three residual blocks. Notably, each residual layer is embedded with a Multi-Feature Interactive Fusion Module to enhance feature integration across modalities. The output feature map has a spatial resolution of 32 × 32. Within the Patch Attention Block, multi-scale patches of varying sizes (e.g., 1 × 1, 3 × 3, and 5 × 5) are utilized to model contextual dependencies at different receptive fields. Given the significant class imbalance between small targets and background regions, we adopt the Soft-IoU loss function to improve training stability and enhance sensitivity to small foreground objects. The entire network is trained for 100 epochs using the AdaGrad optimizer, with an initial learning rate of 0.001 and a batch size of 16, without any data augmentation strategies. All experiments are conducted on a workstation equipped with an NVIDIA RTX 4090 GPU.
In this study, we adopt commonly used pixel-level evaluation metrics, including Precision, Recall, Intersection over Union (IoU), F-measure, and Area Under the Curve (AUC), to comprehensively assess the performance of infrared small target detection methods on benchmark datasets.
The IoU is a pixel-level evaluation metric, which is calculated by the ratio of the intersection and the union areas between the predicted result and the label.
$$ IoU = \frac{\text{Area of Overlap}}{\text{Area of Union}} \quad (17) $$
Precision and Recall represent the ratio of correctly classified pixels to predicted positive pixels and to labelled positive pixels, respectively. The F-measure is employed to assess the balance between Precision and Recall.
$$ Precision = \frac{TP}{TP + FP} \quad (18) $$
$$ Recall = \frac{TP}{TP + FN} \quad (19) $$
$$ F\text{-}measure = \frac{2 \times Precision \times Recall}{Precision + Recall} \quad (20) $$
Furthermore, the area under the curve (AUC) is introduced as a quantitative metric for receiver operating characteristic (ROC) to evaluate the comprehensive potential of a method. The true positive rate (TPR) and the false positive rate (FPR) are defined as follows:
$$ FPR = \frac{FP}{FP + TN} \quad (21) $$
$$ TPR = \frac{TP}{TP + FN} \quad (22) $$
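For reference, the pixel-level metrics of Equations (17)-(20) can be computed from binary masks as in the following sketch.

```python
import numpy as np

def pixel_metrics(pred_mask, gt_mask, eps=1e-9):
    """Pixel-level IoU, Precision, Recall, and F-measure for binary masks."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    iou = tp / (tp + fp + fn + eps)                                  # Equation (17)
    precision = tp / (tp + fp + eps)                                 # Equation (18)
    recall = tp / (tp + fn + eps)                                    # Equation (19)
    f_measure = 2 * precision * recall / (precision + recall + eps)  # Equation (20)
    return iou, precision, recall, f_measure
```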
Since there are few public infrared small target detection datasets, part of the available ISTD data is synthesized artificially by compositing targets onto backgrounds. The widely used SIRST dataset [14] is a real IRST dataset with high-quality images, but it contains only 427 images in total, which leads to problems such as unstable network training and over-fitting. Hence, we consider the SIRST-Aug dataset [16], which is derived from SIRST through augmentation (clipping, inversion, displacement, etc.). SIRST-Aug includes 8525 training images and 545 test images with a fixed resolution of 256 × 256, which is sufficient to train neural networks.
On the other hand, we also adopt the IRSTD-1k dataset [35], captured by an infrared camera in the real world. IRSTD-1k comprises 1000 infrared images with pixel-level labels, all at a resolution of 512 × 512. We divide the IRSTD-1k dataset into a training set, a validation set, and a test set in proportions of 50%, 30%, and 20%, respectively. The validation set is used exclusively for hyperparameter tuning and model selection, while the test set is kept unseen during training to provide an unbiased performance evaluation. The dataset includes different kinds of targets, such as drones, vessels, and vehicles, and covers a wide variety of real scenes and backgrounds, including the sea, rivers, the sky, and cities, with random clutter and noise.
Figure 10 shows representative images from the SIRST-Aug and IRSTD-1k datasets, and the details of the two datasets are listed in Table 1. By comparing results on these two datasets, we can evaluate the effectiveness of the proposed method on real and augmented data, respectively.

4.2. Ablation Study

To analyze the rationality of each component of MFAFNet, we run our models with different settings.
Effectiveness of local features. As shown in Table 2, we verify the effectiveness of the two local features by removing the intensity feature and the gradient feature from the FIFM in turn, while keeping the same segmentation task, evaluation metrics, and dataset.
From Table 2, we can see that introducing both the gradient feature and the intensity feature into the same backbone improves the performance of the model on the same dataset. This improvement can be attributed to two factors. Firstly, a local feature enhances the infrared small target so that it exhibits greater contrast with the background. Secondly, a single local feature may similarly enhance noise and clutter, resulting in a slight performance degradation; to address this, we introduce a second, complementary local feature to constrain the enhanced clutter at an appropriate complexity.
Effectiveness of local patch size in PAB. Given that the feature map size is 32 × 32 after three down-sampling operations, each pixel in the feature map represents an 8 × 8 region of the original image. To investigate the effect of the patch size, multiple sets of comparison experiments were set up, based on the hypothesis that an appropriate patch should cover the entire target area together with some background. As shown in Table 3, the PAB with the combination of patch sizes achieves the best performance, which indicates that using different patch sizes facilitates more comprehensive coverage of the diverse targets within the dataset.
Effectiveness of ACFM. To explicitly investigate the contribution of the ACFM, which is designed to bridge shallow and deep features via pixel-wise and channel-wise attention, we conduct a dedicated ablation study. While the previous section validates the effectiveness of the FIFM and the PAB, the standalone impact of ACFM was previously embedded within the complete pipeline, making it difficult to discern its individual role. To isolate the effect of the Asymmetric Contextual Fusion Module (ACFM) and examine its interaction with other components, we design four variants of MFAFNet: (1) w/o ACFM, which replaces the module with a direct summation of low-level and deep-level features without attention; (2) w/o ACFM + FIFM, which removes both ACFM and FIFM to assess the synergy between semantic alignment and multi-modal fusion; (3) w/o ACFM + PAB, which eliminates ACFM and PAB to evaluate the role of ACFM in supporting global-local integration; and (4) w/o ACFM + FIFM + PAB, which retains only the backbone and segmentation head, removing all three modules to serve as a minimal reference model. We compare these variants against the full MFAFNet using the SIRST-Aug dataset. The results are shown in Table 4.
From Table 4, removing only the ACFM module results in a performance drop of 4.1% in F-measure and 4.3% in mIoU, demonstrating its significant standalone impact on detection quality. The observed reduction confirms that the attention-based asymmetric fusion of shallow and deep features improves both target localization and boundary delineation, especially in infrared imagery where semantic granularity is unevenly distributed. Furthermore, when ACFM is removed in conjunction with FIFM or PAB, the performance deteriorates even further. This indicates that ACFM is not only individually effective but also works synergistically with other modules. For example, without ACFM and FIFM, the mIoU drops to 0.6802, suggesting that FIFM benefits from the refined semantic alignment provided by ACFM. Likewise, removing ACFM and PAB causes a greater performance decline than removing either one alone, highlighting that ACFM enhances the contextual modeling effect of PAB by providing more semantically aligned feature inputs.
Ablation on Hybrid Loss Weights. To investigate the effect of the weighting factors (λ1–λ4) in Equation (16), we conduct an ablation study on the SIRST-Aug dataset. Specifically, we vary the weights assigned to each component loss while keeping the others fixed. The results are summarized in Table 5.
From Table 5, it is clear that the balanced setting (all weights = 1) achieves the best tradeoff across metrics. Increasing the contribution of a single term often biases the optimization towards pixel accuracy or region alignment but degrades overall generalization. This confirms that the hybrid loss benefits from a synergistic balance, and further hyperparameter tuning yields only marginal improvements.

4.3. Comparison to Excellent Methods

To comprehensively evaluate the effectiveness of the proposed MFAFNet, we compare it with a series of representative infrared small target detection methods specifically designed for remote sensing scenarios. The baseline methods include both model-driven approaches, such as LEF [22], WSLCM [20], PSTNN [12], and MSLSTIPT [36], which rely on filtering, local contrast, or low-rank priors; and data-driven deep learning methods, including AGPCNet [16], DNANet [28], ACM [14], MSHNet [37], RISTDnet [24], Trans-IRSD [27], IRSAM [38], STPSA-Net [39], and MiM-ISTD [40]. All comparative models are implemented using the parameter settings provided in their respective original papers.
Among the classical approaches, LEF introduces a local energy factor to jointly evaluate local dissimilarity and brightness contrast, while WSLCM incorporates a strengthened local contrast measure and a region-aware weighting function to suppress noise in complex backgrounds. RIPT models target-background separation as a robust low-rank tensor recovery problem using patch-based tensor representation, and PSTNN improves upon RIPT by introducing a non-convex low-rank regularizer and a background-aware prior map, enabling a better balance between detection accuracy and computational efficiency. MSLSTIPT further explores spatial-temporal structure by leveraging multisubspace learning in tensor form, effectively handling heterogeneous scenes.
In the realm of learning-based methods, ACM proposes an asymmetric contextual modulation structure that fuses global feedback and channel attention to highlight small targets. AGPCNet constructs an attention-guided pyramid framework, combining patch-level association and global context modeling through a multi-scale structure. DNANet introduces a dense nested attention design to maintain target features across layers, while MSHNet focuses on optimizing the loss function via a scale- and location-sensitive formulation (SLS) to improve detection robustness. RISTDnet integrates handcrafted features with deep CNNs to build a likelihood estimation framework suitable for low-SNR conditions. Trans-IRSD employs a transformer backbone with feature enhancement modules to extract global dependencies and mitigate missed detections in complex infrared backgrounds. IRSAM adapts the Segment Anything Model (SAM) to thermal imaging, combining Perona-Malik diffusion and a granularity-aware decoder to bridge the domain gap. STPSA-Net encodes infrared images into semantic tokens and applies patch-wise spatial attention to enhance representation and localization.
Furthermore, MiM-ISTD adopts a state-space model inspired by Mamba to overcome the quadratic complexity of traditional transformers. It introduces a nested Mamba-in-Mamba structure, where Outer Mamba captures global dependencies across visual “sentences” (patches), and Inner Mamba captures fine-grained local structures within sub-patches. This design ensures efficient inference on high-resolution infrared images, achieving substantial reductions in computational cost and memory usage while maintaining state-of-the-art accuracy.
(1) Quantitative evaluation: The results of the different methods on two public datasets, SIRST-Aug and IRSTD-1k, are presented in Table 6. The maximum value of each column is indicated in bold. For the model-driven methods, the obtained prediction maps indicate the probability of detecting targets. Hence, we set an adaptive threshold T to remove low-response areas:
$$ T = \max\left[ \max(G) \times 0.7,\ 0.5 \times \sigma(G) + avg(G) \right] \quad (23) $$
where $\max(G)$ represents the maximum value of the output image, and $\sigma(G)$ and $avg(G)$ denote its standard deviation and mean value, respectively. For data-driven methods, we adopt the fixed threshold values used in their original papers. Following common evaluation practice in infrared small target detection, we adopt adaptive thresholding (Equation (23)) for model-driven methods and fixed thresholds from the original papers for deep learning-based methods. This choice reflects the methodological nature of each category: model-driven methods typically produce continuous response maps requiring post-hoc threshold tuning, while learning-based methods are trained end-to-end with fixed output heads and thresholds jointly optimized during training. Using fixed thresholds for deep models ensures consistency with their reported performance and published baselines. We acknowledge that the adaptive thresholding might not fully compensate for the sensitivity of handcrafted methods, potentially leading to lower recall. However, our goal here is not to penalize model-driven methods, but to provide a realistic and reproducible comparative benchmark under commonly accepted evaluation conditions.
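Applied to a model-driven response map G, the adaptive threshold of Equation (23) amounts to the following sketch.

```python
import numpy as np

def adaptive_threshold(response):
    # Equation (23): keep only responses above the adaptive threshold T
    t = max(response.max() * 0.7, 0.5 * response.std() + response.mean())
    return (response > t).astype(np.uint8)
```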
From the data presented in Table 6, it can be observed that MFAFNet demonstrates a clear superiority over model-driven methods. This is because model-driven methods are typically designed for specific scenarios, and their performance is highly dependent on manually selected hyperparameters (e.g., patch size, threshold value), which limits their generalizability. Moreover, most model-driven methods exhibit high precision and low recall, indicating that they correctly detect only a limited number of pixels within the target region rather than the entire target. As shown in Figure 11, Figure 12, Figure 13 and Figure 14, model-driven methods recover the correct target region with only a few pixels, which leads to a lower mIoU than data-driven methods. The reason for this phenomenon is that the various constraints introduced by model-driven methods suppress the background but also suppress part of the target.
As shown in Table 6, the improvement achieved by MFAFNet over other data-driven methods is obvious. On the SIRST-Aug dataset, MFAFNet achieves a precision of 0.8376, a recall of 0.8652, an mIoU of 0.7465, and an F-measure of 0.8512, representing the highest performance among all competing methods. Compared with MiM-ISTD, which achieves an F-measure of 0.8431 and an mIoU of 0.7288, MFAFNet demonstrates improved detection stability and boundary segmentation accuracy. While IRSAM shows strong recall at 0.8493 and an F-measure of 0.8256, its mIoU of 0.7025 is still noticeably lower than that of MFAFNet, indicating inferior localization capability for small targets. STPSA-Net, as a transformer-based method emphasizing semantic token refinement, achieves competitive results, with an F-measure of 0.7954 and an mIoU of 0.6603, yet still falls short of MFAFNet in all metrics.
Classical model-driven methods such as LEF, WSLCM, RIPT, PSTNN, and MSLSTIPT exhibit relatively high precision but suffer from extremely low recall and poor mIoU values. For instance, RIPT reports a precision of 0.9484 but only achieves a recall of 0.0800 and an mIoU of 0.0797, resulting in a low F-measure of 0.1476. Similar trends are observed for PSTNN and WSLCM, where conservative detection strategies result in severe under-detection of small targets. MSLSTIPT, though designed for heterogeneous backgrounds, yields the lowest recall of 0.0400 and the lowest F-measure of 0.0766, indicating poor adaptation to the single-frame detection setting.
Among other deep learning methods, DNANet, AGPC, and ACM present severe imbalances between precision and recall. DNANet reaches a relatively high precision of 0.8420 but achieves a recall of only 0.0304 and an mIoU of 0.0302. AGPC and ACM similarly exhibit limited recall, which compromises their overall detection performance. MSHNet, which focuses on loss function design, shows improved balance between precision and recall compared to other learning-based baselines, but its F-measure of 0.2328 remains significantly lower than that of MFAFNet.
On the IRSTD-1k dataset, MFAFNet continues to demonstrate superior generalization performance. It achieves a precision of 0.7907, a recall of 0.8146, an mIoU of 0.6701, and an F-measure of 0.8025, outperforming all compared methods in terms of both localization accuracy and overall segmentation quality. MiM-ISTD ranks second with an F-measure of 0.7879 and an mIoU of 0.6500, confirming its robustness but still slightly inferior to MFAFNet. IRSAM achieves an F-measure of 0.7526 and an mIoU of 0.6034, showing strong consistency but relatively weaker region precision. STPSA-Net obtains an F-measure of 0.7417 and an mIoU of 0.5895, indicating that its semantic refinement strategy is effective but less capable in handling cluttered backgrounds compared to MFAFNet.
Model-driven methods on IRSTD-1k continue to exhibit significant performance gaps. LEF, PSTNN, and RIPT report mIoU values below 0.09, with F-measure scores ranging from 0.1073 to 0.1495. Although these methods achieve moderate precision, their recall remains too low to support practical applications in real-world remote sensing. RISTDnet performs relatively better among model-driven methods, reaching an F-measure of 0.3024 and an mIoU of 0.1782, but still lags far behind the learning-based models.
Overall, MFAFNet demonstrates state-of-the-art performance on both benchmark datasets. Its consistently high scores across all evaluation metrics validate the effectiveness of its multi-feature fusion, patch-level attention, and semantic contextual integration strategies. The results confirm that MFAFNet provides a robust and accurate solution for infrared small target detection under complex remote sensing conditions.
(2) Qualitative evaluation: To further illustrate the effectiveness of our method in complex infrared imaging scenarios, qualitative detection results are presented on two benchmark datasets, SIRST-Aug and IRSTD-1k, as shown in Figure 11, Figure 12, Figure 13 and Figure 14. Four representative scenes are selected from each dataset, covering typical challenges such as low contrast, cluttered backgrounds, and small-scale targets. The output results of ten representative methods, including both model-driven and learning-based approaches, are displayed in a unified 3D visualization format for intuitive comparison. Each method’s output is marked in the upper-left corner of the corresponding subfigure to ensure clarity.
The visual results indicate that traditional model-driven methods such as LEF, RIPT, and PSTNN tend to produce overly smooth responses or fail to highlight small targets, especially under low signal-to-clutter conditions. These methods often suppress background noise effectively but sacrifice target recall, resulting in incomplete or missing detections. For example, in cluttered background scenarios, PSTNN exhibits weak responses and underestimates target size due to its strong regularization priors.
Deep learning-based methods such as DNANet, AGPC, and ACM provide more distinct responses in the target regions but often introduce significant false alarms in background areas. This phenomenon is particularly evident in densely textured scenes, where the lack of fine-grained semantic supervision leads to over-activation around noise edges and structural clutter.
In contrast, recent Transformer-based or attention-enhanced methods like MiM-ISTD, IRSAM, and STPSA-Net show noticeable improvements in target saliency and background suppression. MiM-ISTD demonstrates strong target localization with relatively clean backgrounds but occasionally produces blurred target edges. IRSAM and STPSA-Net offer sharper segmentation boundaries; however, in low-contrast regions, they may still fail to differentiate faint targets from thermal clutter.
Our proposed MFAFNet achieves the most accurate and visually consistent results across all scenes. It effectively highlights true target regions while suppressing irrelevant background responses, even in scenes with multiple interfering artifacts or weak target contrast. The predicted masks align closely with the ground truth in both shape and position, demonstrating high confidence and completeness. The integration of multi-feature fusion and hierarchical attention mechanisms enables MFAFNet to maintain semantic consistency while capturing local structural cues critical for small target delineation.
Overall, the qualitative results corroborate the findings from quantitative evaluations. MFAFNet offers superior target saliency, precise localization, and robust background discrimination, validating its practical applicability for infrared small target detection in diverse remote sensing environments.

4.4. Efficiency and Deployment Evaluation

To further verify the practical applicability of MFAFNet under real-world constraints, especially in computationally limited remote sensing platforms, we conducted an in-depth evaluation of its inference efficiency, model complexity, and feasibility of lightweight deployment. This section complements the performance assessment by adding resource-oriented metrics, including FLOPs, model parameters, inference time, and memory footprint. Additionally, we explore the potential of replacing the ResNet-50 backbone with a lightweight alternative (MobileNetV2) to validate real-time suitability. As shown in Table 7, MFAFNet achieves a favorable balance between accuracy and efficiency, significantly reducing computational cost compared with transformer-based methods while maintaining superior detection performance.
Compared to transformer-based MiM-ISTD, MFAFNet reduces both FLOPs and inference time by ~43%, while still achieving higher mIoU and F-measure. The original MFAFNet takes only 24.7 ms per image, corresponding to ~40 FPS, which is acceptable for most real-time monitoring scenarios.
When integrated with MobileNetV2, MFAFNet-Lite achieves a significant reduction in model size and latency with only a slight performance drop (e.g., F-measure decreases by 1.8% on SIRST-Aug). This result suggests that the architecture is amenable to efficient deployment on edge devices such as onboard UAV systems or embedded GPU platforms.

5. Conclusions

In this paper, we propose MFAFNet, a novel infrared small target detection framework designed to enhance detection accuracy in complex remote sensing scenarios. The network introduces a Multi-Feature Interactive Fusion Module, which is embedded beneath the convolutional layers to effectively extract and integrate local saliency features inspired by the human visual system. This fusion strategy enriches the semantic representation of small targets while suppressing background interference. To further improve contextual understanding, we design a Patch Attention Block that captures long-range dependencies through multi-scale patch-wise attention. A Cascaded Channel and Spatial Attention Module is incorporated to adaptively refine the aggregated features across scales. Additionally, an Asymmetric Contextual Fusion Module is employed to integrate low-level spatial details with high-level semantic cues, enabling more precise and robust target localization. Ablation studies confirm the effectiveness of each individual component within the network. Comprehensive experiments on both synthetic and real-world infrared datasets demonstrate that MFAFNet achieves superior performance compared to state-of-the-art methods, with higher detection accuracy and fewer false alarms under challenging conditions.

Author Contributions

Conceptualization, W.C.; Data curation, H.W.; Formal analysis, S.D.; Funding acquisition, H.W.; Investigation, Y.C. and H.W.; Methodology, Z.Z., S.D. and Y.C.; Project administration, W.C.; Resources, Z.Z.; Software, Z.Z. and W.C.; Supervision, Y.C.; Writing—original draft, Z.Z.; Writing—review and editing, S.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the West Light Foundation of the Chinese Academy of Sciences under the Grant XAB2022YN06.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ma, J.; Pan, N.; Yin, D.; Wang, D.; Zhou, J. MEFA-Net: Multilevel feature extraction and fusion attention network for infrared small-target detection. Remote Sens. 2025, 17, 2502. [Google Scholar] [CrossRef]
  2. Li, S.; Huang, J.; Duan, Q.; Li, Z. WT-HMFF: Wavelet transform convolution and hierarchical multi-scale feature fusion network for detecting infrared small targets. Remote Sens. 2025, 17, 2268. [Google Scholar] [CrossRef]
  3. Han, Y.; Zhang, P.; Fei, C.; Wang, X. Infrared small target detection based on spatio-temporal saliency in video sequence. In Proceedings of the IEEE International Computer Conference on Wavelet Active Media Technology and Information Processing, Chengdu, China, 18–20 December 2015; pp. 279–282. [Google Scholar]
  4. Sun, X.; Guo, L.; Zhang, W.; Wang, Z.; Yu, Q. Small aerial target detection for airborne infrared detection systems using LightGBM and trajectory constraints. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 9959–9973. [Google Scholar] [CrossRef]
  5. Deshpande, S.D.; Er, M.H.; Venkateswarlu, R.; Chan, P. Max-mean and max-median filters for detection of small targets. In Signal and Data Processing of Small Targets 1999, Proceedings of the Optical Engineering and Photonics in Aerospace Sensing, Orlando, FL, USA, 20–22 July 1999; SPIE: Bellingham, WA, USA, 1999; pp. 74–83. [Google Scholar]
  6. Wang, X.; Lv, G.; Xu, L. Infrared dim target detection based on visual attention. Infrared Phys. Technol. 2012, 55, 513–521. [Google Scholar] [CrossRef]
  7. Chen, C.; Li, H.; Wei, Y.; Xia, T.; Tang, Y. A local contrast method for small infrared target detection. IEEE Trans. Geosci. Remote Sens. 2013, 52, 574–581. [Google Scholar] [CrossRef]
  8. Wei, Y.; You, X.; Li, H. Multiscale patch-based contrast measure for small infrared target detection. Pattern Recognit. 2016, 58, 216–226. [Google Scholar] [CrossRef]
  9. Zhang, H.; Zhang, L.; Yuan, D.; Chen, H. Infrared small target detection based on local intensity and gradient properties. Infrared Phys. Technol. 2018, 89, 88–96. [Google Scholar] [CrossRef]
  10. Zhang, L.; Peng, L.; Zhang, T.; Cao, S.; Peng, Z. Infrared small target detection via non-convex rank approximation minimization joint ℓ2,1 norm. Remote Sens. 2018, 10, 1821. [Google Scholar] [CrossRef]
  11. Qin, Y.; Bruzzone, L.; Gao, C.; Li, B. Infrared small target detection based on facet kernel and random walker. IEEE Trans. Geosci. Remote Sens. 2019, 57, 7104–7118. [Google Scholar] [CrossRef]
  12. Zhang, L.; Peng, Z. Infrared small target detection based on partial sum of the tensor nuclear norm. Remote Sens. 2019, 11, 382. [Google Scholar] [CrossRef]
  13. Guan, T.; Chang, S.; Deng, Y.; Xue, F.; Wang, C.; Jia, X. Oriented SAR ship detection based on edge deformable convolution and point set representation. Remote Sens. 2025, 17, 1612. [Google Scholar] [CrossRef]
  14. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Asymmetric contextual modulation for infrared small target detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual Conference, 5–9 January 2021; pp. 950–959. [Google Scholar]
  15. Wang, K.; Du, S.; Liu, C.; Cao, Z. Interior Attention-Aware Network for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 3163410. [Google Scholar] [CrossRef]
  16. Zhang, T.; Cao, S.; Pu, T.; Peng, Z. AGPCNet: Attention-guided pyramid context networks for infrared small target detection. arXiv 2021, arXiv:2111.03580. [Google Scholar] [CrossRef]
  17. Deng, H.; Sun, X.; Liu, M.; Ye, C.; Zhou, X. Small infrared target detection based on weighted local difference measure. IEEE Trans. Geosci. Remote Sens. 2016, 54, 4204–4214. [Google Scholar] [CrossRef]
  18. Han, J.; Ma, Y.; Zhou, B.; Fan, F.; Liang, K.; Fang, Y. A robust infrared small target detection algorithm based on human visual system. IEEE Geosci. Remote Sens. Lett. 2014, 11, 2168–2172. [Google Scholar]
  19. Han, J.; Liang, K.; Zhou, B.; Zhu, X.; Zhao, J.; Zhao, L. Infrared small target detection utilizing the multiscale relative local contrast measure. IEEE Geosci. Remote Sens. Lett. 2018, 15, 612–616. [Google Scholar] [CrossRef]
  20. Han, J.; Moradi, S.; Faramarzi, I.; Zhang, H.; Zhao, Q.; Zhang, X.; Li, N. Infrared small target detection based on the weighted strengthened local contrast measure. IEEE Geosci. Remote Sens. Lett. 2020, 18, 1670–1674. [Google Scholar] [CrossRef]
  21. Han, J.; Moradi, S.; Faramarzi, I.; Liu, C.; Zhang, H.; Zhao, Q. A local contrast method for infrared small-target detection utilizing a tri-layer window. IEEE Geosci. Remote Sens. Lett. 2019, 17, 1822–1826. [Google Scholar] [CrossRef]
  22. Xia, C.; Li, X.; Zhao, L.; Shu, R. Infrared small target detection based on multiscale local contrast measure using local energy factor. IEEE Geosci. Remote Sens. Lett. 2019, 17, 157–161. [Google Scholar] [CrossRef]
  23. Liu, M.; Du, H.; Zhao, Y.; Dong, L.; Hui, M.; Wang, S.X. Image small target detection based on deep learning with SNR controlled sample generation. Curr. Trends Comput. Sci. Mech. Autom. 2017, 1, 211–220. [Google Scholar]
  24. Hou, Q.; Wang, Z.; Tan, F.; Zhao, Y.; Zheng, H.; Zhang, W. RISTDnet: Robust infrared small target detection network. IEEE Geosci. Remote Sens. Lett. 2022, 19, 7000805. [Google Scholar] [CrossRef]
  25. Yang, Z.; Ma, T.; Ku, Y.; Ma, Q.; Fu, J. DFFIR-net: Infrared dim small object detection network constrained by gray-level distribution model. IEEE Trans. Instrum. Meas. 2022, 71, 5026215. [Google Scholar] [CrossRef]
  26. Chen, F.; Gao, C.; Liu, F.; Zhao, Y.; Zhou, Y.; Meng, D.; Zuo, W. Local patch network with global attention for infrared small target detection. IEEE Trans. Aerosp. Electron. Syst. 2022, 58, 3979–3991. [Google Scholar] [CrossRef]
  27. Liu, F.; Gao, C.; Chen, F.; Meng, D.; Zuo, W.; Gao, X. Infrared small and dim target detection with transformer under complex backgrounds. IEEE Trans. Image Process. 2023, 32, 5921–5932. [Google Scholar] [CrossRef] [PubMed]
  28. Li, B.; Xiao, C.; Wang, L.; Wang, Y.; Lin, Z.; Li, M.; An, W.; Guo, Y. Dense nested attention network for infrared small target detection. IEEE Trans. Image Process. 2022, 32, 1745–1758. [Google Scholar] [CrossRef]
  29. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11. [Google Scholar]
  30. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  31. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803. [Google Scholar]
  32. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  33. Dai, Z.; Cai, B.; Lin, Y.; Chen, J. UP-DETR: Unsupervised pre-training for object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 19–25 June 2021; pp. 1601–1610. [Google Scholar]
  34. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  35. Zhang, M.; Zhang, R.; Yang, Y.; Bai, H.; Zhang, J.; Guo, J. ISNet: Shape matters for infrared small target detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–23 June 2022; pp. 877–886. [Google Scholar]
  36. Sun, Y.; Yang, J.; An, W. Infrared dim and small target detection via multiple subspace learning and spatial-temporal patch-tensor model. IEEE Trans. Geosci. Remote Sens. 2020, 59, 3737–3752. [Google Scholar] [CrossRef]
  37. Liu, Q.; Liu, R.; Zheng, B. Infrared small target detection with scale and location sensitivity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 17490–17499. [Google Scholar]
  38. Zhang, M.; Wang, Y.; Guo, J. IRSAM: Advancing segment anything model for infrared small target detection. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 233–249. [Google Scholar]
  39. Liu, S.; Qiao, B.J.; Li, S. Patch spatial attention networks for semantic token transformer in infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5003014. [Google Scholar] [CrossRef]
  40. Chen, T.; Ye, Z.; Tan, Z.; Gong, T.; Wu, Y.; Chu, C.; Liu, B.; Yu, N.; Ye, J. MiM-ISTD: Mamba-in-mamba for efficient infrared small-target detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5007613. [Google Scholar] [CrossRef]
Figure 1. Results of HVS-based methods, including MPCM, LIG, and RLCM.
Figure 2. Overall architecture of the proposed MFAFNet. The network is composed of three functional modules: FIFM for feature-level enhancement, PAB for contextual modeling, and ACFM for final semantic fusion and upsampling.
Figure 3. Original images, gradient feature, intensity feature, and LIG images.
Figure 4. Illustration of the FIFM at each layer. In (a–c), the output of the FIFM and the input of the next layer have the same size.
Figure 5. Illustration of the patch attention block.
Figure 6. Illustration of the non-local network.
Figure 7. Illustration of the cascaded channel and spatial attention module.
Figure 8. Illustration of the asymmetric contextual fusion module.
Figure 9. Heatmap comparison of feature activations before and after ACFM. ACFM reduces the semantic gap by aligning low-level and high-level representations, resulting in sharper target boundaries and better clutter suppression.
Figure 10. Representative samples from the datasets.
Figure 11. Detection results of each comparison method on SIRST-Aug (Scene 1) and the corresponding 3-D displays.
Figure 12. Detection results of each comparison method on SIRST-Aug (Scene 2) and the corresponding 3-D displays.
Figure 13. Detection results of each comparison method on IRSTD-1k (Scene 1) and the corresponding 3-D displays.
Figure 14. Detection results of each comparison method on IRSTD-1k (Scene 2) and the corresponding 3-D displays.
Table 1. Details of Datasets.
Dataset | Training Images | Testing Images | Image Size | Smallest Target Size
SIRST-Aug | 8525 | 545 | 256 × 256 | 2 × 2
IRSTD-1k | 800 | 201 | 512 × 512 | 3 × 3
Table 2. Ablation Study on the FIFM.
Configuration | SIRST-Aug IoU | SIRST-Aug F-measure | IRSTD-1k IoU | IRSTD-1k F-measure
Without I and G | 0.6861 | 0.8138 | 0.6217 | 0.7713
With I | 0.7252 | 0.8440 | 0.6514 | 0.7885
With G | 0.7101 | 0.8337 | 0.6413 | 0.7803
With I and G | 0.7465 | 0.8512 | 0.6701 | 0.8025
Table 3. Ablation Study on the patch size.
Patch Size | SIRST-Aug IoU | SIRST-Aug F-measure | IRSTD-1k IoU | IRSTD-1k F-measure
[3] | 0.6986 | 0.8289 | 0.6252 | 0.7754
[3, 5] | 0.7154 | 0.8365 | 0.6326 | 0.7791
[3, 5, 6] | 0.7266 | 0.8402 | 0.6537 | 0.7965
[3, 5, 6, 10] | 0.7465 | 0.8512 | 0.6701 | 0.8025
Table 4. Isolated and Joint Impact of ACFM on Detection Performance (SIRST-Aug).
Method | Precision | Recall | mIoU | F-Measure
w/o ACFM | 0.8057 | 0.8150 | 0.7036 | 0.8102
w/o ACFM + FIFM | 0.7821 | 0.8038 | 0.6802 | 0.7928
w/o ACFM + PAB | 0.7946 | 0.7991 | 0.6855 | 0.7968
w/o ACFM + FIFM + PAB | 0.7812 | 0.7925 | 0.6824 | 0.7868
Full MFAFNet | 0.8376 | 0.8652 | 0.7465 | 0.8512
Table 5. Ablation on Loss Weights (SIRST-Aug).
Configuration | λ1 (BCE) | λ2 (IoU) | λ3 (Feature) | λ4 (Semantic) | F-Measure | mIoU
Equal Weights (default) | 1 | 1 | 1 | 1 | 0.8512 | 0.7465
Emphasis on BCE | 2 | 1 | 1 | 1 | 0.8365 | 0.7281
Emphasis on IoU | 1 | 2 | 1 | 1 | 0.8427 | 0.7346
Emphasis on Feature Supervision | 1 | 1 | 2 | 1 | 0.8442 | 0.7391
Emphasis on Semantic Alignment | 1 | 1 | 1 | 2 | 0.8475 | 0.7410
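As a concrete illustration of the weighting scheme ablated in Table 5, the sketch below combines four supervision terms with weights λ1–λ4; the BCE and soft-IoU terms are standard formulations, while the feature- and semantic-supervision losses are passed in as precomputed scalars because their exact definitions belong to the methodology section rather than this table.

```python
import torch
import torch.nn.functional as F

def soft_iou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Differentiable IoU loss on predicted probabilities and binary masks."""
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = (pred + target - pred * target).sum(dim=(1, 2, 3))
    return (1.0 - (inter + eps) / (union + eps)).mean()

def total_loss(pred, target, feat_loss, sem_loss, lambdas=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four supervision terms (lambda1..lambda4 as in Table 5)."""
    l1, l2, l3, l4 = lambdas
    bce = F.binary_cross_entropy(pred, target)
    return l1 * bce + l2 * soft_iou_loss(pred, target) + l3 * feat_loss + l4 * sem_loss

# Example: the "Emphasis on IoU" row corresponds to lambdas = (1, 2, 1, 1).
pred = torch.rand(2, 1, 64, 64)                     # predicted probabilities
target = (torch.rand(2, 1, 64, 64) > 0.95).float()  # sparse small-target mask
loss = total_loss(pred, target,
                  feat_loss=torch.tensor(0.1), sem_loss=torch.tensor(0.1),
                  lambdas=(1, 2, 1, 1))
```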
Table 6. Comparison with State-of-the-art Methods on SIRST-Aug and IRSTD-1k.
Method | SIRST-Aug (Precision / Recall / mIoU / F-measure) | IRSTD-1k (Precision / Recall / mIoU / F-measure)
LEF | 0.6837 / 0.3929 / 0.3325 / 0.4990 | 0.0832 / 0.3248 / 0.0709 / 0.1324
WSLCM | 0.9334 / 0.1038 / 0.1030 / 0.1868 | 0.5443 / 0.0952 / 0.0882 / 0.1621
RIPT | 0.9484 / 0.0800 / 0.0797 / 0.1476 | 0.5765 / 0.0591 / 0.0567 / 0.1073
PSTNN | 0.9421 / 0.1043 / 0.1036 / 0.1878 | 0.4838 / 0.0884 / 0.0808 / 0.1495
MSLSTIPT | 0.8810 / 0.0400 / 0.0398 / 0.0766 | 0.6602 / 0.0470 / 0.0458 / 0.0877
ACM | 0.6145 / 0.1061 / 0.0994 / 0.1809 | 0.8117 / 0.1169 / 0.1138 / 0.2044
DNANet | 0.8420 / 0.0304 / 0.0302 / 0.0587 | 0.8324 / 0.0540 / 0.0535 / 0.1015
AGPC | 0.9531 / 0.0605 / 0.0603 / 0.1138 | 0.6208 / 0.0465 / 0.0453 / 0.0866
MSHNet | 0.9575 / 0.1325 / 0.1317 / 0.2328 | 0.2594 / 0.2072 / 0.1302 / 0.2303
Trans-IRSD | 0.9670 / 0.0740 / 0.0738 / 0.1374 | 0.6020 / 0.1066 / 0.0996 / 0.1812
RISTDnet | 0.6595 / 0.1229 / 0.1155 / 0.2072 | 0.4236 / 0.2352 / 0.1782 / 0.3024
IRSAM | 0.8031 / 0.8493 / 0.7025 / 0.8256 | 0.7107 / 0.7998 / 0.6034 / 0.7526
STPSA-Net | 0.8384 / 0.7566 / 0.6603 / 0.7954 | 0.7653 / 0.7195 / 0.5895 / 0.7417
MiM-ISTD | 0.8323 / 0.8542 / 0.7288 / 0.8431 | 0.7850 / 0.7908 / 0.6500 / 0.7879
MFAFNet | 0.8376 / 0.8652 / 0.7465 / 0.8512 | 0.7907 / 0.8146 / 0.6701 / 0.8025
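For reference, the pixel-level Precision, Recall, IoU, and F-measure values reported in Table 6 are conventionally computed from binarized prediction masks as sketched below; the 0.5 binarization threshold and the pooling of pixels across an image are assumptions made for illustration rather than details taken from the paper.

```python
import numpy as np

def segmentation_metrics(pred_prob: np.ndarray, gt_mask: np.ndarray, thr: float = 0.5) -> dict:
    """Pixel-level Precision, Recall, IoU, and F-measure for a prediction/ground-truth pair."""
    pred = pred_prob > thr
    gt = gt_mask.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    iou = tp / (tp + fp + fn + 1e-9)
    f_measure = 2 * precision * recall / (precision + recall + 1e-9)
    return {"precision": precision, "recall": recall, "iou": iou, "f_measure": f_measure}

# Example with random data; in practice pred_prob is the network's output saliency map.
print(segmentation_metrics(np.random.rand(256, 256), np.random.rand(256, 256) > 0.99))
```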
Table 7. Computational and memory consumption of MFAFNet and other representative methods on the SIRST-Aug dataset, measured on an NVIDIA RTX 4090 GPU.
Method | Params (M) | FLOPs (G) | Inference Time (ms/img) | GPU Memory (MB)
DNANet | 9.2 | 21.8 | 13.6 | 1790
Trans-IRSD | 38.7 | 102.3 | 48.2 | 3110
MiM-ISTD | 27.3 | 61.5 | 37.4 | 2685
MFAFNet | 16.8 | 34.9 | 24.7 | 2163
MFAFNet-Lite (MobileNetV2) | 6.1 | 13.7 | 11.2 | 1420
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
