A Gradient-Compensated Feature Learning Network for Infrared Small Target Detection

Wang, Yanwei; Zhang, Haitao; Zhang, Xiangyue; Zheng, Xinhao

doi:10.3390/electronics15040868

¹ State Key Laboratory of Precision Space-Time Information Sensing Technology, Department of Precision Instrument, Tsinghua University, Beijing 100084, China

² Faculty of Robot Science and Engineering, Northeastern University, Shenyang 110819, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(4), 868; https://doi.org/10.3390/electronics15040868

Submission received: 26 January 2026 / Revised: 10 February 2026 / Accepted: 18 February 2026 / Published: 19 February 2026

Abstract

Infrared small target detection under complex backgrounds remains challenging due to the extremely small target size and low contrast with the surrounding background. These factors make contour information difficult to extract and often cause target features to attenuate or disappear during deep feature learning. To address these issues, this paper proposes a Gradient-Compensation-based Feature Learning Network (GCFLNet). GCFLNet adopts a multi-module collaborative design to enhance feature representation and fusion. First, an Edge Enhancement Module (EEM) is introduced to accurately capture fine-grained edge information of infrared small targets while suppressing background noise through smoothing operations. This provides reliable structural cues for subsequent feature extraction. Second, the extracted edge features are embedded into a Global–Local Feature Interaction (GLFI) module, which combines a self-attention-inspired mechanism with dilated convolutions to strengthen both global semantic dependencies and local detail representation, enabling effective enhancement of target features. In addition, a Multi-Scale Information Compensation (MSIC) module is designed to exploit the complementary characteristics of multi-scale features across spatial and channel dimensions, guiding efficient fusion of high-level and low-level information. Experimental results on the NUDT and IRSTD-1K datasets demonstrate that GCFLNet outperforms existing state-of-the-art methods, achieving higher detection accuracy and robustness for infrared small targets in complex backgrounds.

Keywords: gradient compensation; feature interaction; infrared small target detection

1. Introduction

Infrared small target detection has attracted increasing attention due to its inherent robustness to illumination and weather conditions, making it highly valuable for a wide range of critical applications, such as remote sensing surveillance, fire early warning, maritime security, and precision guidance. As a core component of infrared target detection and tracking systems, it aims to capture and identify small targets in diverse scenarios using infrared imaging, thereby providing reliable information for subsequent decision-making and execution. However, compared with detection in visible-light images, infrared small target detection faces substantial challenges arising from complex imaging environments, unique imaging mechanisms, and demanding application requirements. First, in long-range imaging scenarios, infrared small targets usually occupy only a few pixels and lack clear shape, fixed scale, and rich texture information, resulting in extremely limited discriminative features. Second, complex background interference caused by clouds, birds, clutter, and other distractors often overwhelms weak targets, leading to high false alarm rates. Third, real-world applications such as precision guidance and real-time surveillance impose strict requirements on computational efficiency, while improving detection reliability typically increases model complexity, which makes real-time performance harder to achieve. Therefore, designing infrared small target detection algorithms that achieve high detection accuracy, low false alarm rates, and high efficiency remains a fundamental and challenging research problem.

In the early stages of infrared small target detection, numerous traditional methods were proposed, mainly including sparse matrix-based methods [1,2,3], filter-based methods [4,5,6], and local contrast-based methods [7,8,9]. Despite their effectiveness in specific scenarios, these methods suffer from inherent limitations. On the one hand, their performance heavily relies on handcrafted priors and manually tuned parameters, making them difficult to generalize across diverse and complex environments. On the other hand, their robustness to noise and clutter is limited, which restricts their ability to suppress background interference and accurately extract weak target features, leading to degraded detection performance in practical applications.

In recent years, deep learning has achieved remarkable success in computer vision and related fields, providing powerful end-to-end feature learning capabilities. Compared with traditional approaches, deep learning-based infrared small target detection methods can automatically learn discriminative representations without handcrafted features, offering superior robustness and generalization. However, due to the extremely small target size and complex backgrounds, target features are still prone to attenuation and loss during deep network propagation, which limits further performance improvements. Existing methods have attempted to alleviate these issues but remain constrained by notable drawbacks. For example, Li et al. [10] proposed a Densely Nested Attention Network (DNANet), which employs densely nested U-shaped architectures and channel-spatial attention to enhance feature interaction, but suffers from slow inference speed and limited edge localization accuracy. Li et al. [11] introduced a Multi-Directional Learnable Edge Information-Assisted Dense Nested Network, where edge information of small targets is explicitly extracted and fused with hierarchical features via dense connections, thereby strengthening the network’s perception of target boundary structures. Sun et al. [12] introduced a Receptive Field and Direction Induced Attention Network (RDIAN), which leverages receptive field and directional attention to mitigate target-background imbalance, achieving high efficiency at the cost of reduced feature representation capability.

To overcome these limitations, this paper proposes a Gradient-Compensation-based Feature Learning Network (GCFLNet) for infrared small target detection. The term “gradient-compensated” emphasizes the design philosophy of continuously reinforcing edge and structural cues during multi-stage feature learning, rather than introducing a standalone gradient operator. GCFLNet integrates the local modeling capability of convolutional neural networks with Transformer-inspired global context modeling, enabling effective representation of infrared small targets under complex backgrounds. By exploiting infrared gradient vector fields, the proposed network enhances target feature representation, enabling cleaner and more detailed feature extraction. In the encoder stage, an Edge Enhancement Module (EEM) is designed to suppress background noise while accurately capturing fine-grained target edge information. Subsequently, a Global–Local Feature Interaction (GLFI) module is introduced to cooperate with EEM, further reducing noise interference and achieving effective fusion of local details and global semantic information. In the decoder stage, a Multi-Scale Information Compensation (MSIC) module is proposed to promote deep interaction and fusion of multi-scale features by optimizing skip connections and up-sampling operations, thereby jointly improving localization accuracy and segmentation quality. The main contributions of this work are summarized as follows:

To address the issues of blurred edges and background noise interference in infrared small targets, an edge enhancement module (EEM) is proposed. By leveraging infrared gradient vector field characteristics, EEM accurately captures fine target contours while adaptively suppressing background clutter and noise, effectively enhancing target–background discrimination.
To jointly model local details and global semantic dependencies, a novel global–local feature interaction module (GLFI) is designed. By combining the local correlation modeling of CNNs with an attention-inspired global context modeling mechanism, and using edge-enhanced features as guidance, GLFI enables effective fusion of local and global features while further suppressing residual noise.
To fully exploit the complementary characteristics of multi-scale features, a multi-scale information compensation module (MSIC) is introduced. This module explores spatial and channel-wise differences across feature scales and facilitates adaptive interaction between high-level and low-level features, generating more discriminative representations for infrared small target detection.

2. Related Work

2.1. Model-Driven Detection Method

To meet the core requirement of background suppression and target enhancement in infrared small target detection, extensive studies have been conducted following a model-driven paradigm. These methods can generally be categorized into three groups: sparse-matrix-based methods, filter-based methods, and local-contrast-based methods.

Sparse-matrix-based methods exploit the inherent sparsity of infrared small targets by modeling infrared images as a combination of sparse targets and dense backgrounds. Target detection is achieved by solving matrix or tensor decomposition problems. Representative approaches include the Infrared Patch-Image (IPI) model [1], the Reweighted Infrared Patch Tensor (RIPT) method [2], the Partial Sum of Tensor Nuclear Norm (PSTNN) model [3], and non-convex tensor Tucker decomposition methods. Although these approaches can separate targets from backgrounds by leveraging sparsity priors, they generally exhibit limited robustness to background clutter and poor generalization capability, making them difficult to adapt to complex and dynamic scenes.

Filter-based methods aim to enhance target signals and suppress background noise through the design of specific filters, such as Top-Hat filtering [4], Max–Median filtering [5], and Principal Component Pursuit (PCP) [6]. These methods are computationally efficient and perform well in simple and smooth background scenarios. However, in complex environments such as heavy cloud cover or sea-sky backgrounds, fixed filter designs struggle to accommodate diverse interference patterns, often resulting in excessive false alarms and limited robustness.

Local-contrast-based methods construct contrast measures based on local differences between targets and backgrounds in terms of intensity or texture. Typical methods include the Local Contrast Measure (LCM) [7], Weighted Strengthened Local Contrast Measure (WSLCM) [8], and Three-Layer Local Contrast Measure (TLLCM) [9]. Most of these methods rely on the assumption that targets are brighter than their surrounding regions. Consequently, dim small targets embedded in complex cloud layers or high-intensity noise may be missed due to contrast degradation, limiting their scene adaptability.

In summary, traditional model-driven methods are constrained by fixed prior assumptions and hand-crafted designs, preventing them from adaptively learning deep target features. Their limited robustness to clutter and poor adaptability to complex backgrounds significantly restrict their detection performance in real-world scenarios.

2.2. Data-Driven Detection Method

With the rapid development of deep learning, data-driven methods have been widely adopted in infrared small target detection due to their strong capability for adaptive feature learning and data modeling. Compared with traditional model-driven approaches, deep learning methods generally demonstrate superior robustness and detection performance in complex backgrounds.

One major research direction focuses on balancing detection accuracy and inference speed through lightweight network design to meet real-time application requirements. Dai et al. [13] proposed an Asymmetric Context Modulation (ACM) module that aggregates high-level semantic information and low-level detailed features via bidirectional pathways. Hou et al. [14] developed a U-shaped network that transforms single-frame infrared images into small target likelihood maps and enhances feature representation through feature grouping during down-sampling. Ma et al. [15] introduced MiniIRNet, a lightweight model that enriches target representations using multi-scale contextual extraction and shallow–deep feature fusion. Hou et al. [16] proposed RISTDNet, which integrates handcrafted features with convolutional neural networks to establish a mapping between feature maps and target likelihoods. Kou et al. [17] designed LW-IRSTNet, employing standard convolutions, depth-wise separable convolutions, dilated convolutions, and asymmetric convolutions to replace complex fusion modules. Liu et al. [18] proposed WSHNet, which weights IoU loss based on target scale and introduces center-based penalties for precise localization. Wu et al. [19] presented RepISDNet, a re-parameterized network that adopts different architectures for training and inference while maintaining equivalent parameters. Although these lightweight models achieve favorable real-time performance, their simplified architectures often limit feature representation capability, leading to missed detections and false alarms in complex scenes.

Another research direction emphasizes improving detection accuracy through complex network designs to prevent target feature degradation. Li et al. [20] proposed AGPCNet, which incorporates attention mechanisms to jointly model local semantic correlations and global contextual information. Dai et al. [21] introduced ALCNet, embedding traditional local contrast measurements into an end-to-end network as deep non-parametric nonlinear refinement layers to encode long-range contextual interactions. Zhang et al. [22] proposed ISNet, which enhances target-background contrast by aggregating and strengthening multi-level edge information. Zhou et al. [23] developed MDvsFA based on generative adversarial networks, decomposing the generator into dual sub-tasks to achieve Nash equilibrium. Wu et al. [24] proposed UIUNet, embedding a micro U-Net into a larger U-Net backbone to achieve multi-scale and multi-level feature representation. Wu et al. [25] further introduced an interpretable deep unfolding framework by modeling infrared small target detection as a relaxed RPCA problem. Pan et al. [26] proposed ABCNet, a bilinear correlation attention model that combines CNNs and Transformers to enhance target features while suppressing noise. Du et al. [27] developed IDNANet, an enhanced densely connected attention network that repeatedly fuses and reinforces contextual information to preserve fine target details.

While these methods significantly improve detection accuracy, the performance gains are often accompanied by increased computational complexity and memory consumption, making them difficult to deploy in real-time scenarios and highlighting the inherent trade-off between accuracy and efficiency.

3. Methodology

In this work, the term “compensation” refers to a feature enhancement strategy that supplements structural or spatial information weakened during deep feature extraction, rather than a specific mathematical operator. The proposed gradient-compensated framework aims to continuously reinforce edge-aware and multi-scale information across different stages of the network. Figure 1 illustrates the overall pipeline of the proposed GCFLNet. The network follows the encoder–decoder architecture of U-Net, fully exploiting its strengths in feature fusion and spatial information recovery. First, the infrared image is fed into the encoder for feature extraction. During this process, the encoder is jointly assisted by the Edge Enhancement Module (EEM), which suppresses background noise and accurately extracts fine target edge contours, providing effective edge information compensation for the encoded features. Specifically, the encoder first performs preliminary feature encoding through basic convolutional blocks. The resulting features are then processed by the Global–Local Feature Interaction (GLFI) module to capture long-range global dependencies and local detailed representations. Subsequently, the features extracted by GLFI are deeply fused with the edge-enhanced features generated by the EEM, producing enhanced representations that simultaneously preserve semantic information and fine edge details. During feature transmission and spatial reconstruction, the encoder and decoder are connected via skip connections. Along these pathways, the Multi-Scale Information Compensation (MSIC) module is employed to adaptively integrate high-level and low-level features, effectively compensating for the spatial information loss caused by down-sampling in the encoder and progressively restoring the spatial resolution of feature maps. 
Finally, the fused features are fed into the segmentation head, where pointwise convolution layers are used to adjust the channel dimensions and generate the final prediction, producing the segmentation results of infrared small targets.

3.1. Edge Enhancement Module

Edge contours in images represent the intrinsic boundary characteristics between targets and the background or between different homogeneous regions. They not only convey fundamental geometric shape and spatial location information of targets, but also serve as a critical discriminative cue for distinguishing targets from complex backgrounds in infrared small target detection. Unlike visible-light images, which are dominated by rich texture information, infrared images are formed based on thermal radiation differences between targets and backgrounds. As a result, infrared imagery typically exhibits blurred textures, a narrow gray-level dynamic range, and low contrast, rendering traditional texture-dependent detection methods ineffective. In contrast, edge contours generated by abrupt radiation changes between targets and backgrounds can clearly delineate the true shape and spatial extent of targets, effectively compensating for the lack of texture information. Therefore, edge features constitute a crucial entry point for infrared small target detection.

The distribution patterns of gradient vectors show significant variations across different infrared imaging scenes. In smooth background regions or cluttered noise areas, gradient vectors tend to be randomly distributed in both direction and magnitude, showing no clear structural regularity. In contrast, within regions containing infrared small targets, gradient vectors display a strong convergent pattern toward the target center. This distinctive distribution accurately characterizes the target’s edge locations and contour structures. Motivated by this observation, we propose an Edge Enhancement Module (EEM), which deeply integrates multi-scale gradient magnitude maps with features from the main encoder branch. By embedding gradient-based edge information into multi-level network features, EEM effectively compensates for the loss of fine edge details caused by progressive down-sampling in the encoder, guiding the network to focus on the structural boundaries of infrared small targets and significantly reducing the risk of targets being overwhelmed by complex backgrounds due to their small size and low contrast.

The EEM consists of two core components, an Edge Feature Extraction Unit (EFEU) and Central Difference Convolution (CDC) [28], as shown in Figure 2. In practice, multi-scale feature maps are fed into the EEM, where EFEU and CDC work collaboratively to extract accurate edge information while smoothing background noise. Specifically, EFEU employs an adaptive gradient vector computation mechanism to estimate the gradient direction and magnitude at each pixel location. By amplifying gradient differences between small targets and background clutter, EFEU extracts highly discriminative edge features from the original representations, enabling effective separation of targets from noise and faithfully preserving fine edge details. The EFEU employs a 3 × 3 convolution kernel to compute image gradients, where the Sobel operator is used to initialize the convolution weights. Specifically, the central column of the kernel in the x-direction and the central row of the kernel in the y-direction are fixed to zero, while the remaining parameters are dynamically optimized during training to adapt to diverse image characteristics and application scenarios. After obtaining the horizontal and vertical gradient responses at each spatial location, the gradient magnitude is calculated using the Euclidean norm.
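The gradient computation described above can be illustrated with a minimal sketch. It assumes fixed Sobel weights (the learnable update of the non-zero entries is omitted), zero padding, and a single-channel input; the function names `correlate3x3` and `gradient_magnitude` are hypothetical and not taken from the paper.

```python
import math

# Sobel kernels, used here only as the initial weights. In the description
# above the non-zero entries would be learned; this sketch keeps them fixed.
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]    # central column stays zero
SOBEL_Y = [[-1, -2, -1],
           [0, 0, 0],
           [1, 2, 1]]     # central row stays zero


def correlate3x3(image, kernel):
    """Apply a 3x3 kernel with zero padding (cross-correlation, as in CNN layers)."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di + 1][dj + 1] * image[y][x]
            out[i][j] = acc
    return out


def gradient_magnitude(image):
    """Euclidean norm of the horizontal and vertical responses at each pixel."""
    gx = correlate3x3(image, SOBEL_X)
    gy = correlate3x3(image, SOBEL_Y)
    return [[math.hypot(a, b) for a, b in zip(row_x, row_y)]
            for row_x, row_y in zip(gx, gy)]


# A 5x5 patch with one bright pixel standing in for a small target.
patch = [[0.0] * 5 for _ in range(5)]
patch[2][2] = 1.0
mag = gradient_magnitude(patch)
```

In this toy patch the response forms a ring around the bright pixel and vanishes at its center, which is the kind of structure the edge branch is meant to expose.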

However, while EFEU enhances fine target edges, it may introduce excessive smoothing of gradient information, potentially leading to the loss of subtle edge details. To address this limitation, CDC serves as a complementary component. Although central difference convolution was originally introduced for texture-sensitive tasks, in this work it is employed to emphasize local intensity variations and edge transitions rather than fine-grained texture patterns. This property is particularly suitable for infrared imagery, where structural gradients dominate over texture cues. By performing differential operations through weighted neighborhood aggregation, CDC effectively highlights abrupt intensity variations and exhibits strong responses to the overall location and orientation of target edges. Although CDC can be sensitive to background clutter in noisy scenes, its high computational efficiency and robust representation of global contours compensate well for the shortcomings of EFEU. Within the EEM, fine-grained gradient features extracted by EFEU are adaptively fused with robust contour features produced by CDC. This fusion strategy alleviates the over-smoothing issue inherent in single gradient-based methods while suppressing noise interference in edge detection. Consequently, the proposed EEM enhances discriminative edge structures without relying on texture characteristics and preserves detailed target boundaries without sacrificing robustness, producing stable and high-quality edge-enhanced features. These features provide reliable and discriminative guidance for subsequent detection stages, ultimately improving the overall performance of infrared small target detection.
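As a rough illustration of the differential aggregation performed by CDC, the sketch below follows the commonly used formulation in which the vanilla convolution response is reduced by `theta` times the center value multiplied by the sum of the kernel weights. The value of `theta`, the zero padding, and the averaging kernel are assumptions for illustration only; the settings used in GCFLNet are not stated here.

```python
def central_difference_conv(image, kernel, theta=0.7):
    """3x3 central difference convolution on a 2D list with zero padding.

    output = vanilla response - theta * x(center) * sum(kernel weights)
    theta = 0 reduces to an ordinary convolution; theta = 1 keeps only
    the differences between each neighbor and the center pixel.
    """
    h, w = len(image), len(image[0])
    k_sum = sum(sum(row) for row in kernel)
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            vanilla = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        vanilla += kernel[di + 1][dj + 1] * image[y][x]
            out[i][j] = vanilla - theta * image[i][j] * k_sum
    return out


# Illustrative inputs: a flat background and the same background with one hot pixel.
flat = [[5.0] * 5 for _ in range(5)]
spot = [row[:] for row in flat]
spot[2][2] = 9.0
mean_kernel = [[1 / 9] * 3 for _ in range(3)]   # hypothetical weights

flat_resp = central_difference_conv(flat, mean_kernel, theta=1.0)
spot_resp = central_difference_conv(spot, mean_kernel, theta=1.0)
```

With `theta=1.0` the interior of the flat background gives a response of approximately zero, while the hot pixel and its neighbors give non-zero responses, which matches the claim that CDC emphasizes abrupt intensity changes rather than absolute intensity.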

3.2. Global–Local Feature Interaction Module

Existing infrared small target detection networks often suffer from an imbalance in feature modeling, where low-frequency global information is overemphasized while high-frequency global information is handled in a relatively coarse manner. This limitation results in insufficient fine-grained modeling of global information, making it difficult to effectively capture edge details and spatial correlations of infrared small targets. To address this issue, we revisit the complementary relationship between local and global features, as well as the intrinsic differences between convolutional neural networks (CNNs) and self-attention-inspired global context reasoning. Accordingly, we propose a Global–Local Feature Interaction Module (GLFI), which organically integrates globally shared convolutional weights with context-aware attention weights. This design enables precise modeling of high-frequency local details while efficiently capturing long-range dependencies across different spatial locations, thereby enhancing the model’s perception capability for infrared small targets.

As illustrated in Figure 3, the GLFI module approximates global attention via pooling-based weighting to achieve deep interaction between global and local features. Its core architecture consists of convolution layers, dilated convolution layers, and fully connected layers. Specifically, GLFI extracts local features through a parallel combination of standard convolution and dilated convolution with different dilation rates. Benefiting from dilated sampling, dilated convolution can significantly enlarge the receptive field without adding parameters or per-output multiply-accumulate operations compared with a standard convolution of the same kernel size, allowing for more effective modeling of long-range dependencies. However, due to the inherent non-local self-correlation properties of infrared images, the sparse sampling pattern of dilated convolution may lead to the loss of fine local information. To mitigate this issue, standard convolution is introduced as a complementary branch. Its dense sampling characteristic compensates for the information loss caused by dilated convolution, ensuring the integrity and richness of local feature representations.
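The trade-off between dilated and standard convolution can be made concrete with a short calculation. The sketch below assumes a 3 × 3 kernel and dilation rates of 1, 2, and 4, which are illustrative values rather than the rates used in GLFI; the helper names are hypothetical.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def sampled_offsets(kernel_size, dilation):
    """Offsets (relative to the center) that the kernel actually reads."""
    half = kernel_size // 2
    return [dilation * k for k in range(-half, half + 1)]


# Illustrative dilation rates; the number of weights per axis stays at 3.
summary = {d: (effective_kernel_size(3, d), sampled_offsets(3, d))
           for d in (1, 2, 4)}
for d, (extent, offsets) in summary.items():
    print(f"dilation={d}: extent={extent}, offsets={offsets}")
```

The extent grows from 3 to 9 pixels while only three positions per axis are read, so the pixels between the sampled offsets are skipped. This is the gap that the dense standard-convolution branch is intended to fill.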

While extracting local features, GLFI further introduces point-wise convolution (PWConv) to efficiently fuse the input features with the edge-enhanced features extracted by EEM. Leveraging globally shared weights, PWConv reinforces the guiding role of edge information in feature representation and enhances the sensitivity of the network to target boundaries. Subsequently, a multi-dimensional feature aggregation strategy combining adaptive max pooling and adaptive average pooling is employed to aggregate fused features along different spatial dimensions. Adaptive max pooling highlights the most salient local responses, strengthening critical gradient cues of infrared small targets, whereas adaptive average pooling smooths background clutter while preserving global statistical characteristics. By performing matrix multiplication on the outputs of these two pooling operations, GLFI generates context-aware attention weights that dynamically characterize the importance of features at different spatial locations.
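One possible reading of the pooling-based weighting is sketched below for a single channel. It assumes that max pooling collapses the width (giving an H × 1 vector), that average pooling collapses the height (giving a 1 × W vector), and that their matrix product is passed through a sigmoid. The pooling axes, the sigmoid, and the function names are assumptions for illustration; the exact layout in GLFI may differ.

```python
import math


def pooled_attention(feature):
    """Build an H x W attention map from two directional pooling results."""
    h, w = len(feature), len(feature[0])
    row_max = [max(row) for row in feature]                      # H x 1, max over width
    col_avg = [sum(feature[i][j] for i in range(h)) / h
               for j in range(w)]                                # 1 x W, mean over height
    # (H x 1) @ (1 x W) is an outer product; sigmoid keeps weights in (0, 1).
    return [[1.0 / (1.0 + math.exp(-row_max[i] * col_avg[j]))
             for j in range(w)] for i in range(h)]


def apply_attention(local_feature, attention):
    """Element-wise reweighting of the local feature map."""
    return [[v * a for v, a in zip(row_f, row_a)]
            for row_f, row_a in zip(local_feature, attention)]


# A 3x3 map with one salient response in the middle.
feature = [[0.0, 0.0, 0.0],
           [0.0, 4.0, 0.0],
           [0.0, 0.0, 0.0]]
attention = pooled_attention(feature)
weighted = apply_attention(feature, attention)
```

In this example the location where the salient row and the salient column intersect receives the largest weight, while all other locations stay at the neutral value of 0.5.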

Finally, the refined local feature maps are multiplied with the learned attention weights to achieve weighted fusion of local details and global contextual information, thereby establishing feature dependencies across the entire spatial domain. During inference, the GLFI module maintains a lightweight design while enabling the network to jointly model global context and high-frequency local details. This effectively enhances feature attention to potential small target regions and provides more discriminative feature representations for subsequent detection tasks.

3.3. Multi-Scale Information Compensation Module

In deep learning-based feature extraction frameworks, high-level and low-level features exhibit pronounced complementarity. High-level features, obtained through successive layers of abstraction and aggregation, encode rich semantic information and category-discriminative cues, enabling strong high-level semantic understanding and accurate characterization of a target’s intrinsic attributes. However, their spatial resolution is relatively low, leading to sparsification and loss of spatial information, which limits their ability to represent fine-grained edges and precise positional details. In contrast, low-level features preserve abundant low-level visual details and local spatial information, allowing accurate localization of subtle target structures. Nevertheless, these features primarily describe primitive data attributes, exhibit weak inter-channel semantic correlations, and contain substantial redundant information due to the lack of effective global contextual constraints. Conventional skip-connection mechanisms simply concatenate or add low-level encoder features to high-level decoder features, without adequately modeling the intrinsic relationships across different feature hierarchies. As a result, the fused representations often fail to jointly preserve semantic richness and spatial detail.

To address this limitation, we propose a Multi-Scale Information Compensation Module (MSIC), which enables efficient fusion of high- and low-level features through differentiated feature enhancement strategies. As illustrated in Figure 4, MSIC designs dedicated optimization paths tailored to the characteristics of features at different scales. For high-level features, a spatial attention mechanism is introduced to reweight spatial dimensions, enhancing spatial details and local structural information, thereby compensating for the inherent spatial resolution deficiency and improving the model’s perception of target location and contour structure. For low-level features, a channel attention mechanism is employed to adaptively learn the importance of each channel, emphasizing discriminative channel responses while suppressing redundant and irrelevant information, thus strengthening inter-channel semantic correlations and endowing low-level features with enhanced global contextual awareness. In MSIC, information compensation is implemented through attention-based feature modulation rather than direct concatenation or residual addition. Through this differentiated attention-driven enhancement and fusion strategy, the MSIC module adaptively exploits the complementary relationships between high- and low-level features, producing feature representations that simultaneously preserve rich semantic information and fine-grained spatial details. Moreover, it effectively suppresses redundant information and background noise, significantly improving the purity, discriminability, and effectiveness of the fused features.
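The two attention paths can be sketched in simplified form as follows. The sketch assumes parameter-free attention (a sigmoid of the pooled statistics) in place of learned layers, and features stored as nested lists of shape [C][H][W]. How the two modulated features are finally merged is not specified above, so the fusion step is left out. All names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feat):
    """Low-level path: scale each channel by a weight from its global average."""
    out = []
    for channel in feat:
        n = len(channel) * len(channel[0])
        weight = sigmoid(sum(map(sum, channel)) / n)
        out.append([[weight * v for v in row] for row in channel])
    return out


def spatial_attention(feat):
    """High-level path: scale each location by a weight from its channel mean."""
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    weights = [[sigmoid(sum(feat[k][i][j] for k in range(c)) / c)
                for j in range(w)] for i in range(h)]
    return [[[weights[i][j] * feat[k][i][j] for j in range(w)]
             for i in range(h)] for k in range(c)]


# Toy low-level feature: one informative channel and one negative channel.
low = [[[2.0, 2.0], [2.0, 2.0]],
       [[-2.0, -2.0], [-2.0, -2.0]]]
# Toy high-level feature: both channels respond at the top-left location.
high = [[[3.0, 0.0], [0.0, 0.0]],
        [[3.0, 0.0], [0.0, 0.0]]]

low_mod = channel_attention(low)
high_mod = spatial_attention(high)
```

The channel path keeps most of the first channel and attenuates the second, while the spatial path keeps the response at the top-left location nearly unchanged.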

4. Experiment

4.1. Experimental Data and Experimental Settings

Comprehensive experiments are conducted on two public infrared small target detection and segmentation benchmarks, IRSTD-1K and NUDT, to thoroughly evaluate the detection performance and generalization capability of the proposed GCFLNet. The IRSTD-1K dataset contains 1000 real-world infrared images collected under diverse scenarios, covering various target shapes, scale distributions, and complex cluttered backgrounds. All samples are provided with pixel-level accurate annotations, offering highly reliable supervision for model training and evaluation. The NUDT dataset is a large-scale single-frame infrared small target detection benchmark, consisting of 1327 images rendered by combining real infrared backgrounds with synthetic infrared targets. It includes representative complex scenes such as urban areas, fields, oceans, and cloud layers, enabling effective assessment of model robustness across different environments.

For comparative experiments, traditional methods are implemented on a workstation equipped with an Intel® Core™ i5-8300H @ 2.30 GHz CPU and an NVIDIA GeForce GTX 1060 GPU, while deep learning-based methods are evaluated on a high-performance server with an Intel® Xeon® E5-2620 v4 @ 2.10 GHz CPU and an NVIDIA TITAN XP 12 GB GPU. For methods without publicly available code, the official results reported in their original publications are directly adopted for comparison. For methods with open-source implementations, retraining is performed without modifying network architectures or hyperparameter settings to ensure fair and objective comparisons. For the proposed GCFLNet, the batch size is set to 16 and the maximum number of training epochs is 1500. SoftIoULoss is employed to accommodate the pixel-level supervision requirements of infrared small target segmentation. The AdamW optimizer is adopted to improve training stability and mitigate overfitting. A multi-stage learning rate decay strategy is applied to dynamically adjust the learning rate during training. In addition, a deep supervision scheme is introduced by imposing supervisory signals at multiple network levels, further accelerating convergence and enhancing detection accuracy.
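The loss and the learning rate schedule can be sketched as follows. The soft IoU form with a smoothing constant of 1 is a common choice in infrared small target segmentation code and is assumed here, as are the base learning rate, the milestones, and the decay factor, none of which are reported above. Inputs are flattened lists of probabilities and binary labels.

```python
def soft_iou_loss(probs, targets, smooth=1.0):
    """1 - soft IoU between predicted probabilities and binary labels."""
    inter = sum(p * t for p, t in zip(probs, targets))
    total = sum(probs) + sum(targets)
    return 1.0 - (inter + smooth) / (total - inter + smooth)


def multistep_lr(base_lr, epoch, milestones, gamma=0.1):
    """Multiply the base rate by gamma once for every milestone already reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


# Hypothetical settings for illustration only.
perfect = soft_iou_loss([1.0, 0.0, 0.0], [1, 0, 0])
missed = soft_iou_loss([1.0, 0.0], [0, 1])
schedule = [multistep_lr(1e-3, e, milestones=[500, 1000]) for e in (0, 500, 1200)]
```

A perfect prediction gives a loss of zero, and a prediction that does not overlap the target at all gives a loss well above zero, so the loss directly rewards overlap even when targets cover very few pixels.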

4.2. Performance Evaluation

Performance evaluation for infrared small target detection should jointly consider shape representation accuracy and localization reliability, as a single metric is insufficient to comprehensively reflect the overall capability of a model. Conventional pixel-level metrics widely used in object detection and segmentation, such as Intersection over Union (IoU), Precision, and Recall, mainly focus on quantifying the geometric similarity between predicted results and ground-truth annotations. However, infrared small targets are typically characterized by blurred contours, weak texture cues, and extremely limited spatial extent. Relying solely on shape-oriented metrics is therefore inadequate for accurately assessing a model’s localization performance. Additional task-specific metrics are required to establish a more complete evaluation framework. To this end, a multi-dimensional evaluation protocol is adopted in this work. IoU is used as the fundamental metric to quantify the pixel-wise overlap between the predicted segmentation and the ground-truth mask, reflecting the model’s ability to represent and fit the target shape. Meanwhile, Probability of Detection (P_d) and False Alarm Rate (F_a) are introduced as key localization-oriented metrics. P_d measures the effectiveness of target detection, while F_a evaluates the model’s resistance to background clutter and noise. Furthermore, Receiver Operating Characteristic (ROC) curves are plotted to illustrate the trade-off between P_d and F_a under different decision thresholds, enabling a more intuitive and objective comparison of detection performance across different methods.

IoU is a core shape evaluation metric for infrared small target segmentation. It quantifies the pixel-level overlap between the predicted segmentation region and the ground-truth annotation. This metric reflects the consistency between the predicted target shape and the true target shape. The IoU is calculated as follows:

IoU = \frac{A_{inter}}{A_{union}},

(1)

where A_inter and A_union denote the areas (in pixels) of the intersection and the union of the predicted target region and the ground-truth target region, respectively.
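As a minimal sketch of Equation (1), the function below computes IoU for two binary masks. The function name and the toy masks are illustrative, and returning 0 when both masks are empty is an assumed convention that the text does not specify.

```python
import numpy as np

def iou(pred_mask, gt_mask):
    """Pixel-level IoU between two binary masks of the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    a_inter = np.logical_and(pred, gt).sum()
    a_union = np.logical_or(pred, gt).sum()
    # Assumed convention: IoU is 0 when neither mask contains a target pixel.
    return float(a_inter) / float(a_union) if a_union > 0 else 0.0

# Toy example: a 2x2 ground-truth target and a prediction shifted by one column.
gt = np.zeros((5, 5), dtype=bool)
gt[1:3, 1:3] = True
pred = np.zeros((5, 5), dtype=bool)
pred[1:3, 2:4] = True
print(iou(pred, gt))  # intersection of 2 pixels over a union of 6 pixels
```

The example shows how sensitive IoU is for tiny targets: a one-pixel shift of a 2×2 target already reduces the overlap to one third.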

P_d is a core metric for evaluating localization accuracy in infrared small target detection. It is defined as the ratio of correctly detected target pixels to the total number of ground-truth target pixels. This metric quantifies the model’s ability to effectively identify infrared small targets. The P_d is calculated as follows:

P_{d} = \frac{T_{correct}}{T_{all}},

(2)

where T_correct and T_all denote the number of correctly predicted target pixels and all target pixels, respectively.
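Equation (2) can be sketched in the same way. The code follows the pixel-level wording used here; some public evaluation scripts instead count detections per target, so this is an illustration of the stated definition rather than a reference implementation. The function name and the handling of images without targets are assumptions.

```python
import numpy as np

def probability_of_detection(pred_mask, gt_mask):
    """P_d as the fraction of ground-truth target pixels that are predicted as target."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    t_correct = np.logical_and(pred, gt).sum()
    t_all = gt.sum()
    # Assumed convention: return 0 when the image contains no target pixels.
    return float(t_correct) / float(t_all) if t_all > 0 else 0.0

# Toy example: 3 of the 4 target pixels are recovered, plus one false alarm elsewhere.
gt = np.zeros((5, 5), dtype=bool)
gt[1:3, 1:3] = True
pred = gt.copy()
pred[2, 2] = False   # one missed target pixel
pred[4, 4] = True    # a false alarm, which does not affect P_d
print(probability_of_detection(pred, gt))  # 0.75
```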

F_a is a critical metric for measuring robustness against background clutter. It represents the proportion of background pixels incorrectly classified as target pixels over the entire image. The F_a is calculated as follows:

F_{a} = \frac{P_{false}}{P_{all}},

(3)

where P_false and P_all denote the number of background pixels falsely predicted as target and the total number of image pixels, respectively.
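A matching sketch of Equation (3) is given below. The function name and the toy image are illustrative. The final line rescales the result to the ×10⁻⁶ unit in which F_a is tabulated.

```python
import numpy as np

def false_alarm_rate(pred_mask, gt_mask):
    """F_a as the fraction of all image pixels that are false target predictions."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    p_false = np.logical_and(pred, ~gt).sum()  # predicted target, actually background
    p_all = gt.size                            # every pixel in the image
    return float(p_false) / float(p_all)

# Toy example: a 100x100 image with a 2x2 target and three isolated false alarms.
gt = np.zeros((100, 100), dtype=bool)
gt[10:12, 10:12] = True
pred = gt.copy()
pred[50, 50] = pred[60, 60] = pred[70, 70] = True
fa = false_alarm_rate(pred, gt)
print(f"{fa * 1e6:.1f}")  # 300.0, i.e. 300 x 10^-6
```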

4.3. Performance Comparison with Previous Methods

To comprehensively evaluate the detection performance of the proposed GCFLNet, extensive comparisons are conducted with representative infrared small target detection methods, including filter-based approaches (Top-Hat, Max–Median), low-rank-based methods (RIPT, IPI), local contrast-based methods (TLLCM, WSLCM), and recent deep learning-based approaches (UIUNet, DNANet, RDIAN, AGPCNet, RPCANet, and MSHNet). All comparative methods strictly follow the experimental settings and parameter configurations reported in their original papers to ensure fairness and reliability.

As shown in Table 1, GCFLNet achieves strong and well-balanced performance across the core metrics. In terms of IoU, GCFLNet attains 71.93% on the IRSTD-1K dataset, the highest value among the compared methods, ahead of DNANet (68.18%) and MSHNet (67.16%) and more than an order of magnitude above the traditional local contrast methods (3.311% for TLLCM and 3.452% for WSLCM). On the NUDT dataset, GCFLNet achieves an IoU of 86.49%, second only to RPCANet (88.53%), demonstrating its strong capability in accurate target shape modeling and contour preservation. For detection probability, GCFLNet achieves a P_d of 93.27% on IRSTD-1K, slightly lower than MSHNet (93.88%) but higher than all other traditional and deep learning methods. On the NUDT dataset, GCFLNet reaches a P_d of 98.60%, second only to UIUNet (98.83%), indicating its excellent target detection capability under complex background conditions. In terms of false alarm rate, GCFLNet achieves an F_a of 4.17 × 10⁻⁶ on NUDT, which is lower than that of AGPCNet (4.72 × 10⁻⁶), DNANet (6.21 × 10⁻⁶), and RDIAN (67.21 × 10⁻⁶) and is bettered only by UIUNet (1.23 × 10⁻⁶). On IRSTD-1K, GCFLNet also maintains a competitive F_a of 12.75 × 10⁻⁶, resulting in a more balanced overall detection behavior.

Unlike lightweight approaches that trade detection reliability for higher frame rates, GCFLNet leverages the efficient parallelism of its multi-branch architecture to achieve competitive inference speed (25.04 FPS with 5.86 M parameters) while simultaneously enhancing detection accuracy and reducing false alarms, resulting in a better balance between performance and efficiency. Although NUDT contains synthetic targets, the consistent performance gains across both synthetic and real datasets suggest that GCFLNet captures robust gradient-based structural cues rather than dataset-specific generation patterns.

To further analyze the trade-off between detection probability and false alarm rate, ROC curves of different methods are plotted on both the IRSTD-1K and NUDT datasets, as illustrated in Figure 5. The ROC curve of GCFLNet rapidly approaches the ideal upper-left region, and its AUC value remains highly competitive, confirming its advantage in achieving high detection accuracy with low false alarms.
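The construction of such a curve can be sketched as follows: a score map is binarized at a series of thresholds, and P_d and F_a are computed at each one using the pixel-level definitions of Equations (2) and (3). This is only an illustration. The function names, the threshold grid, and the toy scores are hypothetical, and the authors' exact ROC and AUC procedure is not specified in the text.

```python
import numpy as np

def roc_points(score_map, gt_mask, thresholds):
    """Return (F_a, P_d) arrays obtained by thresholding a score map."""
    score = np.asarray(score_map, dtype=np.float64)
    gt = np.asarray(gt_mask, dtype=bool)
    fa_list, pd_list = [], []
    for t in thresholds:
        pred = score >= t
        pd_list.append(np.logical_and(pred, gt).sum() / max(gt.sum(), 1))
        fa_list.append(np.logical_and(pred, ~gt).sum() / gt.size)
    return np.array(fa_list), np.array(pd_list)

def auc(fa, pd):
    """Area under the (F_a, P_d) curve by the trapezoidal rule."""
    order = np.lexsort((pd, fa))  # sort by F_a, break ties by P_d
    fa, pd = fa[order], pd[order]
    return float(np.sum(np.diff(fa) * (pd[1:] + pd[:-1]) / 2.0))

# Toy example: four pixels, two of which belong to the target.
scores = np.array([0.9, 0.8, 0.3, 0.1])
gt = np.array([1, 1, 0, 0], dtype=bool)
fa, pd = roc_points(scores, gt, thresholds=[0.0, 0.5, 1.0])
print(fa, pd, auc(fa, pd))
```

Because F_a is normalized by the total number of image pixels, the horizontal axis does not extend to 1, so the area is only comparable between methods evaluated over the same F_a range.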

Although GCFLNet achieves robust performance in most challenging scenarios, it still exhibits limitations under certain extreme conditions. Specifically, when infrared small targets are embedded in backgrounds with highly similar gradient and intensity distributions, the edge responses may become indistinct, leading to occasional missed detections. In addition, under extremely low signal-to-clutter ratios or in the presence of strong structural clutter (e.g., dense cloud boundaries or stripe-like noise), gradient enhancement may introduce weak false responses. Moreover, targets approaching single-pixel scale remain challenging due to the lack of sufficient spatial context. These cases indicate potential directions for future improvement, such as incorporating stronger cross-scale context modeling or temporal information.

Figure 6 presents qualitative detection results of different methods. Traditional approaches are constrained by handcrafted features and often fail on low-contrast targets or under heavy background clutter, leading to frequent missed detections and false alarms. In contrast, GCFLNet accurately localizes small targets and produces precise segmentation results even in challenging scenarios with tiny targets and complex backgrounds. This performance gain mainly stems from the collaborative effect of its internal modules. The EEM enhances target boundaries while suppressing background noise. The GLFI captures both global dependencies and local representations, preventing small target features from being diluted. The MSIC further enables adaptive fusion of high- and low-level features, improving detection accuracy and segmentation precision, and effectively reducing both missed detections and false alarms.

Overall, by jointly leveraging EEM, GLFI, and MSIC, GCFLNet significantly enhances feature representation and localization capability for infrared small targets, achieving superior detection accuracy and robustness under complex backgrounds.

4.4. Ablation Study

To investigate the individual contributions and synergistic effects of the core components in GCFLNet, ablation experiments are conducted on the NUDT dataset using U-Net as the baseline. By progressively incorporating EEM, GLFI, and MSIC into the baseline model, the impact of each module on detection performance is quantitatively analyzed. The experimental results are summarized in Table 2.

When only EEM is introduced, the model achieves notable improvements in IoU, P_d, and F_a compared with the baseline. This indicates that EEM effectively enhances the extraction of target boundary information while suppressing background clutter, thereby reducing both missed detections and false alarms. These results demonstrate that EEM plays a critical role in strengthening edge-aware representations and providing more discriminative features for subsequent processing.

When only GLFI is added, the improvement in IoU is relatively limited, whereas P_d and F_a exhibit clear optimization. This suggests that the primary strength of GLFI lies in capturing global–local relational features, which improves target localization accuracy and suppresses background-induced false alarms rather than directly refining target shape fitting. When EEM and GLFI are jointly integrated, the edge-enhanced features provided by EEM effectively complement the feature interaction process in GLFI, further reinforcing target representations and mitigating noise interference. As a result, all evaluation metrics are significantly improved, validating the strong complementarity and synergistic interaction between these two modules.

After incorporating MSIC into the baseline model, all metrics show consistent performance gains, with particularly notable improvements in P_d and F_a. This demonstrates that MSIC possesses strong multi-scale feature fusion capability, effectively compensating for spatial information loss caused by encoder down-sampling and enabling adaptive integration of high- and low-level features.

Further analysis of the pairwise module combinations provides deeper insight into their interactions. The integration of EEM and GLFI yields more evident overall improvements compared with the use of each module individually. The explicit boundary cues introduced by EEM offer reliable guidance for the global modeling process in GLFI, thereby facilitating more effective interaction between local structural details and global semantic information. The combination of EEM and MSIC preserves edge sensitivity while promoting cross-level information propagation, leading to more complete and discriminative feature representations. In contrast, the GLFI and MSIC configuration contributes more to structural coherence and semantic stability. However, its ability to suppress low-level background noise is relatively weaker than that of combinations involving EEM.

When all three modules are incorporated, the network achieves the best overall performance. Improvements are consistently observed across IoU, P_d and F_a compared with any single-module or dual-module configuration. These results indicate that the three components complement each other in terms of feature enhancement, dependency modeling, and multi-scale compensation.
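The gains discussed above can be recomputed directly from the values in Table 2. The short script below is only a bookkeeping aid. The dictionary layout and function name are illustrative, the numbers are copied from Table 2, and F_a is in units of 10⁻⁶.

```python
# Rows of Table 2: (EEM, GLFI, MSIC) -> (IoU %, P_d %, F_a x 1e-6) on NUDT.
TABLE2 = {
    (0, 0, 0): (76.49, 86.32, 18.24),
    (1, 0, 0): (85.43, 95.31, 12.44),
    (0, 1, 0): (81.24, 97.13, 5.84),
    (0, 0, 1): (83.26, 98.32, 5.26),
    (1, 1, 0): (85.93, 98.12, 4.95),
    (1, 0, 1): (86.10, 98.35, 4.60),
    (0, 1, 1): (84.61, 98.39, 4.75),
    (1, 1, 1): (86.49, 98.60, 4.17),
}

def gains_over_baseline(table):
    """Difference of every configuration from the plain U-Net baseline row."""
    base = table[(0, 0, 0)]
    return {
        cfg: tuple(round(v - b, 2) for v, b in zip(vals, base))
        for cfg, vals in table.items()
    }

gains = gains_over_baseline(TABLE2)
for cfg, (d_iou, d_pd, d_fa) in gains.items():
    print(cfg, f"IoU {d_iou:+.2f}", f"P_d {d_pd:+.2f}", f"F_a {d_fa:+.2f}")

# The full model is best on every metric: highest IoU and P_d, lowest F_a.
full = TABLE2[(1, 1, 1)]
assert full[0] == max(v[0] for v in TABLE2.values())
assert full[1] == max(v[1] for v in TABLE2.values())
assert full[2] == min(v[2] for v in TABLE2.values())
```

For the full configuration this gives +10.00 points of IoU, +12.28 points of P_d, and a reduction of 14.07 × 10⁻⁶ in F_a relative to the baseline.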

5. Conclusions and Discussion

This work addresses key challenges in infrared small target detection under complex backgrounds and proposes an end-to-end network, GCFLNet, with a collaborative multi-module architecture. The Edge Enhancement Module strengthens boundary cues and suppresses background noise, the Global–Local Feature Interaction Module captures local structures and long-range contextual dependencies, and the Multi-Scale Information Compensation Module alleviates spatial information loss caused by encoder down-sampling. Experimental results on public datasets demonstrate consistent improvements over many methods in terms of IoU, probability of detection, and false alarm rate, validating the effectiveness and complementarity of the proposed modules. Despite its strong performance, GCFLNet still faces challenges under extreme conditions, such as highly similar target–background gradients, extremely low signal-to-clutter ratios, and near single-pixel targets. In addition, the proposed method focuses on infrared small target detection under natural complex background interference and does not explicitly address deceptive or adversarial interference in other sensing modalities [29]. Addressing these limitations remains an important direction for future research.

Author Contributions

Conceptualization, Y.W. and H.Z.; methodology, Y.W.; software, H.Z.; validation, Y.W. and H.Z.; formal analysis, Y.W. and H.Z.; resources, H.Z.; data curation, Y.W.; writing—original draft preparation, Y.W.; writing—review and editing, X.Z. (Xinhao Zheng); visualization, Y.W.; supervision, Y.W.; project administration, Y.W.; funding acquisition, X.Z. (Xiangyue Zhang). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Open Foundation of the State Key Laboratory of Precision Space-Time Information Sensing Technology (No. STSL2025-B-04-01(L)).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used to support the results of this study are included in the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Gao, C.; Meng, D.; Yang, Y.; Wang, Y.; Zhou, X.; Hauptmann, A.G. Infrared patch-image model for small target detection in a single image. IEEE Trans. Image Process. 2013, 22, 4996–5009.
2. Dai, Y.; Wu, Y. Reweighted infrared patch-tensor model with both nonlocal and local priors for single-frame small target detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 3752–3767.
3. Zhang, L.; Peng, Z. Infrared small target detection based on partial sum of the tensor nuclear norm. Remote Sens. 2019, 11, 382.
4. Rivest, J.F.; Fortin, R. Detection of dim targets in digital infrared imagery by morphological image processing. Opt. Eng. 1996, 35, 1886–1893.
5. Deshpande, S.D.; Er, M.H.; Venkateswarlu, R.; Chan, P. Max-mean and max-median filters for detection of small targets. Signal Data Process. Small Targets 1999, 3809, 74–83.
6. Wang, X.; Peng, Z.; Kong, D.; He, Y. Infrared Dim and Small Target Detection Based on Stable Multi-subspace Learning in Heterogeneous Scene. IEEE Trans. Geosci. Remote Sens. 2017, 55, 5481–5493.
7. Chen, C.P.; Li, H.; Wei, Y.; Xia, T.; Tang, Y.Y. A local contrast method for small infrared target detection. IEEE Trans. Geosci. Remote Sens. 2013, 52, 574–581.
8. Han, J.; Moradi, S.; Faramarzi, I.; Zhang, H.; Zhao, Q.; Zhang, X.; Li, N. Infrared small target detection based on the weighted strengthened local contrast measure. IEEE Geosci. Remote Sens. Lett. 2020, 18, 1670–1674.
9. Han, J.; Moradi, S.; Faramarzi, I.; Liu, C.; Zhang, H.; Zhao, Q. A local contrast method for infrared small-target detection utilizing a tri-layer window. IEEE Geosci. Remote Sens. Lett. 2019, 17, 1822–1826.
10. Li, B.; Xiao, C.; Wang, L.; Wang, Y.; Lin, Z.; Li, M.; An, W.; Guo, Y. Dense nested attention network for infrared small target detection. IEEE Trans. Image Process. 2022, 32, 1745–1758.
11. Li, Y.; Kang, W.; Zhao, W.; Liu, X. MLEDNet: Multi-Directional Learnable Edge Information-Assisted Dense Nested Network for Infrared Small Target Detection. Electronics 2025, 14, 3547.
12. Sun, H.; Bai, J.; Yang, F.; Bai, X. Receptive-Field and Direction Induced Attention Network for Infrared Dim Small Target Detection With a Large-Scale Dataset IRDST. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5000513.
13. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Asymmetric Contextual Modulation for Infrared Small Target Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; IEEE/CVF: Piscataway, NJ, USA, 2021; pp. 949–958.
14. Hou, Q.; Zhang, L.; Tan, F.; Xi, Y.; Zheng, H.; Li, N. ISTDU-Net: Infrared Small-Target Detection U-Net. IEEE Geosci. Remote Sens. Lett. 2022, 19, 7506205.
15. Ma, T.; Yang, Z.; Liu, B.; Sun, S. A Lightweight Infrared Small Target Detection Network Based on Target Multiscale Context. IEEE Geosci. Remote Sens. Lett. 2023, 20, 7000305.
16. Hou, Q.; Wang, Z.; Tan, F.; Zhao, Y.; Zheng, H.; Zhang, W. RISTDnet: Robust Infrared Small Target Detection Network. IEEE Geosci. Remote Sens. Lett. 2022, 19, 7000805.
17. Kou, R.; Wang, C.; Yu, Y.; Peng, Z.; Yang, M.; Huang, F.; Fu, Q. LW-IRSTNet: Lightweight Infrared Small Target Segmentation Network and Application Deployment. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5621313.
18. Liu, Q.; Liu, R.; Zheng, B.; Wang, H.; Fu, Y. Infrared Small Target Detection with Scale and Location Sensitivity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 17490–17499.
19. Wu, S.; Xiao, C.; Wang, L.; Wang, Y.; Yang, J.; An, W. RepISD-Net: Learning Efficient Infrared Small-Target Detection Network via Structural Re-Parameterization. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5622712.
20. Zhang, T.; Li, L.; Cao, S.; Pu, T.; Peng, Z. Attention-Guided Pyramid Context Networks for Detecting Infrared Small Target Under Complex Background. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 4250–4261.
21. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Attentional local contrast networks for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9813–9824.
22. Zhang, T.; Li, L.; Zhang, Z.; Peng, Z. ISNet: Shape Matters for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9221–9233.
23. Wang, H.; Zhou, L.; Wang, L. Miss detection vs. false alarm: Adversarial learning for small object segmentation in infrared images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; IEEE/CVF: Piscataway, NJ, USA, 2019; pp. 8509–8518.
24. Wu, X.; Hong, D.; Chanussot, J. UIU-Net: U-Net in U-Net for infrared small object detection. IEEE Trans. Image Process. 2022, 32, 364–376.
25. Wu, F.; Zhang, T.; Li, L.; Huang, Y.; Peng, Z. RPCANet: Deep Unfolding RPCA Based Infrared Small Target Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 4797–4806.
26. Pan, P.; Wang, H.; Wang, C.; Nie, C. ABC: Attention with Bilinear Correlation for Infrared Small Target Detection. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 2381–2386.
27. Du, X.; Cheng, K.; Zhang, J.; Wang, Y.; Yang, F.; Zhou, W.; Lin, Y. Infrared Small Target Detection Algorithm Based on Improved Dense Nested U-Net Network. Sensors 2025, 25, 814.
28. Yu, Z.; Zhao, C.; Wang, Z.; Qin, Y.; Su, Z.; Li, X.; Zhou, F.; Zhao, G. Searching Central Difference Convolutional Networks for Face Anti-Spoofing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5294–5304.
29. Chang, S.; Tang, S.; Deng, Y.; Zhang, H.; Liu, D.; Wang, W. An Advanced Scheme for Deceptive Jammer Localization and Suppression in Elevation Multichannel SAR for Underdetermined Scenarios. IEEE Trans. Aerosp. Electron. Syst. 2025, early access.

Figure 1. Overall architecture of GCFLNet.

Figure 2. Structure of Edge Enhancement Module (EEM).

Figure 3. Structure of the Global–Local Feature Interaction Module (GLFI).

Figure 4. Structure of the Multi-Scale Information Compensation Module (MSIC).

Figure 5. ROC performance of different methods on the (a) NUDT and (b) IRSTD-1K datasets.

Figure 6. Visual examples of some representative methods. Accurately identified targets, false alarms, and missed detections are indicated by green, red, and yellow boxes, respectively. The red box shows the magnified target.

Table 1. Comparison of detection performance [IoU (%), P_d (%), and F_a (×10⁻⁶)] and model efficiency [number of parameters (M) and FPS] of different methods on the IRSTD-1K and NUDT datasets.

Method	Category	IRSTD-1K IoU (%)	IRSTD-1K P_d (%)	IRSTD-1K F_a (×10⁻⁶)	NUDT IoU (%)	NUDT P_d (%)	NUDT F_a (×10⁻⁶)	FPS (MATLAB 2016a)	FPS (Python 3.7)	Params (M)
Top-Hat	Filtering	12.05	75.52	1030.44	24.50	72.66	46.08	61.80	-	-
Max–Median	Filtering	6.998	65.21	59.73	4.197	58.41	36.89	22.34	-	-
RIPT	Low Rank	10.22	72.75	35.90	29.44	91.85	344.3	0.78	-	-
IPI	Low Rank	15.12	75.86	37.21	17.76	74.49	41.23	0.03	-	-
TLLCM	Local Contrast	3.311	77.39	6738	2.176	62.01	1608	0.08	-	-
WSLCM	Local Contrast	3.452	72.44	6619	2.283	56.82	1309	0.04	-	-
UIUNet	Deep Learning	61.82	87.24	19.00	86.00	98.83	1.23	-	7.89	50.54
DNANet	Deep Learning	68.18	91.83	6.15	83.94	98.52	6.21	-	17.89	4.70
RDIAN	Deep Learning	63.45	90.91	19.09	70.87	93.65	67.21	-	34.25	0.22
AGPCNet	Deep Learning	65.15	90.24	15.01	85.40	98.10	4.72	-	20.63	12.36
RPCANet	Deep Learning	62.80	88.32	11.54	88.53	96.26	25.84	-	15.70	0.68
MSHNet	Deep Learning	67.16	93.88	15.03	80.55	97.99	11.77	-	29.57	4.05
GCFLNet (Proposed)	Deep Learning	71.93	93.27	12.75	86.49	98.60	4.17	-	25.04	5.86

Table 2. Ablation study of the EEM, GLFI, and MSIC modules on the NUDT dataset.

EEM	GLFI	MSIC	IoU (%)	P_d (%)	F_a (×10⁻⁶)
×	×	×	76.49	86.32	18.24
√	×	×	85.43	95.31	12.44
×	√	×	81.24	97.13	5.84
×	×	√	83.26	98.32	5.26
√	√	×	85.93	98.12	4.95
√	×	√	86.10	98.35	4.60
×	√	√	84.61	98.39	4.75
√	√	√	86.49	98.60	4.17
