CG-IRNet: Structure–Confidence Hybrid Learning for Low-False-Alarm Infrared Small Target Detection

Zhu, Ziwen; Liao, Mengmeng

doi:10.3390/electronics15112405

by Ziwen Zhu ¹ and Mengmeng Liao ²,*

¹ Department of Mathematics, University of Florida, Gainesville, FL 32611, USA

² School of Mechatronic Engineering and Automation, Shanghai University, Shanghai 200444, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(11), 2405; https://doi.org/10.3390/electronics15112405

Submission received: 20 April 2026 / Revised: 26 May 2026 / Accepted: 31 May 2026 / Published: 1 June 2026

Abstract

Infrared small target detection (IRSTD) is a critical yet challenging task in target detection and computer vision. Its difficulty stems from the inherent characteristics of this class of targets: most datasets exhibit extreme class imbalance, weak target contrast, and heavy background clutter. To address these issues, this work proposes CG-IRNet, a structure-aware detection framework that integrates multi-scale feature aggregation with a Structure–Confidence Hybrid (SCH) loss. SCH loss combines an augmented, confidence-aware variant of Scale–Location Sensitive (SLS) loss with instance-wise structural supervision and a confidence-guided background suppression mechanism, all aimed at improving localization consistency while reducing false alarms. In addition, a frequency-aware feature refinement module is incorporated to strengthen small target saliency in highly cluttered scenes. Extensive experiments on three SIRST benchmark datasets, namely IRSTD-1K, NUAA-SIRST, and NUDT-SIRST, demonstrate a superior trade-off between detection probability (Pd) and false alarm rate (Fa). On IRSTD-1K, CG-IRNet achieves 65.09 mIoU and reduces the false alarm rate to 30.992 × 10⁻⁶, which is substantially lower than that of SCTransNet (55.74 × 10⁻⁶) at the same detection probability (93.27%). On NUAA-SIRST and NUDT-SIRST, the proposed method achieves 96.95% and 98.62% detection probability, respectively, while maintaining competitive or lower false alarm rates under challenging background conditions. These results support the effectiveness of the proposed confidence-guided suppression and structure-aware optimization. Ablation studies on model hyperparameters and qualitative analyses further confirm the joint improvements contributed by the proposed structural supervision and confidence-aware design, particularly in regimes where a low false alarm rate is the optimization goal.

Keywords:

infrared small target detection; structure-aware loss; confidence-guided suppression; multi-target supervision; false alarm reduction

1. Introduction

Infrared small target detection is a critical yet challenging technique in many computer vision applications, including long-range surveillance, early-warning systems (e.g., self-driving vehicles), and space-based remote sensing. In these settings, the target either occupies only a few pixels or must be detected before it emerges as a dominant visual feature. As a direct consequence, targets are submerged in complex, noisy, and cluttered backgrounds [1,2]. Compared with general target detection, IRSTD is distinguished by extreme class imbalance, low target contrast, and strong background interference, which makes effective false alarm suppression as critical as accurate localization for a practical system [3].

Although many recent deep learning methods have greatly improved detection accuracy, the false alarm rate remains an overlooked bottleneck. This hinders deployment in real-world scenarios, where the cost of a false alarm may equal or exceed the cost of a missed target, especially in safety-critical settings where even a small number of spurious detection reports can trigger severe consequences [3]. Earlier IRSTD methods rely mostly on handcrafted features, including contrast operators, top-hat filtering, and low-rank sparse decomposition [4,5,6]. Although computationally cheap, these methods lack robustness against complex background clutter and often yield sub-optimal results in realistic environments.

With the rise of deep learning, models based on convolutional neural networks (CNNs) have become the dominant paradigm for IRSTD, with representatives such as ACM [7], ALCNet [8], DNA-Net [9], and UIU-Net [10]. The common backbone of these models is a U-Net-style encoder–decoder structure [11] with multi-scale feature fusion and deep supervision, which helps preserve small target structures. However, CNN-based approaches are inherently limited by local receptive fields, which restricts their ability to capture long-range dependencies and background regularities that exist only at a global scale. As a result, background structures with similar local contrast are often misclassified as targets, leading to elevated false alarm rates [12].

Motivated by this issue, transformer-based and hybrid CNN–Transformer architectures have been introduced to the field. A representative example is SCTransNet [13], which uses spatial–channel cross-attention to capture global contextual relationships across multiple feature levels. Although these methods achieve impressive Pd and IoU, they do not explicitly address the false alarm problem; any reduction in false alarms is merely a byproduct of more accurate segmentation. In practice, many existing approaches implicitly optimize for segmentation accuracy without directly modeling the reliability of predictions or enforcing conservative behavior in ambiguous regions. As a result, a gap remains between high detection accuracy, as reflected by Pd and IoU, and the goal of a low false alarm rate, particularly under strict operational constraints.

Beyond model architecture, existing loss functions such as binary cross-entropy (BCE) loss and its variants focus mainly on pixel-level classification and therefore lack any modeling of global properties such as overall target structure, scale consistency, and multi-target interactions. Although newer loss functions such as SLS loss have been proposed to improve localization, they do not incorporate prediction confidence or multi-target compatibility, which are essential for distinguishing reliable detections from background-induced noisy activations.

Unlike existing IRSTD methods, which mainly focus on feature representation and segmentation overlap, CG-IRNet is designed around the requirement of a low false alarm rate. Compared with SCTransNet, which addresses global context modeling through spatial–channel cross-attention, the proposed work further introduces an explicit confidence branch that modulates final predictions by suppressing unreliable background activations. Compared with the original SLS loss, the proposed SCH loss extends the formulation to multi-target conditions and adds confidence-aware background regularization that makes use of the model's confidence branch. Compared with wavelet-based feature enhancement methods, CG-IRNet does not rely on frequency information alone but combines Haar wavelet domain refinement with confidence-gated prediction and structure-aware optimization. In short, the novelty of this method lies in the joint design of representation refinement, confidence calibration, and structure-aware, low-Fa supervision.

To address these issues, this work proposes CG-IRNet, a structure-aware and confidence-guided infrared small target detection framework designed to explicitly balance detection performance and false alarm suppression. The contribution of this work is fourfold. First, CG-IRNet introduces a confidence-guided detection framework that shifts the optimization focus from segmentation accuracy alone to reliable low-false-alarm prediction. Second, the proposed Structure–Confidence Hybrid (SCH) loss extends the SLS formulation by combining pixel-wise BCE supervision, instance-wise multi-target structural supervision, and confidence-aware background regularization. Third, a confidence-aware prediction head is designed to decouple target localization from prediction reliability, allowing uncertain background responses to be softly suppressed during inference. Fourth, a Frequency-Aware Feature Refinement (FAFR) module combines local contrast enhancement with Haar wavelet domain convolution to strengthen target–background discrimination under cluttered scenes. Extensive experiments on IRSTD-1K, NUAA-SIRST, and NUDT-SIRST further validate the proposed framework through quantitative comparison, ROC analysis, ablation, computational complexity analysis, α sensitivity analysis, and failure case analysis.

2. Related Work

2.1. Infrared Small Target Detection

Traditional IRSTD methods focus on enhancing local contrast between targets and backgrounds using handcrafted operators, such as top-hat filtering, weighted local contrast measures, and patch-based statistical modeling [4,5,6]. These methods are computationally efficient and interpretable but degrade rapidly in the presence of structured clutter or non-uniform noise, which are common in real scenarios.

Deep-learning-based IRSTD methods have become mainstream in recent years. CNN-based models typically employ encoder–decoder structures with residual blocks and deep supervision to preserve small target details [7,8,9,10]. Representative work includes ACM [7], ALCNet [8], DNA-Net [9], and UIU-Net [10]. These models introduce multiple techniques, including attention mechanisms, multi-scale fusion, or nested U-Net designs to improve detection robustness. Despite their improvements, these models often rely on local receptive fields and lack explicit design to enforce a continuous global background, which often leads to elevated false alarm rates in complex scenes [12]. A recent comprehensive review further summarizes that balancing detection sensitivity and false alarm suppression remains a core challenge in modern IRSTD systems [1].

Transformer-based approaches have been explored recently to address long-range dependency modeling. SCTransNet [13] introduces spatial–channel cross-attention to enable full-level semantic interaction across encoder features, significantly improving contextual awareness. Recent work further extends this direction by integrating hierarchical transformer backbones and multi-scale feature aggregation strategies to enhance weak target representation against complex backgrounds [14]. However, although these methods improve IoU, their false alarm rates remain high, which may not align with safety-oriented deployment requirements [3].

More recently, foundation model and multimodal approaches have begun to emerge in IRSTD. MIRSAM [15] adapts SAM (Segment Anything Model) to infrared small target detection by combining a contourlet denoising adapter with CLIP-based text prompts, showing that language guidance can provide useful contextual cues such as target location, shape, and background information. However, these methods often require additional text–image paired data, prompt construction, or foundation model adaptation. In contrast, CG-IRNet remains an image-only supervised framework that focuses on reducing false alarms through confidence-aware prediction, structure-guided loss design, and frequency-aware feature refinement.

2.2. Local Contrast Enhancement and Feature Refinement

Local contrast enhancement has long been recognized as a core principle in IRSTD [6,7]. Specifically, LCAE-Net [8] formalized this idea within a deep learning framework by introducing a local contrast attention mechanism, which uses fixed directional operators to generate contrast-aware spatial weights. This approach embeds classical IR priors, such as targets appearing as brighter spots, backgrounds being smoother than targets, and sharp local contrast, into CNNs and has been shown to improve robustness against background clutter. The proposed CG-IRNet adopts this mechanism as a principled front-end feature enhancement stage and further extends it to intermediate feature refinement.

In addition to feature refinement with classical CNNs, frequency-domain analysis, particularly wavelet transforms, has been widely used in image restoration and IR target enhancement. Wavelet-integrated CNNs demonstrated that discrete wavelet transforms can be seamlessly integrated into deep networks to separate low-frequency background structures from high-frequency target responses [16]. Recent IRSTD studies further confirm that wavelet-based feature processing improves small target saliency [17]. More recent work explores edge-aware and multi-scale feature enhancement strategies to compensate for information loss during deep feature propagation and to improve robustness under cluttered backgrounds [18]. CG-IRNet leverages this insight by embedding depth-wise convolution directly in the wavelet domain, enabling frequency-aware feature modulation while maintaining spatial resolution.

2.3. Loss Functions and Confidence-Aware Learning

Binary cross-entropy (BCE) loss and its logit-based variants remain the standard supervision signals for IRSTD due to their simplicity and stability under class imbalance [19]. However, BCE alone is insufficient to encode geometric and scale-related constraints. Scale–Location Sensitive (SLS) loss was introduced to explicitly model target scale consistency and localization accuracy, achieving improved robustness over BCE in multi-scale scenarios [20].

Recent studies in uncertainty modeling and confidence learning suggest that explicitly estimating prediction confidence can improve reliability and reduce false positives [21,22]. More recent approaches further emphasize the importance of global–local feature interaction and multi-scale compensation mechanisms to enhance detection robustness while suppressing background interference [23]. In parallel, emerging architectures introduce hybrid perception encoders and non-local aggregation strategies to better capture long-range dependencies and reduce structured clutter responses [24]. Transferring these ideas into the current architecture, the proposed CG-IRNet uses a new Structure–Confidence Hybrid (SCH) loss, which combines BCE loss with an augmented SLS formulation that adds multi-target compatibility and confidence-aware penalties. These penalties encourage the network to suppress overconfident background activations (by learning them as targets with low confidence) while preserving confident target responses. Importantly, the augmentation retains the original SLS scale logic, ensuring theoretical consistency with prior work, while improving compatibility with multi-target scenarios and guiding the network toward more conservative activations.

3. Methodology

3.1. Overall Architecture of CG-IRNet

The proposed CG-IRNet shown in Figure 1 is an end-to-end infrared small target detection framework designed around the following three complementary principles: contrast-guided feature enhancement, frequency-aware representation learning, and confidence-aware prediction augmentation. Rather than optimizing solely for pixel-wise segmentation accuracy, the proposed architecture explicitly emphasizes conservative decision making to reduce false alarms while preserving detection probability.

The main network adopts a multi-scale encoder–decoder structure with global semantic interaction across feature levels. Let X ∈ ℝ^{H×W} denote the input infrared image; the encoder–decoder backbone then produces multi-scale feature maps:

E_{k} = ε_{k} (X), k = 1, \dots, 4

(1)

O_{k} = D_{k} (E_{1}, \dots, E_{4})

(2)

where ε_k denotes the k-th encoder stage, D_k the k-th decoder stage, and O_k the output of that decoder stage.

The structure follows recent trends in transformer-enhanced IRSTD models, where long-range dependencies are critical for distinguishing true targets from structured background clutter. Within this general backbone, CG-IRNet integrates contrast and frequency priors directly into the feature extraction pipeline and decouples localization from confidence estimation at the prediction stage.

3.2. Multi-Scale Encoder–Decoder Backbone

The backbone of CG-IRNet follows the hierarchical encoder–decoder paradigm used in SCTransNet [13], in which features are progressively abstracted and then refined through multi-scale fusion. Each encoder stage consists of residual convolutional blocks that preserve fine-grained spatial information while enabling deeper semantic representation. Corresponding decoder stages recover spatial resolution through up-sampling and feature fusion, ensuring that small target structures are not lost during down-sampling.

Skip connections are implemented between the encoder and decoder stages to maintain spatial continuity. This is essential for infrared targets that often occupy only a few pixels. This backbone provides a stable foundation for subsequent contrast- and frequency-aware modification and refinement.

3.3. Cross-Scale Semantic Interaction

To enhance global context modeling, CG-IRNet incorporates cross-scale semantic interaction modules. By projecting features from different encoder stages into a unified latent space and applying channel-wise attention to capture long-range dependencies [25] and inter-scale correlations, these modules enable information exchange across multiple feature resolutions.

By allowing semantic information from deeper layers to guide shallow representations and vice versa, the network thus gains improved discrimination between true targets and background structures that exhibit similar local contrast but differ in global consistency.

3.4. Frequency-Aware Feature Refinement (FAFR)

A central component of CG-IRNet is the integration of contrast-guided feature enhancement with wavelet domain frequency analysis.

Local contrast enhancement is applied to emphasize intensity discontinuities characteristic of infrared small targets. Fixed directional operators are used to compute contrast responses, which are then transformed into attention maps that modulate feature activations. This design embeds classical IR detection priors into the network without introducing additional trainable parameters. Mathematically, let F be an intermediate feature map; the directional contrast responses are then computed as

C_{d} (F) = F - 𝒩_{d} (F), d \in \{1, \dots, D\}

(3)

where 𝒩_d(·) denotes the directional neighborhood smoothing operator.

The contrast attention map is then

A (F) = σ (\sum_{d = 1}^{D} w_{d} C_{d} (F))

(4)

where C_d(F) = contrast response along direction d; w_d = learnable scalar weight assigned to that directional response.

The refined feature F′ is then

F^{'} = F ⨀ A (F)

(5)

where σ = sigmoid; ⨀ = element-wise multiplication.
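As an illustration of Equations (3)–(5), the following NumPy sketch computes a contrast attention map for a single-channel feature map. The four direction offsets in `OFFSETS` and the choice of 𝒩_d as the mean of the two opposite neighbors along direction d are assumptions made for this example; the text does not specify the operators actually used in CG-IRNet.

```python
import numpy as np

# Hypothetical direction set (dy, dx); the operators used in the paper are not listed.
OFFSETS = [(0, 1), (1, 0), (1, 1), (1, -1)]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def directional_smooth(F, dy, dx):
    """Assumed N_d: mean of the two neighbors at +(dy, dx) and -(dy, dx), edge-padded."""
    H, W = F.shape
    P = np.pad(F, 1, mode="edge")
    fwd = P[1 + dy:1 + dy + H, 1 + dx:1 + dx + W]
    bwd = P[1 - dy:1 - dy + H, 1 - dx:1 - dx + W]
    return 0.5 * (fwd + bwd)

def contrast_attention(F, w):
    """Eqs. (3)-(5): C_d = F - N_d(F), A = sigmoid(sum_d w_d C_d), F' = F * A."""
    C = [F - directional_smooth(F, dy, dx) for dy, dx in OFFSETS]
    A = sigmoid(sum(wd * Cd for wd, Cd in zip(w, C)))
    return F * A, A

# Usage: an isolated bright pixel receives attention close to 1,
# while flat background stays at the neutral value sigmoid(0) = 0.5.
F = np.zeros((7, 7))
F[3, 3] = 1.0
F_refined, A = contrast_attention(F, np.ones(len(OFFSETS)))
```

Under these assumptions, a point-like target produces a positive contrast response in every direction, so its attention value saturates, whereas smooth regions produce zero contrast and are scaled by 0.5.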

To further improve background suppression, CG-IRNet performs feature refinement in the wavelet domain. Feature maps are decomposed into frequency sub-bands using a discrete wavelet transform, followed by depth-wise convolution applied independently to each sub-band. Learnable scaling factors allow the network to adaptively weight low- and high-frequency components before reconstructing the spatial representation via inverse wavelet transform.

In this work, the discrete wavelet transform is implemented using an orthonormal Haar wavelet. Given an input feature map, the Haar transform decomposes it into four sub-bands, denoted LL, LH, HL, and HH, corresponding to the low-frequency approximation, horizontal detail, vertical detail, and diagonal detail components, respectively. The Haar wavelet is selected because it is lightweight and parameter-free. In addition, it is well suited to separating smooth background components from the sharp local responses produced by small infrared targets.

Mathematically, the discrete wavelet transform is

\{F_{LL}, F_{LH}, F_{HL}, F_{HH}\} = W (F^{'})

(6)

with each sub-band processed independently using depth-wise convolution:

{\tilde{F}}_{b} = γ_{b} \cdot DiscreteWaveletConv (F_{b}), b \in \{LL, LH, HL, HH\}

(7)

where γ_b are learnable scaling factors. Then, the inverse transform is defined as

F^{''} = W^{-1} ({\tilde{F}}_{LL}, {\tilde{F}}_{LH}, {\tilde{F}}_{HL}, {\tilde{F}}_{HH})

(8)

which allows adaptive frequency-aware feature modulation.

This joint contrast–frequency refinement enables CG-IRNet to suppress background clutter more effectively than purely spatial convolutions while preserving fine target details.

The learnable scaling factor γ_b is initialized to 1.0 for every wavelet sub-band. This initialization makes the wavelet branch start from a neutral band-weighting state, so no frequency component is manually suppressed at the beginning of training. During optimization, γ_b is updated jointly with the rest of the network and learns to adjust the relative contribution of each band. As shown in Table 1, in the final IRSTD-1K checkpoint, the learned mean values of γ_b for the active FAFR module are 1.1120, 0.9454, 0.9438, and 0.8895 for LL, LH, HL, and HH respectively. This indicates that the model slightly emphasizes the low-frequency structural component while moderately suppressing high-frequency detail bands, which is consistent with the goal of reducing clutter-induced false responses while preserving target saliency.
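The wavelet-domain refinement in Equations (6)–(8) can be sketched in NumPy as follows. This is a minimal single-channel illustration, not the authors' implementation: the per-band depth-wise convolution is replaced by a placeholder `band_op` (identity by default), the function names are hypothetical, and the assignment of the LH and HL labels follows one common convention that may differ from the one used in CG-IRNet.

```python
import numpy as np

BANDS = ("LL", "LH", "HL", "HH")

def haar_dwt2(F):
    """Orthonormal 2D Haar transform of an (H, W) map with even H and W."""
    a, b = F[0::2, 0::2], F[0::2, 1::2]
    c, d = F[1::2, 0::2], F[1::2, 1::2]
    return {
        "LL": (a + b + c + d) / 2.0,  # low-frequency approximation
        "LH": (a - b + c - d) / 2.0,  # difference across columns
        "HL": (a + b - c - d) / 2.0,  # difference across rows
        "HH": (a - b - c + d) / 2.0,  # diagonal detail
    }

def haar_idwt2(bands):
    """Inverse of haar_dwt2."""
    LL, LH, HL, HH = (bands[k] for k in BANDS)
    h, w = LL.shape
    F = np.empty((2 * h, 2 * w), dtype=LL.dtype)
    F[0::2, 0::2] = (LL + LH + HL + HH) / 2.0
    F[0::2, 1::2] = (LL - LH + HL - HH) / 2.0
    F[1::2, 0::2] = (LL + LH - HL - HH) / 2.0
    F[1::2, 1::2] = (LL - LH - HL + HH) / 2.0
    return F

def refine(F, gamma, band_op=lambda x: x):
    """Eqs. (6)-(8): scale each processed sub-band by gamma_b, then invert.

    band_op stands in for the per-band depth-wise convolution (assumption)."""
    bands = haar_dwt2(F)
    scaled = {b: gamma[b] * band_op(bands[b]) for b in BANDS}
    return haar_idwt2(scaled)

# Usage: with gamma_b = 1 and an identity band_op the module is neutral.
F = np.random.default_rng(0).normal(size=(8, 8))
neutral = refine(F, {b: 1.0 for b in BANDS})
```

With γ_b = 1 for every band and an identity `band_op`, the output equals the input, which matches the neutral initialization described above; values of γ_b below 1 on the detail bands attenuate high-frequency content before reconstruction.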

3.5. Confidence-Aware Prediction Head

Unlike most existing IRSTD models that rely only on pixel-wise logits, CG-IRNet introduces an additional, separate prediction pathway for confidence-related features. In addition to the segmentation map, the network produces a confidence map through this head that estimates the reliability of predicted target responses. Let z denote the segmentation logits and z_c the confidence logits:

p = σ (z), q = σ (z_{c})

(9)

A confidence gate is then constructed as

g = α + (1 - α) q

(10)

The final prediction then becomes

p_{final} = p ⨀ g

(11)

and the restored final logits are

z_{final} = \log \frac{p_{final}}{1 - p_{final}}

(12)

The final binary mask is generated by

\hat{Y} = I (p_{final} > 0.5)

(13)

This decoupling allows the model to distinguish between confident detections (activations with high confidence in the mask) and ambiguous activations (activations with low confidence in the mask) that may arise from background noise. During training, the confidence map enters the confidence term of the augmented SLS loss to penalize overconfident false positives. During inference, the confidence map is not used as an additional manually tuned threshold; instead, it performs soft probability modulation before the final fixed threshold is applied. In short, inference follows the sequence z, z_c → p, q → g → p_final → Ŷ, where only the final step applies the fixed threshold of 0.5.
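The inference sequence of Equations (9)–(13) can be sketched in NumPy as follows. The default value `alpha=0.5` and the clipping constant `eps` are placeholders chosen for this example and are not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_prediction(z, z_c, alpha=0.5, eps=1e-6):
    """Confidence-gated inference following Eqs. (9)-(13).

    alpha and eps are illustrative values, not the paper's settings."""
    p = sigmoid(z)                     # segmentation probability, Eq. (9)
    q = sigmoid(z_c)                   # confidence, Eq. (9)
    g = alpha + (1.0 - alpha) * q      # confidence gate in [alpha, 1], Eq. (10)
    p_final = p * g                    # gated probability, Eq. (11)
    pc = np.clip(p_final, eps, 1.0 - eps)
    z_final = np.log(pc / (1.0 - pc))  # restored logits, Eq. (12)
    mask = p_final > 0.5               # fixed threshold, Eq. (13)
    return p_final, z_final, mask

# Usage: two pixels with identical segmentation logits but opposite confidence.
z = np.array([[4.0, 4.0]])
z_c = np.array([[4.0, -4.0]])
p_final, z_final, mask = gated_prediction(z, z_c, alpha=0.2)
```

In this example both pixels have the same segmentation probability, but only the high-confidence pixel survives the 0.5 threshold. Because g never drops below α, setting α = 1 disables the gate entirely, while smaller values allow stronger suppression.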

3.6. Training Strategy and Loss Design

CG-IRNet is trained using a staged optimization strategy to accommodate different learning objectives at different phases of training.

In the initial phase, training is dominated by weighted binary cross-entropy loss to establish stable foreground–background separation under extreme class imbalance. Once convergence is achieved, training transitions to a structure-aware optimization phase based on a scale- and location-sensitive loss formulation. For this phase, we introduce the Structure–Confidence Hybrid (SCH) loss, which augments SLS with BCE supervision, multi-target support, and confidence-aware regularization, as shown in Figure 2, encouraging precise localization while discouraging spurious detections. Mathematically, let y ∈ {0,1}^{H×W} denote the ground-truth mask and p ∈ (0,1)^{H×W} denote the predicted probability map. The weighted binary cross-entropy (BCE) loss is defined as

L_{BCE} = - \frac{1}{N} \sum_{i = 1}^{N} [β y_{i} \log (p_{i}) + (1 - y_{i}) \log (1 - p_{i})]

(14)

where β = \min (\frac{N_{neg}}{N_{pos}}, β_{max}) is the positive-class weight and N is the total number of pixels.
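A minimal NumPy sketch of Equation (14) is given below. For brevity, β is estimated here from a single mask, whereas Section 4.2 states that it is estimated from the training masks; the default `beta_max=5.0` follows the upper bound reported in Section 4.2, and the clipping constant `eps` is an assumption added for numerical safety.

```python
import numpy as np

def positive_weight(y, beta_max=5.0):
    """beta = min(N_neg / N_pos, beta_max), computed here from one mask (assumption)."""
    n_pos = float(y.sum())
    n_neg = float(y.size) - n_pos
    return min(n_neg / max(n_pos, 1.0), beta_max)

def weighted_bce(p, y, beta, eps=1e-7):
    """Eq. (14): BCE with the positive term scaled by beta."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(beta * y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

# Usage: one target pixel in a 10 x 10 image gives N_neg / N_pos = 99, clipped to 5.
y = np.zeros((10, 10))
y[0, 0] = 1.0
beta = positive_weight(y)
loss = weighted_bce(np.full_like(y, 0.5), y, beta)
```

The clipping step illustrates why the bound matters: without it, a single-pixel target would be weighted 99 times more heavily than a background pixel in this example.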

The SLS loss [20] can be abstracted as

L_{SLS} = \frac{1}{K} \sum_{k = 1}^{K} [L_{scale} ({\hat{τ}}_{k}, τ_{k}) + L_{location} ({\hat{τ}}_{k}, τ_{k})]

(15)

where τ_k denotes the k-th ground-truth target instance and {\hat{τ}}_{k} is its corresponding predicted region.

To make SLS loss compatible with multiple targets and with CG-IRNet’s confidence head, the original loss function is augmented as follows.

For the multi-target part, let the ground-truth mask contain K connected components:

y = ⋃_{k = 1}^{K} y^{(k)}

(16)

where y^(k) denotes the k-th target instance. Instead of computing structure consistency globally, we compute instance-wise loss as

L_{SLS} = \frac{1}{K} \sum_{k = 1}^{K} L_{struct} (p_{final}, y^{(k)})

(17)

This ensures that each target is supervised independently, that large targets do not dominate the loss, and that the formulation remains compatible with multi-target images.

Then, to utilize the confidence head from CG-IRNet, a confidence term is added to SLS loss to penalize high confidence assigned to background pixels:

L_{conf} = \frac{1}{N} \sum_{i = 1}^{N} (1 - y_{i}) q_{i}

(18)

As shown in the formula, this encourages q_i → 0 when y_i = 0, which enforces conservative background predictions.

As a result, the augmented SLS loss is obtained by combining these terms:

L_{SLS}^{aug} = L_{SLS} + λ_{conf} L_{conf}

(19)

To jointly optimize pixel-level discrimination, structural consistency, and confidence-aware suppression, a unified loss function termed Structure–Confidence Hybrid (SCH) loss is introduced. SCH loss combines the binary cross-entropy (BCE) loss with the proposed augmented structure-aware loss:

L_{SCH} = λ_{BCE} L_{BCE} + λ_{SLS} L_{SLS}^{aug}

(20)

Unlike the original SLS formulation [20], this proposed Structure–Confidence Hybrid (SCH) loss explicitly incorporates instance-wise supervision and confidence-gated modulation, enabling multi-target compatibility while promoting conservative background behavior. For simplicity, the formulation is referred to as SCH loss for the remainder of the paper.
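The way Equations (18)–(20) combine can be sketched as follows. The BCE and instance-wise SLS terms are passed in as precomputed scalars because their internals are defined elsewhere; the defaults `lam_bce=0.2` and `lam_sls=0.8` follow the Phase II weights reported in Section 4.3, while `lam_conf=0.1` is a placeholder, since the value of λ_conf is not stated here.

```python
import numpy as np

def confidence_penalty(q, y):
    """Eq. (18): mean confidence assigned to background pixels."""
    return float(np.mean((1.0 - y) * q))

def sch_loss(l_bce, l_sls, q, y, lam_bce=0.2, lam_sls=0.8, lam_conf=0.1):
    """Eqs. (19)-(20). lam_conf is an illustrative placeholder value."""
    l_aug = l_sls + lam_conf * confidence_penalty(q, y)  # Eq. (19)
    return lam_bce * l_bce + lam_sls * l_aug             # Eq. (20)

# Usage: one target pixel; the background confidences sum to 0.6 over 4 pixels.
y = np.array([[1.0, 0.0], [0.0, 0.0]])
q = np.array([[0.9, 0.2], [0.4, 0.0]])
total = sch_loss(l_bce=1.0, l_sls=0.5, q=q, y=y, lam_conf=1.0)
```

Note that confidence on the target pixel (0.9 here) does not contribute to the penalty; only confidence placed on background pixels is discouraged.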

By separating coarse discrimination from fine localization and confidence calibration, the proposed training strategy aligns the optimization process with the goal of reliable, low-false-alarm infrared target detection.

4. Experimental Setup

4.1. Datasets

Evaluation of CG-IRNet was performed on the SIRST-V2 benchmark [26], which is a widely used series of datasets for single-frame infrared small target detection. It includes three major datasets:

IRSTD-1K: A relatively large-scale infrared small target dataset that contains diverse background scenes with a variety of clutter complexity and target intensities.

NUAA-SIRST: A widely used infrared small target dataset characterized by small average target sizes and noisy background conditions.

NUDT-SIRST: An infrared small target dataset containing complex background textures and diverse target appearances.

All datasets provide pixel-level ground-truth annotation labels. During training and evaluation, images are processed independently without temporal information.

4.2. Implementation Details

All experiments were implemented in PyTorch 2.3.0+rocm6.2.3. The network is trained end to end, with mixed-precision acceleration enabled when available.

The network adopts an encoder–decoder backbone with four resolution stages, where residual convolutional blocks are employed at each stage to preserve spatial information while enabling deeper feature representation. Frequency-aware feature refinement (FAFR), which incorporates wavelet domain processing and local contrast enhancement modules, is enabled by default to enhance target saliency under cluttered backgrounds. Also, deep supervision heads are applied during training to facilitate optimization but are removed during inference to reduce computational overhead.

For optimization, the Adam optimizer is used with a cosine annealing learning rate schedule, including a warm-up phase during the initial Logit BCE training stage. The model is trained with a batch size of 16 for a total of 1000 epochs.

To stabilize training under extreme class imbalance, the positive class weight in the binary cross-entropy loss is automatically estimated from the training masks and clipped to an upper bound of 5 to prevent numerical instability.

4.3. Training Strategy

CG-IRNet is trained using a two-phase optimization strategy:

Phase I: BCE-dominant training

In the first phase, the network is trained primarily using weighted binary cross-entropy loss. This phase focuses on establishing reliable foreground–background separation and stabilizing early feature learning.

Phase II: Structure–Confidence Hybrid loss fine-tuning

In the second phase, training transitions to a structure-aware optimization regime dominated by Structure–Confidence Hybrid (SCH) loss (with SLS component weight 0.8 and BCE weight 0.2) to maintain training stability. This phase emphasizes precise localization and suppresses isolated false activations. The optimizer state is reset at the phase transition to avoid momentum mismatch, and training is monitored using metrics that explicitly penalize false alarms. Model selection during training is based on a composite evaluation criterion rather than a single accuracy metric.

4.4. Evaluation Protocol

Thresholding: Prediction maps are binarized using a fixed threshold of 0.5. This avoids tuning thresholds on the test set and ensures fair comparison across models.

Object-level evaluation: Connected components are extracted from the binary prediction map. Each predicted component is treated as a detected target, and its centroid is compared with ground-truth centroids for object-level evaluation.

4.5. Evaluation Metrics

Model performance is evaluated using both pixel-level and object-level metrics from the binary prediction masks, which are obtained by applying a fixed threshold to the model output. Also, no metric-specific threshold tuning is performed on the test set.

Let P ∈ {0,1}^H×W be the predicted binary mask and G ∈ {0,1}^H×W the corresponding ground-truth mask.

4.5.1. Pixel-Level Metrics

Intersection-over-Union (IoU)

The IoU score is obtained using global pixel accumulation across the entire test set. Let TP, FP, and FN be the total number of true positive, false positive, and false negative pixels aggregated over all test images. Then, the IoU is defined as

IoU = \frac{TP}{TP + FP + FN}

(21)

This metric corresponds to a micro-averaged IoU, which reflects overall pixel-level overlap performance.

Normalized IoU (nIoU)

Then, to reduce bias toward images with larger targets, we additionally report the normalized IoU (nIoU). This metric is calculated as the mean IoU over images:

nIoU = \frac{1}{N} \sum_{i = 1}^{N} \frac{TP_{i}}{TP_{i} + FP_{i} + FN_{i}}

(22)

where TP_i, FP_i, and FN_i are pixel counts for the i-th image.

For images where both the prediction and the ground truth are empty (TP_i + FP_i + FN_i = 0), the IoU is set to 1.0 for numerical stability and for consistency with IRSTD evaluation protocols.
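The difference between the globally accumulated IoU of Equation (21) and the per-image nIoU of Equation (22) can be seen in the following NumPy sketch. The function names are hypothetical, and the fallback value for a completely empty test set is an assumption.

```python
import numpy as np

def pixel_counts(P, G):
    """Return (TP, FP, FN) pixel counts for one binary prediction/ground-truth pair."""
    P, G = P.astype(bool), G.astype(bool)
    return int((P & G).sum()), int((P & ~G).sum()), int((~P & G).sum())

def iou_and_niou(preds, gts):
    """Eq. (21): IoU from counts accumulated over all images.
    Eq. (22): nIoU as the mean of per-image IoU, with empty pairs scored 1.0."""
    totals = np.zeros(3)
    per_image = []
    for P, G in zip(preds, gts):
        tp, fp, fn = pixel_counts(P, G)
        totals += (tp, fp, fn)
        denom = tp + fp + fn
        per_image.append(1.0 if denom == 0 else tp / denom)
    # Fallback of 1.0 for an entirely empty test set is an assumption.
    iou = totals[0] / totals.sum() if totals.sum() > 0 else 1.0
    return float(iou), float(np.mean(per_image))

# Usage: one half-detected target plus one empty image.
G1 = np.zeros((4, 4), int); G1[0, :] = 1
P1 = np.zeros((4, 4), int); P1[0, :2] = 1
E = np.zeros((4, 4), int)
iou, niou = iou_and_niou([P1, E], [G1, E])
```

In this example the accumulated IoU is 2/4 = 0.5, whereas nIoU averages 0.5 and 1.0 to give 0.75, which shows how empty images raise nIoU but leave the accumulated IoU unchanged.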

4.5.2. Object-Level Metrics

Probability of Detection (Pd)

Pd measures object-level detection performance [3]. Connected components are extracted from both the prediction and ground-truth masks using 8-connectivity. Each component is represented by its centroid.

In this work, 8-connectivity is chosen rather than 4-connectivity due to the size of infrared small targets, which often occupy only a few pixels and may contain diagonal adjacency after thresholding. Under 4-connectivity, a compact target with diagonal pixel contact can be artificially split into multiple components, which would distort object-level Pd and false alarm counting. In contrast, 8-connectivity better preserves the spatial integrity of small target blobs and is more consistent with the centroid-based matching protocol used in this work.
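The splitting effect described above can be reproduced with a small sketch. The flood-fill routine below is a generic illustration written for this example (it is not the authors' implementation), and the three-pixel diagonal target is a hypothetical input.

```python
from collections import deque

# Neighbor offsets (row, col) for the two connectivity rules.
N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8 = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def count_components(mask, neighbors):
    """Count connected foreground components in a binary mask (nested lists)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                for dy, dx in neighbors:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


# A three-pixel target whose pixels touch only diagonally.
diagonal_target = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]
components_4 = count_components(diagonal_target, N4)  # split into 3 components
components_8 = count_components(diagonal_target, N8)  # kept as 1 component
```

Under 4-connectivity the single target would be counted as three objects, so two of them would be unmatched predictions; under 8-connectivity it remains one blob with one centroid.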

Greedy one-to-one matching is carried out between predicted and ground-truth centroids. A predicted centroid counts as a correct detection if its Euclidean distance to a ground-truth centroid is less than a fixed tolerance of τ = 3 pixels. Each predicted component can match at most one ground-truth component.

Pd is defined as

Pd = \frac{N_{matched}}{N_{GT}}

(23)

where N_matched is the count of matched ground-truth targets and N_GT is the total number of ground-truth targets.
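The object-level procedure can be sketched as follows. This is a minimal illustration under stated assumptions: masks are nested lists of 0/1 values, and the greedy order (each ground-truth centroid, in scan order, takes the nearest unused predicted centroid within τ) is an assumption, since the text does not specify the matching order. All function names are hypothetical.

```python
from collections import deque
from math import hypot


def centroids(mask):
    """Centroids (row, col) of the 8-connected components of a binary mask."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    result = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            n = len(pixels)
            result.append((sum(p[0] for p in pixels) / n,
                           sum(p[1] for p in pixels) / n))
    return result


def match_targets(pred_mask, gt_mask, tau=3.0):
    """Return (n_matched, n_gt) for one image using greedy one-to-one matching."""
    preds, gts = centroids(pred_mask), centroids(gt_mask)
    used, matched = set(), 0
    for gy, gx in gts:
        best, best_dist = None, tau
        for i, (py, px) in enumerate(preds):
            dist = hypot(py - gy, px - gx)
            if i not in used and dist < best_dist:  # strictly inside tolerance
                best, best_dist = i, dist
        if best is not None:
            used.add(best)  # each prediction matches at most one target
            matched += 1
    return matched, len(gts)


def detection_probability(pairs, tau=3.0):
    """Eq. (23): matched ground-truth targets over all ground-truth targets."""
    matched = total = 0
    for pred, gt in pairs:
        m, n = match_targets(pred, gt, tau)
        matched, total = matched + m, total + n
    return matched / total if total else 0.0
```

For example, with two ground-truth targets of which only one has a predicted centroid within 3 pixels, `detection_probability` returns 0.5 regardless of how many pixels of the detected target overlap.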

False Alarm Rate (Fa)

Following [3], the false alarm rate is defined as the pixel-level false positive density:

Fa = \frac{FP}{\sum_{i = 1}^{N} H_{i} W_{i}}

(24)

where FP is the total number of false positive pixels and H_iW_i denotes the number of pixels in the i-th image. For readability, Fa is reported in units of 10⁻⁶. Notably, this definition reflects background false activation density and differs from the object-count-based false alarm metrics used in some prior work.
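As a worked illustration of Equation (24), the sketch below counts false positive pixels over a set of mask pairs and divides by the total pixel count. Masks are assumed to be nested lists of 0/1 values, and the function name and toy data are hypothetical.

```python
def false_alarm_rate(pairs):
    """Eq. (24): false positive pixels divided by all pixels in the test set."""
    false_positives = 0
    total_pixels = 0
    for pred, gt in pairs:
        for pred_row, gt_row in zip(pred, gt):
            for p, g in zip(pred_row, gt_row):
                total_pixels += 1
                if p and not g:
                    false_positives += 1
    return false_positives / total_pixels


# Two hypothetical 100 x 100 images with a single false positive pixel in total.
empty = [[0] * 100 for _ in range(100)]
one_false_pixel = [row[:] for row in empty]
one_false_pixel[10][10] = 1
pairs = [(one_false_pixel, empty), (empty, empty)]

fa = false_alarm_rate(pairs)   # 1 / 20000
fa_reported = fa * 1e6         # value as it would appear in units of 10^-6
```

Here one false pixel in 20,000 gives Fa = 5 × 10⁻⁵, which would be tabulated as 50 in units of 10⁻⁶. Because the metric counts pixels rather than components, one large false blob contributes more than several single-pixel false alarms.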

4.6. Multi-Model Comparison and Reporting

For fair comparison across multiple models and training runs, all evaluation results are stored in a unified format and aggregated on a per-dataset basis. To ensure comparability, performance metrics from different models are collected with consistent settings. Receiver operating characteristic (ROC) curves, which are defined in terms of detection probability (Pd) versus false alarm rate (Fa), are generated by varying the decision threshold over a predefined range.

For matched-Pd analysis, Fa values are interpolated from each model's ROC data at detection probability levels that all candidate models can reach, which allows direct comparison under equivalent operating conditions.
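A matched-Pd comparison of this kind can be sketched as below. The text does not state the interpolation scheme, so piecewise-linear interpolation between neighboring ROC points is an assumption here, and the model names and ROC values are hypothetical placeholders rather than measured results.

```python
def common_pd_range(curves):
    """Pd interval reachable by every model; curves maps name -> [(fa, pd), ...]."""
    low = max(min(p_d for _, p_d in pts) for pts in curves.values())
    high = min(max(p_d for _, p_d in pts) for pts in curves.values())
    return (low, high) if low <= high else None


def fa_at_pd(roc_points, target_pd):
    """Linearly interpolate Fa at target_pd; None if the curve never reaches it."""
    pts = sorted(roc_points, key=lambda point: (point[1], point[0]))
    if not pts or target_pd < pts[0][1] or target_pd > pts[-1][1]:
        return None
    for (fa0, pd0), (fa1, pd1) in zip(pts, pts[1:]):
        if pd0 <= target_pd <= pd1:
            if pd1 == pd0:
                return min(fa0, fa1)
            t = (target_pd - pd0) / (pd1 - pd0)
            return fa0 + t * (fa1 - fa0)
    return pts[0][0]  # single-point curve whose Pd equals the target


# Hypothetical ROC samples: Fa in units of 10^-6, Pd as a fraction.
curves = {
    "model_a": [(10.0, 0.80), (30.0, 0.90), (60.0, 0.95)],
    "model_b": [(20.0, 0.85), (50.0, 0.92), (90.0, 0.97)],
}
shared = common_pd_range(curves)
matched = {name: fa_at_pd(pts, 0.90) for name, pts in curves.items()}
```

Restricting the comparison to the shared Pd interval avoids extrapolating any model's curve beyond the operating points it actually reached.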

5. Results

5.1. Quantitative Comparison with Other Methods

To evaluate the effectiveness of CG-IRNet, it is compared with several recent state-of-the-art infrared small target detection methods on three benchmark datasets, including IRSTD-1K, NUAA-SIRST, and NUDT-SIRST. The detailed quantitative results are presented in Table 2.

All values in Table 2 are computed using the unified evaluation protocol in this work. Deep learning baselines are retrained from available implementations, and traditional methods are reproduced under the same metric implementation. Modifications are limited to input and metric compatibility, leaving the core architecture and intended algorithmic design unchanged.

On IRSTD-1K, CG-IRNet achieves 65.09 mIoU, 65.13 nIoU, 93.27% Pd, and 30.992 × 10⁻⁶ Fa. Compared with SCTransNet, which reaches the same detection probability, the proposed model shows a clear reduction in the false alarm rate (30.992 vs. 55.74). In practice, this suggests that the confidence-guided suppression mechanism helps filter out background-induced activations without sacrificing detection sensitivity. When compared to earlier CNN-based approaches such as ACM and ALCNet, CG-IRNet also tends to produce more structurally consistent predictions while reducing isolated false responses.

On NUAA-SIRST, CG-IRNet achieves 74.62 mIoU, 74.65 nIoU, 96.95% Pd, and 59.477 × 10⁻⁶ Fa. The relatively strong Pd across this dataset indicates that the model remains sensitive to weak targets under varying background conditions. At the same time, the false alarm rate stays within a comparable range, suggesting that the confidence-aware SLS formulation does not overly suppress valid detections.

On NUDT-SIRST, the proposed model reaches 92.27 mIoU, 92.30 nIoU, 98.62% Pd, and 12.662 × 10⁻⁶ Fa. Given the more complex and cluttered backgrounds in this dataset, the low false alarm rate is particularly noticeable. This behavior is consistent with the design of the multi-target structure-aware optimization, which appears to improve robustness against structured background interference.

Overall, as shown in Table 2, the proposed CG-IRNet maintains a relatively stable balance between detection probability and false alarm rate across all three datasets, rather than optimizing for a single metric.

5.2. Complexity Comparison Analysis

As shown in Table 3, computational complexity is reported for models that can be executed in the same PyTorch/ROCm environment. Parameter counts and MACs are computed using dummy inputs of size 1 × C × 256 × 256, where C is the number of input channels. Inference latency is measured as model-only forward time with batch size 1, FP32 precision, 50 warm-up iterations, and 500 measured iterations on an AMD Radeon RX 7900 XT. Data loading, metric computation, image saving, and connected-component post-processing are excluded from timing.

ACM and ALCNet are excluded from the unified runtime table because their official implementations are based on MXNet/Gluon and require an older CUDA/cuDNN software stack, making their inference latency not directly comparable with that of the PyTorch/ROCm models. PSTNN is a traditional iterative optimization method without trainable neural parameters or a fixed neural forward graph, so parameter and MAC reporting is not applicable. These methods are still included in the accuracy comparison in Table 2.
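The timing protocol (warm-up iterations followed by timed iterations of the forward pass only) can be sketched with a generic harness. This is an illustration, not the authors' benchmarking script: `forward` is any zero-argument callable, and `sync` is a hypothetical hook for a device synchronization call, which on a GPU would typically be needed so that asynchronous kernels finish before the clock is read.

```python
import time


def measure_latency(forward, warmup=50, iters=500, sync=None):
    """Return (ms_per_call, calls_per_second) for a zero-argument callable."""
    for _ in range(warmup):  # warm-up calls are executed but not timed
        forward()
    if sync is not None:
        sync()  # make sure warm-up work has finished before timing starts
    start = time.perf_counter()
    for _ in range(iters):
        forward()
    if sync is not None:
        sync()  # wait for queued device work before reading the clock
    elapsed = time.perf_counter() - start
    ms_per_call = 1000.0 * elapsed / iters
    fps = 1000.0 / ms_per_call if ms_per_call > 0 else float("inf")
    return ms_per_call, fps


# Stand-in workload; a real benchmark would call the model on a fixed dummy input.
call_log = []
ms, fps = measure_latency(lambda: call_log.append(sum(range(1000))),
                          warmup=2, iters=5)
```

The FPS column in Table 3 is approximately 1000 divided by the per-image time in milliseconds (for example, 3.4442 ms corresponds to roughly 290 FPS), which matches the conversion used in this sketch.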

5.3. ROC Curve Analysis

To further examine the trade-off between detection performance and false alarms, ROC curves (Pd vs. Fa × 10⁻⁶) are plotted for each dataset, as shown in Figure 3a–c.

On IRSTD-1K, CG-IRNet shows a clear advantage in the low-Fa region (<60 × 10⁻⁶), where it is able to maintain a relatively high Pd while suppressing background responses. In this strict operating range, the curve generally stays above most competing methods. This behavior is particularly relevant for practical scenarios, where false alarms often become the limiting factor.

On NUAA-SIRST, the difference becomes more noticeable in the mid-to-low Fa range. Under similar false alarm levels, CG-IRNet tends to achieve higher detection probability than most baselines. This suggests that the SCH loss contributes to more consistent localization, especially when the background conditions vary.

On NUDT-SIRST, CG-IRNet reaches near-saturation Pd relatively quickly while still keeping Fa at a low level. The shape of the curve indicates strong separation between target and background responses, which is consistent with the effect of the confidence-guided gating in suppressing spurious activations.

Across all three datasets, the ROC curves show a similar pattern: the proposed SCH loss together with the confidence-aware design improves the operating behavior of the detector, especially under low-false-alarm constraints.

5.4. Qualitative Analysis of False Alarm Suppression

To better understand how the proposed confidence-guided design behaves in practice, we perform a qualitative comparison between CG-IRNet and two representative deep learning detectors, SCTransNet and UIU-Net, on several challenging scenes from the IRSTD-1K dataset. The results are shown in Figure 4. The figure includes four representative samples from the testing subset (XDU925, XDU711, XDU447, and XDU219), where model predictions are displayed alongside the input images and ground-truth annotations. These samples are selected mainly because they exhibit noticeable differences in false alarms and missed detections across methods.

In the XDU447 example, one annotated target (marked by a dashed circle) appears visually ambiguous with respect to its surrounding background. Both SCTransNet and UIU-Net respond to this region as a target, while CG-IRNet suppresses the activation. This behavior reflects the more conservative decision tendency introduced by the confidence-aware suppression, although it may also lead to a potential trade-off in borderline cases.

More generally, SCTransNet and UIU-Net show different tendencies when dealing with challenging IRSTD scenes. SCTransNet tends to detect more weak targets and often achieves fewer missed detections among the compared models. However, this sensitivity is sometimes accompanied by additional false alarms, especially in regions with strong structural patterns, as indicated by the purple circles in Figure 4.

On the other hand, UIU-Net behaves in a more conservative manner. In the selected examples, it suppresses most background responses and produces very few false alarms. At the same time, this behavior can lead to missed detections when the target signal is weak or partially occluded. These missed targets are highlighted by the yellow circles in the figure.

Compared with these two approaches, CG-IRNet shows more balanced behavior. In the illustrated scenes, it is able to suppress most background-induced false alarms while still preserving correct detections. Although a small number of targets may still be missed in particularly difficult cases, the overall trade-off between false alarms and missed detections appears improved relative to both SCTransNet and UIU-Net.

This observation is consistent with the quantitative results reported earlier. As shown in Table 2 and the ROC curves in Figure 3, CG-IRNet achieves detection probability comparable to SCTransNet while maintaining a noticeably lower false alarm rate. This improved balance is likely due to the combined effect of frequency-aware contrast refinement and the confidence-aware prediction head, which together reduce overconfident background responses while preserving reliable target activations.

5.5. Ablation Study

To examine how each component contributes to the overall performance, a feature ablation study is conducted on IRSTD-1K, with the results summarized in Table 4.

With the full configuration, which includes SLS, frequency-enhanced LCE, and the confidence head, the model achieves a relatively balanced performance across mIoU, nIoU, Pd, and Fa. When the confidence head is removed, the false alarm rate increases noticeably, suggesting that the learned confidence map plays an important role in suppressing background-induced responses.

Similarly, removing the structural component in the SCH loss leads to a drop in IoU-related metrics, which indicates that centroid-aware multi-target supervision helps maintain structural consistency in the predictions. Disabling frequency-aware feature refinement (FAFR) also results in a slight decrease in Pd under cluttered conditions, implying that the use of multi-frequency information contributes to better target–background separation.

Overall, although each component affects performance in a different way, the results suggest that they work together to produce a more stable balance between detection accuracy and false alarm suppression.

5.6. Hyperparameter Sensitivity

To evaluate the robustness of the proposed training strategy, a hyperparameter sensitivity study in a one-factor-at-a-time (OFAT) style is carried out. In each experiment, a single parameter is varied while all other parameters remain fixed at the configuration used in the main experiments. The evaluated parameters include the mixing ratio in the second training phase for SCH loss, the localization weight λ_loc in the SLS formulation, the non-maximum suppression kernel size, the predicted region selection parameter top-k, and the binarization threshold used during evaluation. The results are summarized in Figure 5, where Figure 5a reports the best mIoU achieved under each configuration and Figure 5b shows the corresponding detection probability (Pd).
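The one-factor-at-a-time design can be sketched as a small configuration generator. The parameter names, default values, and grids below are hypothetical placeholders chosen for illustration; only the SLS weight of 0.8 and the evaluation threshold of 0.5 are stated in the text as main-experiment settings.

```python
def ofat_configs(defaults, grid):
    """Yield (name, value, config) with exactly one parameter changed at a time."""
    for name, values in grid.items():
        for value in values:
            if value == defaults[name]:
                continue  # the baseline configuration is run once, separately
            config = dict(defaults)  # copy so every run starts from the defaults
            config[name] = value
            yield name, value, config


# Placeholder defaults and grids (assumptions, not the authors' exact settings).
defaults = {
    "sls_weight": 0.8,   # BCE weight is taken as 1 - sls_weight
    "lambda_loc": 1.0,
    "nms_kernel": 5,
    "top_k": 5,
    "threshold": 0.5,
}
grid = {
    "sls_weight": [0.0, 0.5, 0.8, 1.0],
    "lambda_loc": [0.1, 0.5, 1.0],
    "nms_kernel": [5, 9],
    "top_k": [5, 20],
    "threshold": [0.01, 0.1, 0.5, 0.9],
}
runs = list(ofat_configs(defaults, grid))
```

Because every run differs from the baseline in a single parameter, any change in mIoU or Pd can be attributed to that parameter, at the cost of not revealing interactions between parameters.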

From Figure 5a, the SCH loss mixing ratio demonstrates moderate influence on segmentation quality. Among the tested settings, the balanced setting (0.5:0.5) gives the highest mIoU, while configurations dominated by BCE (1:0) or SLS (0:1) show a noticeable decrease in performance. This observation suggests that combining pixel-wise supervision and structure-aware localization objectives yields better segmentation accuracy than relying on either component alone.

The localization weighting parameter λ_loc exhibits a monotonic improvement trend within the explored range. Increasing λ_loc from 0.1 to 1 steadily improves mIoU, with the highest value observed at λ_loc = 1. This phenomenon indicates that stronger emphasis on localization-aware penalties helps refine target regions and suppress background responses.

For the SLS non-maximum suppression kernel, a smaller kernel size (5) yields slightly better performance than the larger setting (9), suggesting that overly aggressive spatial suppression may remove useful local responses around small targets. The predicted-region selection parameter top-k shows the most noticeable effect on segmentation accuracy. A smaller selection value (top-k = 5) produces the highest mIoU among all tested configurations, whereas increasing it to 20 leads to a visible performance drop, indicating that including too many candidate regions may introduce background noise during structural loss computation.

The evaluation threshold also has a noticeable impact on performance. As shown in Figure 5a, using an intermediate threshold (0.1) gives the best mIoU, while very high thresholds (e.g., 0.9) lead to a clear drop in segmentation accuracy, mainly due to missed detections.

The Pd curves in Figure 5b show a slightly different trend and highlight the trade-off between detection sensitivity and structural constraints. The highest detection probability is obtained when the threshold is set to 0.01, which makes the model more sensitive to faint targets. However, this also tends to increase background responses. As a result, slightly higher thresholds can provide a better balance when considering overall segmentation quality, especially in terms of mIoU.

These behaviors are generally consistent with the ROC analysis discussed earlier. In particular, configurations that favor high sensitivity (such as very low thresholds) tend to improve Pd but at the cost of more false alarms. In contrast, more balanced settings produce more stable segmentation results and better operating points along the Pd–Fa trade-off curve.

Overall, the sensitivity study suggests that CG-IRNet behaves relatively consistently across a range of parameter settings. Moderate changes in SCH-related parameters, such as the mixing ratio and SLS-related terms, only lead to small variations in Pd and mIoU. This indicates that the proposed framework does not rely heavily on precise tuning and remains stable under reasonable hyperparameter choices.

5.7. α Sensitivity

To analyze the influence of the confidence gate parameter α, a sensitivity study is conducted on IRSTD-1K. The parameter α controls the lower bound of the confidence gate: a smaller α imposes stronger suppression, and a larger α weakens the suppression effect of the gate. The suppression vanishes entirely at α = 1.0, where the gate becomes g = 1 and has no effect on the final probability map.
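The role of α can be illustrated with a minimal sketch. The gate form g = α + (1 − α)·c, where c is the predicted confidence in [0, 1], is an assumption made for this example: it is one simple form consistent with the description here (lower bound α, and g = 1 when α = 1), not necessarily the exact formulation used in the model. The function names and numeric inputs are hypothetical.

```python
def confidence_gate(confidence, alpha):
    """Assumed gate: bounded below by alpha, equal to 1 when alpha == 1."""
    return alpha + (1.0 - alpha) * confidence


def gated_probability(probability, confidence, alpha):
    """Modulate a segmentation probability by the (assumed) confidence gate."""
    return probability * confidence_gate(confidence, alpha)


# A background pixel with a strong response but low predicted confidence.
probability, confidence = 0.8, 0.1
no_gate = gated_probability(probability, confidence, alpha=1.0)      # 0.8
default_gate = gated_probability(probability, confidence, alpha=0.2)  # 0.8 * 0.28
full_gate = gated_probability(probability, confidence, alpha=0.0)     # 0.8 * 0.1

threshold = 0.5
kept_without_gate = no_gate >= threshold       # response survives thresholding
kept_with_default = default_gate >= threshold  # response is suppressed
```

Under this assumed form, lowering α shrinks low-confidence responses below the 0.5 threshold while leaving high-confidence responses nearly unchanged, which matches the qualitative trend described for the gate.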

As shown in Table 5, increasing α generally improves mIoU and nIoU slightly but also increases the false alarm rate. Fa increases from 30.3468 × 10⁻⁶ at α = 0.0 to 54.5824 × 10⁻⁶ at α = 1.0. This confirms that the confidence gate directly controls the conservativeness of the detector. We select α = 0.2 as the default setting because it maintains the same Pd as lower α settings while keeping Fa close to the lowest range. This setting provides a practical low-false-alarm operating point without causing a large loss in segmentation quality.

The sensitivity analysis was conducted as an independent follow-up run using the same evaluation protocol but a different training seed and checkpoint from the main benchmark table. The absolute values in Table 5 may therefore differ slightly from those in Table 2; the purpose is to compare the relative effect of α.

5.8. Failure Cases and Limitations

Figure 6 summarizes representative failure cases observed in CG-IRNet. In the missed weak target case, the target response is close to the surrounding background noise, and the model suppresses the activation. In the ambiguous target case, the model behaves conservatively when the target-like response is not sufficiently reliable, which helps reduce false alarms but can also remove borderline true targets. In the structured background case, strong local edges or cluttered background regions can still produce residual false responses. In the multi-target case, the model detects the more salient target responses but may partially miss nearby weaker targets, indicating that multi-instance scenes remain challenging when target contrast varies strongly.

These examples show that the proposed confidence-guided design introduces an explicit trade-off between false alarm suppression and maximum sensitivity. CG-IRNet is intentionally biased toward conservative prediction under uncertain conditions, which improves low-Fa behavior but may suppress extremely faint or ambiguous targets. Future work will explore adaptive confidence calibration, temporal cues, and target-aware post-processing to improve robustness in these borderline cases.

6. Conclusions

This paper presents CG-IRNet, a structure-aware framework for infrared small target detection built around the proposed Structure–Confidence Hybrid (SCH) loss. By combining centroid-based structural supervision with a confidence-guided background suppression mechanism, the model improves detection consistency while reducing false alarms in practice.

Experimental results on IRSTD-1K, NUAA-SIRST, and NUDT-SIRST show that CG-IRNet is able to maintain a relatively good balance between detection probability and false alarm rate, especially in low-Fa operating regions. The ablation study also suggests that both the multi-target SLS formulation and the confidence branch contribute to overall performance improvements.

Overall, CG-IRNet provides more reliable detection behavior against complex background conditions. The combination of structure-aware supervision and confidence-guided optimization may also be useful for other dense prediction tasks where controlling false responses is important.

Author Contributions

Z.Z. and M.L.: funding acquisition, investigation, methodology, project administration, resources, software; Z.Z. and M.L.: supervision, validation, visualization, writing—original draft, writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Cheng, Y.; Lai, X.; Xia, Y.; Zhou, J. Infrared Dim Small Target Detection Networks: A Review. Sensors 2024, 24, 3885.
2. Zhou, F.; Wu, Y.; Dai, Y.; Ni, K. Robust Infrared Small Target Detection via Jointly Sparse Constraint of l1/2-Metric and Dual-Graph Regularization. Remote Sens. 2020, 12, 1963.
3. Pang, Y.; Zhao, X.; Zhang, L.; Lu, H.; Fakhri, G.; Liu, X.; Lu, S. Rethinking Evaluation of Infrared Small Target Detection. arXiv 2025, arXiv:2509.16888.
4. Bai, X.; Zhou, F. Analysis of New Top-Hat Transformation and the Application for Infrared Dim Small Target Detection. Pattern Recognit. 2010, 43, 2145–2156.
5. Chen, C.P.; Li, H.; Wei, Y.; Xia, T.; Tang, Y.Y. A Local Contrast Method for Small Infrared Target Detection. IEEE Trans. Geosci. Remote Sens. 2014, 52, 574–581.
6. Wei, Y.; You, X.; Li, H. Multiscale Patch-Based Contrast Measure for Small Infrared Target Detection. Pattern Recognit. 2016, 58, 216–226.
7. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Asymmetric Contextual Modulation for Infrared Small Target Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Online, 5–9 January 2021; pp. 950–959.
8. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Attentional Local Contrast Networks for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9813–9824.
9. Li, B.; Xiao, C.; Wang, L.; Wang, Y.; Lin, Z.; Li, M.; An, W.; Guo, Y. Dense nested attention network for infrared small target detection. IEEE Trans. Image Process. 2022, 32, 1745–1758.
10. Wu, X.; Hong, D.; Chanussot, J. UIU-Net: U-Net in U-Net for infrared small object detection. IEEE Trans. Image Process. 2022, 32, 364–376.
11. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241.
12. Ying, X.; Liu, L.; Wang, Y.; Li, R.; Chen, N.; Lin, Z.; Sheng, W.; Zhou, S. Mapping Degeneration Meets Label Evolution: Learning Infrared Small Target Detection with Single Point Supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 15528–15538.
13. Yuan, S.; Qin, H.; Yan, X.; Akhtar, N.; Mian, A. SCTransNet: Spatial–Channel Cross Transformer Network for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15.
14. Wu, H.; Huang, X.; He, C.; Xiao, H.; Luo, S. Infrared Small Target Detection With Swin Transformer-Based Multiscale Atrous Spatial Pyramid Pooling Network. IEEE Trans. Instrum. Meas. 2024, 74, 1–14.
15. Zhang, M.; Xu, Q.; Wang, Y.; Li, X.; Yuan, H. MIRSAM: Multimodal vision-language segment anything model for infrared small target detection. Vis. Intell. 2025, 3, 4.
16. Fujieda, S.; Takayama, K.; Hachisuka, T. Wavelet Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–23 June 2018.
17. Ma, Q.; Deng, S.; Li, B.; Zhu, Z.; Song, Z.; Li, X.; Hu, H. DWTFreqNet: Infrared Small Target Detection via Wavelet-Driven Frequency Matching and Saliency-Difference Optimization. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5007815.
18. Li, Y.; Kang, W.; Zhao, W.; Liu, X. MLEDNet: Multi-Directional Learnable Edge Information-Assisted Dense Nested Network for Infrared Small Target Detection. Electronics 2025, 14, 3547.
19. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
20. Liu, Q.; Liu, R.; Zheng, B.; Wang, H.; Fu, Y. Infrared Small Target Detection with Scale and Location Sensitivity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024.
21. Mehrtash, A.; Wells, W.M., III; Tempany, C.; Abolmaesumi, P.; Kapur, T. Confidence Calibration and Predictive Uncertainty Estimation for Deep Medical Image Segmentation. arXiv 2019, arXiv:1911.13273.
22. Dolezal, J.M.; Srisuwananukorn, A.; Karpeyev, D.; Ramesh, S.; Kochanny, S.; Cody, B.; Mansfield, A.S.; Rakshit, S.; Bansal, R.; Bois, M.C.; et al. Uncertainty-informed Deep Learning Models Enable High-Confidence Predictions for Digital Histopathology. Nat. Commun. 2022, 13, 6572.
23. Wang, Y.; Zhang, H.; Zhang, X.; Zheng, X. A Gradient-Compensated Feature Learning Network for Infrared Small Target Detection. Electronics 2026, 15, 868.
24. Zhang, Z.; Yin, S. MixMambaNet: Hybrid Perception Encoder and Non-Local Mamba Aggregation for IRSTD. Electronics 2025, 14, 4527.
25. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
26. Dai, Y.; Li, X.; Zhou, F.; Qian, Y.; Chen, Y.; Yang, J. One-Stage Cascade Refinement Networks for Infrared Small Target Detection. arXiv 2022, arXiv:2212.08472.

Figure 1. Overall architecture of the proposed CG-IRNet framework. The network follows a multi-scale encoder–decoder structure with cross-scale semantic interaction. Local contrast enhancement and frequency-aware feature refinement are inserted into selected feature stages to strengthen small-target saliency under cluttered backgrounds. A separate confidence head estimates the reliability of predicted responses. The confidence map is used to form a gate, which modulates the final probability map and suppresses uncertain background activations.

Figure 2. Structure of the proposed Structure–Confidence Hybrid (SCH) loss. The final objective combines weighted BCE supervision with an augmented SLS branch. The SLS branch uses connected component decomposition and instance-wise target matching to support multi-target scenes, while the confidence branch penalizes high-confidence responses on background pixels. By jointly optimizing pixel-wise discrimination, structural localization, and confidence-aware background suppression, SCH loss encourages accurate target localization and conservative prediction behavior.

Figure 3. ROC curves of different models on (a) IRSTD-1K, (b) NUAA-SIRST, and (c) NUDT-SIRST.

Figure 4. Predicted binary maps on samples XDU925, XDU711, XDU447, and XDU219 by SCTransNet, UIU-Net, and CG-IRNet.

Figure 5. Hyperparameter sensitivity analysis of CG-IRNet on IRSTD-1K. (a) Best mIoU obtained in hyperparameter sensitivity analysis for CG-IRNet. (b) Best Pd obtained in hyperparameter sensitivity analysis for CG-IRNet.

Figure 6. Representative failure cases of CG-IRNet on IRSTD-1K. Each row shows the input image, ground-truth, prediction probability map, binary prediction, and overlay. The red dashed circles highlight the missed target or false alarm, and the green dashed circle highlights the correct target. The selected cases illustrate four common failure modes: missed weak targets (XDU203), over-conservative suppression of ambiguous targets (XDU103), partial misses in multi-target scenes (XDU406), and residual false alarms caused by structured background clutter (XDU733).

Table 1. Learned wavelet band scaling statistics of the FAFR module on IRSTD-1K. The scaling factors are initialized to 1.0 and optimized during training.

Band	Mean	Std	Min	Max	Median
LL	1.1120	0.0854	0.8979	1.3323	1.1169
LH	0.9454	0.0804	0.7242	1.1458	0.9282
HL	0.9438	0.0750	0.7928	1.0893	0.9304
HH	0.8895	0.0716	0.7238	1.0281	0.8952

Table 2. Comparison with SOTA methods on IRSTD-1K, NUAA-SIRST, and NUDT-SIRST. Best results across all models for each metric are underlined and bolded.

Model	IRSTD-1K				NUAA-SIRST				NUDT-SIRST
Model	mIoU	nIoU	Pd (%)	Fa (×10⁻⁶)	mIoU	nIoU	Pd (%)	Fa (×10⁻⁶)	mIoU	nIoU	Pd (%)	Fa (×10⁻⁶)
Max-Median	6.85	2.99	65.21	239.16	4.08	12.05	69.2	221.54	4.11	3.6	58.41	147.71
Top-Hat	9.85	7.28	75.11	5733.73	6.99	17.89	79.84	4052.05	20.28	28.37	78.41	667.47
PSTNN	24.05	17.55	71.99	141.18	29.66	32.96	72.8	196.16	14.54	23.08	66.13	176.86
RDIAN	55.26	58.47	88.55	106.63	67.28	73.81	93.54	173.33	74.68	77.48	95.77	138.38
ACM	57.99	55.83	93.27	261.38	67.48	67.73	91.63	60.98	59.84	63.05	93.12	221.1
ALCNet	59.33	55.94	92.98	235.44	69.34	69.56	94.3	144.74	63.38	65.79	94.18	138.58
DNA-Net	64.52	64.99	90.91	49.01	74.21	77.54	95.82	35.16	86.34	86.72	98.83	36.04
MTU-Net	64.72	61.91	93.27	147.35	73.21	76.63	93.54	89.53	73.28	75.91	93.97	187.99
UIU-Net	64.76	65.26	93.98	88.37	75.29	78.31	95.82	56.58	91.52	91.92	98.31	31.19
AGPCNet	64.9	63.86	92.83	52.53	74.1	74.99	96.48	60.02	87	88.74	97.2	40.12
ISTDU	64.97	62.52	93.6	212.61	73.93	78.06	96.58	58.22	87.67	88.58	97.67	53.81
SCTransNet	66.6	66.72	93.27	43	75.87	79.38	96.95	55.74	92.11	92.4	98.62	17.18
CG-IRNet	65.09	65.13	93.27	30.992	74.62	74.65	96.95	59.477	92.27	92.3	98.62	12.662

Table 3. Model complexity analysis of parameter sizes and inference time.

Model	Params/M	Trainable Params/M	MACs/G (THOP)	Time ms/img	FPS	Precision
RDIAN	0.2168	0.2166	3.7183	3.4442	290.3402	fp32
ISTDU-Net	2.7611	2.7611	8.5536	8.9638	111.5597	fp32
DNA-Net	4.6969	4.6969	14.2822	34.0017	29.4103	fp32
MTU-Net	12.7506	12.7506	6.2168	4.3678	228.9491	fp32
SCTransNet	11.3259	11.3259	10.1186	21.2289	47.1056	fp32
CG-IRNet	11.3457	11.3456	10.9317	33.4142	29.9274	fp32
AGPCNet	12.3605	12.3605	43.1808	51.8265	19.2952	fp32
UIU-Net	50.5415	50.5415	54.5011	33.3090	30.0219	fp32

Table 4. Feature ablation study on CG-IRNet. Best results across all tested ablations for each metric are bolded and underlined.

SLS	WAV	LCE	CONF	mIoU	nIoU	Pd (%)	Fa (×10⁻⁶)
1	1	1	1	75.21	78.36	96.20	55.09
0	1	1	1	75.18	77.29	95.82	51.86
0	0	1	1	73.03	75.21	95.44	94.74
0	0	0	1	75.21	76.15	95.82	86.85
0	0	0	0	74.90	77.42	96.96	89.39

Table 5. Sensitivity analysis of the confidence gate parameter α on IRSTD-1K.

α	Threshold	IoU	mIoU (%)	nIoU (%)	F1 (%)	Pd (%)	Fa (×10⁻⁶)	PixAcc (%)
0.00	0.50	0.6428	64.2782	63.3116	78.2553	92.9293	30.3468	99.9891
0.10	0.50	0.6453	64.5308	63.6289	78.4422	92.9293	32.2067	99.9891
0.20	0.50	0.6490	64.8993	64.1749	78.7138	92.9293	34.1425	99.9891
0.30	0.50	0.6519	65.1944	64.5072	78.9305	92.9293	36.5528	99.9891
0.50	0.50	0.6547	65.4704	64.7560	79.1324	92.9293	42.5120	99.9890
0.70	0.50	0.6559	65.5939	64.9848	79.2226	93.2660	47.9020	99.9889
1.00	0.50	0.6548	65.4796	64.8429	79.1392	93.2660	54.5824	99.9886

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
