This section begins by outlining the experimental settings, covering the datasets, implementation details, and evaluation metrics employed. Subsequently, a comprehensive comparison, which encompasses both quantitative and qualitative analyses, is conducted between the proposed GIDNet and existing methods. Finally, ablation studies are performed to verify the effectiveness of each core module within the network.
3.2. Evaluation Metrics
To quantitatively assess detection capability, three widely used criteria in IRSTD are adopted, namely the intersection over union ($\mathrm{IoU}$), probability of detection ($P_d$), and false alarm rate ($F_a$) [4, 5, 21, 25]. In the following tables, ↑ indicates that higher values are better, while ↓ indicates that lower values are better.
Among them, $\mathrm{IoU}$ characterizes the algorithm’s ability to preserve target morphology at the pixel level, with particular emphasis on the accuracy of boundary delineation [26]. Specifically, it is calculated as the ratio of the overlapping region between the prediction $P$ and the ground truth $G$ to their combined area:

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN},$$

where $TP$, $FP$, and $FN$ represent the numbers of true-positive, false-positive, and false-negative pixels, respectively. A larger $\mathrm{IoU}$ value indicates a greater overlap between $P$ and $G$, reflecting more accurate contour localization and boundary characterization of the target.
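As a rough illustration of the pixel-level ratio described above, the following sketch counts true-positive, false-positive, and false-negative pixels on two binary masks. It is a minimal sketch rather than the evaluation code used in this study: the names `pixel_iou` and `square_mask`, the nested-list mask format, and the convention for two empty masks are assumptions made for illustration.

```python
def pixel_iou(pred_mask, gt_mask):
    """Pixel-level IoU = TP / (TP + FP + FN) for two same-sized binary masks.

    Masks are nested lists of 0/1 values (an assumption of this sketch).
    """
    tp = fp = fn = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                tp += 1      # target pixel correctly predicted
            elif p:
                fp += 1      # background pixel predicted as target
            elif g:
                fn += 1      # target pixel that was missed
    denom = tp + fp + fn
    # Assumed convention: two empty masks count as a perfect match.
    return tp / denom if denom else 1.0


def square_mask(size, top, left, side):
    """Helper for the toy example: a size x size mask with one filled square."""
    return [[1 if top <= r < top + side and left <= c < left + side else 0
             for c in range(size)] for r in range(size)]


# Toy example: a 2x2 target predicted one column too far to the right.
gt = square_mask(6, top=2, left=2, side=2)
pred = square_mask(6, top=2, left=3, side=2)
iou = pixel_iou(pred, gt)  # TP = 2, FP = 2, FN = 2
```

In the toy example a one-pixel shift of a 2x2 target already reduces the overlap ratio to one third, which illustrates why this metric is so sensitive to boundary accuracy for tiny targets.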
$P_d$ is a target-level indicator used to quantify the proportion of successfully detected targets among all annotated targets [23, 24]. In this study, a prediction is regarded as correct when the Euclidean distance between the centroid of $P$ and that of $G$ is smaller than a predefined threshold $T$. Following [22, 25, 43], $T$ is fixed at 3 pixels in all experiments. Since all input images are resized to a fixed resolution before testing, this fixed threshold remains stable and fair for our evaluation. In practical applications, however, image sizes often vary across cameras, and a fixed threshold of 3 pixels may not transfer to them. To address this, future studies could design a dynamic threshold that adjusts according to the input image size. $P_d$ is computed as

$$P_d = \frac{T_{\mathrm{correct}}}{T_{\mathrm{all}}},$$

where $T_{\mathrm{correct}}$ denotes the number of targets that satisfy the centroid-matching criterion, whereas $T_{\mathrm{all}}$ represents the total number of ground-truth targets in the dataset. A higher $P_d$ value indicates superior detection sensitivity and demonstrates a stronger ability of the algorithm to identify tiny targets embedded in complex backgrounds.
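The centroid-matching criterion described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the evaluation code used in this study: targets are extracted as 8-connected regions (the connectivity is an assumption), masks are nested lists of 0/1 values, and the names `connected_regions`, `centroid`, and `probability_of_detection` are hypothetical. For brevity, one predicted region may match more than one ground-truth target, which a stricter one-to-one matching would forbid.

```python
import math
from collections import deque


def connected_regions(mask):
    """Return 8-connected foreground regions as lists of (row, col) pixels."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    regions = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, region = deque([(r, c)]), []
            while queue:  # breadth-first flood fill from the seed pixel
                y, x = queue.popleft()
                region.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            regions.append(region)
    return regions


def centroid(region):
    """Mean (row, col) position of a region's pixels."""
    n = len(region)
    return (sum(y for y, _ in region) / n, sum(x for _, x in region) / n)


def probability_of_detection(pred_mask, gt_mask, threshold=3.0):
    """P_d = (matched ground-truth targets) / (all ground-truth targets)."""
    pred_centroids = [centroid(reg) for reg in connected_regions(pred_mask)]
    gt_centroids = [centroid(reg) for reg in connected_regions(gt_mask)]
    if not gt_centroids:
        return 0.0  # assumed convention when no target is annotated
    detected = sum(
        1 for g in gt_centroids
        if any(math.dist(g, p) < threshold for p in pred_centroids)
    )
    return detected / len(gt_centroids)


# Toy example: two annotated targets, one detected, plus one stray response.
gt = [[0] * 12 for _ in range(12)]
gt[2][2] = 1
for r, c in [(8, 8), (8, 9), (9, 8), (9, 9)]:
    gt[r][c] = 1
pred = [[0] * 12 for _ in range(12)]
pred[3][3] = 1    # within 3 pixels of the first target's centroid
pred[0][10] = 1   # far from both targets
p_d = probability_of_detection(pred, gt)  # one of two targets is matched
```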
$F_a$ is defined as the proportion of false-alarm pixels to the total number of pixels in the dataset. This metric is commonly used to reflect the extent to which an algorithm is affected by background interference and noise contamination [5, 34]:

$$F_a = \frac{P_{\mathrm{false}}}{P_{\mathrm{all}}},$$

where $P_{\mathrm{false}}$ represents the total pixel area of predicted regions that do not match any ground-truth target (i.e., false alarms), and $P_{\mathrm{all}}$ represents the total pixel count of the dataset. Notably, a lower $F_a$ indicates better performance, demonstrating the model’s robustness in suppressing background artifacts and minimizing erroneous detections.
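The false alarm ratio described above can be sketched in a few lines once predicted regions have been extracted. The sketch below is illustrative only and rests on assumptions: predicted regions are given as lists of (row, col) pixels, a region counts as a false alarm when its centroid lies at least `threshold` pixels from every ground-truth centroid, and the function name `false_alarm_rate` and the 256 x 256 image size in the example are hypothetical.

```python
import math


def false_alarm_rate(pred_regions, gt_centroids, total_pixels, threshold=3.0):
    """F_a = (pixels of unmatched predicted regions) / (all pixels).

    pred_regions : list of regions, each a list of (row, col) pixels
    gt_centroids : list of (row, col) ground-truth target centroids
    total_pixels : number of pixels over all evaluated images
    """
    false_pixels = 0
    for region in pred_regions:
        n = len(region)
        cy = sum(y for y, _ in region) / n
        cx = sum(x for _, x in region) / n
        matched = any(math.dist((cy, cx), g) < threshold for g in gt_centroids)
        if not matched:
            false_pixels += n  # the whole region counts as a false alarm
    return false_pixels / total_pixels


# Toy example on a single hypothetical 256 x 256 image.
pred_regions = [
    [(10, 10), (10, 11)],                 # overlaps the annotated target
    [(100, 200), (100, 201), (101, 200)]  # background response, 3 pixels
]
gt_centroids = [(10.0, 10.5)]
f_a = false_alarm_rate(pred_regions, gt_centroids, total_pixels=256 * 256)
```

Because the denominator is the full pixel count, even several spurious regions yield a very small ratio, so small absolute differences in this metric can still reflect a meaningful difference in clutter suppression.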
Additionally, the receiver operating characteristic (ROC) curve is employed to assess detection behavior and model robustness under varying decision thresholds [44]. This curve depicts the relationship between the true positive rate (TPR) and the false positive rate (FPR), thereby providing a comprehensive view of the performance variation induced by threshold adjustment. By continuously changing the confidence threshold of the prediction results, a series of (FPR, TPR) points can be obtained to form the ROC curve:

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN},$$

where $TN$ denotes the number of true negatives. In general, better detection performance is indicated by a curve located closer to the upper-left corner, which corresponds to a higher TPR achieved at a lower FPR.
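The threshold sweep that produces the curve can be sketched as below. This is a minimal sketch under assumptions, not the procedure used in this study: the rates are computed per pixel from a flat list of confidence scores and 0/1 labels, a pixel is counted as positive when its score is greater than or equal to the threshold, and the name `roc_points` and the example values are hypothetical.

```python
def roc_points(scores, labels, thresholds):
    """Return one (FPR, TPR) pair per confidence threshold.

    scores : flat list of per-pixel confidences in [0, 1]
    labels : flat list of 0/1 ground-truth values, same length as scores
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        tpr = tp / positives if positives else 0.0  # TP / (TP + FN)
        fpr = fp / negatives if negatives else 0.0  # FP / (FP + TN)
        points.append((fpr, tpr))
    return points


# Toy example: six pixels, three of which belong to targets.
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]
thresholds = [0.9, 0.7, 0.5, 0.2]   # swept from strict to lenient
curve = roc_points(scores, labels, thresholds)
```

Lowering the threshold can only add detections, so both rates are non-decreasing along the sweep; a curve that rises steeply while the false positive rate is still near zero is the behavior associated with the upper-left corner.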
3.3. Quantitative Comparison
We compared GIDNet with 16 representative methods, spanning traditional methods, CNN-based methods, and hybrid CNN and transformer-based models. For the traditional methods, we selected eight well-established algorithms: the filter-based Top-Hat [7]; the local contrast-based TLLCM [11] and WSLCM [15]; and the patch-image-based models IPI [16], RIPT [17], NRAM [19], PSTNN [18], and MSLSTIPT [20]. For the deep learning methods, we selected seven representative architectures: foundational models such as UIU-Net [26] and MSHNet [6], along with several recent high-performance networks, including MMLNet [45], HDNet [46], SDS-Net [43], and SCTransNet [27]. Additionally, for L2SKNet [47, 48], both of its distinct architectural variants (L2SKNet-Unet and L2SKNet-FPN) were included in our evaluation. The methods are categorized as follows: Trad-F (filter-based traditional methods), Trad-C (local contrast-based traditional methods), Trad-L (low-rank-based traditional methods), CNN, and CNN-T (hybrid CNN and transformer-based methods).
Table 1 summarizes the quantitative evaluation on the IRSTD-1K dataset, where our GIDNet outperforms existing methods across all metrics. As observed from the data, traditional methods such as Top-Hat, IPI, and the more recent MSLSTIPT generally exhibit suboptimal performance. These methods struggle to balance a higher $P_d$ with a lower $F_a$, largely because hand-crafted features lack the representational power to distinguish dim targets from complex, cluttered backgrounds. For instance, while IPI achieves a relatively low $F_a$, its detection performance remains significantly lower than that of CNN competitors.
In contrast, CNN architectures demonstrate a substantial performance leap, underscoring the efficacy of deep feature representations. Among the recent SOTA models, SDS-Net and L2SKNet-Unet emerge as strong contenders. Specifically, SDS-Net achieves an $\mathrm{IoU}$ of 66.67% and a $P_d$ of 92.93%. However, our proposed GIDNet outperforms all compared methods across all three evaluation metrics. Notably, GIDNet achieves the highest $\mathrm{IoU}$ of 69.01% and the highest $P_d$ of 93.54%, surpassing the second-best method (SDS-Net) by 2.34 and 0.61 percentage points, respectively. In addition, SCTransNet, a CNN-T model, exhibits a remarkably low $F_a$ of 11.84%.
Furthermore, GIDNet demonstrates superior noise suppression capabilities, yielding the lowest $F_a$ among all compared methods. This represents a noteworthy improvement over L2SKNet-Unet, which previously held the leading record for noise suppression among the listed CNN models.
Table 2 summarizes the comparative results on the NUAA-SIRST dataset. As indicated by the tabulated data, traditional methods such as Top-Hat and IPI suffer from exceedingly high $F_a$ (> 10,000), revealing a significant limitation in suppressing complex background clutter. Although methods like PSTNN and RIPT achieve modest improvements, their performance remains substantially inferior to the precision required for reliable IRSTD.
In contrast, CNN methods demonstrate a decisive performance leap. Among the evaluated models, the proposed GIDNet achieves a remarkable $\mathrm{IoU}$, outperforming the majority of recent SOTA models, including UIU-Net, MMLNet, and the recently proposed SDS-Net. Notably, GIDNet alone attains a perfect probability of detection ($P_d$ = 100%), marginally surpassing the closely competing HDNet. This detection capability highlights the robustness of our model in identifying infrared targets without omissions, even under severely challenging conditions. However, SCTransNet, which belongs to the CNN-T category, did not perform well on the NUAA-SIRST dataset.
Furthermore, while HDNet maintains a slight superiority in $\mathrm{IoU}$, GIDNet exhibits the lowest false alarm rate ($F_a$) among all compared methods. This marks a notable improvement over other high-performing architectures, such as HDNet, L2SKNet-Unet, and SDS-Net. This balance between high detection sensitivity and effective background suppression underscores the structural advantages of our proposed network. In summary, the quantitative results validate that GIDNet achieves a favorable trade-off between localization accuracy and target integrity, establishing it as a robust and efficient architecture for IRSTD.
Table 3 presents a comprehensive quantitative evaluation on the NUDT-SIRST dataset. As demonstrated by the experimental results, CNN architectures consistently and significantly surpass traditional mathematical models across all three evaluation metrics. Traditional methods, such as MSLSTIPT and TLLCM, struggle severely with elevated $F_a$ and remarkably low detection scores, reflecting their limited robustness against complex background clutter and varying target scales.
Notably, our proposed GIDNet excels in false alarm suppression, achieving the best $F_a$ among the compared methods. This marks a discernible improvement even over the most recent and competitive baselines, such as HDNet and L2SKNet-Unet. Concurrently, GIDNet maintains a highly competitive $P_d$. While SDS-Net and HDNet report marginally higher detection probabilities, they do so at the cost of a higher $F_a$. This indicates that GIDNet strikes a superior balance, effectively isolating true infrared targets without misclassifying background artifacts.
With regard to pixel-level shape description, several SOTA networks, notably MMLNet, SCTransNet, and L2SKNet-Unet, achieve a higher $\mathrm{IoU}$ than GIDNet and thus exhibit superior segmentation completeness; nevertheless, the proposed network remains highly viable. In practical IRSTD scenarios, the operational priority is often heavily weighted toward minimizing false alarms ($F_a$) along with the reliable discovery of targets ($P_d$), both of which are dimensions where GIDNet demonstrates strong capability.
Although MMLNet achieves a superior $\mathrm{IoU}$, the lower score of GIDNet stems from a strategic emphasis on background clutter suppression rather than pixel-level morphological fidelity. As evidenced in Table 3, GIDNet achieves the most competitive false alarm rate ($F_a$) and a high detection probability ($P_d$). This suggests that while GIDNet slightly sacrifices boundary integrity due to aggressive feature filtering, it effectively minimizes false positives, making it more robust for practical infrared search and track systems, where precision in detection outweighs the need for perfect target shape reconstruction.
To comprehensively evaluate the robustness and detection capability of the proposed method against various background clutter, we plot the ROC curves across three datasets, as depicted in Figure 4. The ROC curves visualize the trade-off between the detection rate and the false alarm rate. It is worth noting that the x-axis is restricted to a narrow low-value range to emphasize model performance under highly stringent false alarm constraints, which is critical for practical IRSTD.
As shown in the graphical comparisons, GIDNet performs strongly across all evaluated benchmarks. In particular, on the IRSTD-1K dataset, GIDNet quickly surpasses the baseline models as the false alarm rate increases and ultimately reaches the highest detection rate, highlighting its ability to preserve target detection capability while minimizing background interference.
On the NUAA-SIRST dataset, GIDNet remains consistent. While models like SDS-Net and UIU-Net produce strong early responses, GIDNet follows closely behind them, positioning itself in the top-performing cluster and outperforming other models, such as MSHNet and the L2SKNet variants.
Notably, the advantage of GIDNet is particularly apparent on the NUDT-SIRST dataset. At extremely low false alarm rates, GIDNet achieves a rapid rise in detection rate, surpassing all other networks in the early portion of the curve. This rapid approach to a near-perfect detection rate ensures highly sensitive responses to faint infrared targets, even in challenging environments.
In comparison, SCTransNet, known for its efficient CNN-T architecture, also shows commendable performance, though it slightly trails GIDNet in detection sensitivity at low false alarm rates. SCTransNet performs well under certain conditions but does not consistently outperform GIDNet across all datasets.
In summary, the analysis of the ROC curves consistently demonstrates that GIDNet achieves a favorable balance between target localization accuracy and false alarm suppression, thereby presenting a robust and effective framework for IRSTD tasks.
3.4. Qualitative Comparison
As illustrated in Figure 5, we conduct a qualitative analysis to evaluate the performance of our GIDNet against SOTA methods, including SDS-Net, HDNet, MMLNet, L2SKNet, SCTransNet, MSHNet, and UIU-Net. Due to page constraints, only representative visual examples are shown; they are drawn from the three datasets and cover various challenging scenarios.
Upon examining the original IRIs in the first column, two primary challenges for IRSTD are evident. First, the targets exhibit extremely small sizes and diverse morphologies. For example, in the first and second rows, the targets occupy only a few pixels and are buried in significant noise. In the fourth row, multiple targets of varying sizes appear simultaneously. Existing methods like UIU-Net, MSHNet, SDS-Net, and L2SKNet-FPN occasionally suffer from missed detections (marked by blue boxes) or incomplete shape segmentation when dealing with such multi-scale targets.
Second, the presence of complex background clutter and low SNR poses a considerable risk of false alarms. In the fifth and sixth rows, which feature building structures and heavy cloud interference, most comparative methods, such as SDS-Net, HDNet, MMLNet, L2SKNet, MSHNet, and UIU-Net, struggle to differentiate real targets from high-brightness background artifacts, resulting in numerous false alarms (marked by yellow boxes).
In contrast, our GIDNet demonstrates superior robustness and precision. GIDNet preserves the spatial integrity of diminutive targets while simultaneously mitigating the interference of pixel-level artifacts. As shown in the third and fifth rows, GIDNet produces far fewer false alarms than HDNet and SDS-Net. Additionally, in the multi-target scenario in the fourth row, our model successfully identifies every target instance with precise boundaries, demonstrating its spatial context modeling capability. In summary, GIDNet excels in both target highlighting and background clutter suppression, producing results that are most consistent with the ground truth.
3.5. Ablation Experiments
To assess the effectiveness of the proposed architecture, we evaluate the individual contributions of the GISC and SFP modules. The incremental performance gains resulting from these components are summarized in Table 4 and Table 5.
As reported in Table 4, the baseline model exhibits suboptimal performance, yielding an $\mathrm{IoU}$ of 60.78% and a
of 85.8%. Upon integration of the GISC module, a substantial performance gain is observed, with
$\mathrm{IoU}$ escalating to 68.18% and $F_a$ being significantly suppressed to 11.32%. Similarly, the SFP component contributes independently, increasing $\mathrm{IoU}$ to 63.73%. Ultimately, the complete GIDNet architecture demonstrates the synergistic effect of both modules, achieving the best metrics across all categories, with a peak $\mathrm{IoU}$ of 69.01% and a minimum $F_a$ of 10.32%.
The generalizability of the proposed framework is further validated on the NUAA-SIRST dataset. The baseline configuration achieves an $\mathrm{IoU}$ of 70.86% with an $F_a$ of 27.99%. With the inclusion of the GISC component, the model’s discriminative capability is notably strengthened, with $\mathrm{IoU}$ increasing to 78.05% and $P_d$ reaching 99.07%. The best performance is again delivered by the complete GIDNet, which achieves a perfect $P_d$ of 100% and a low $F_a$ of 6.74%. These results reinforce the conclusion that the integration of GISC and SFP is vital for high-precision infrared small target detection.
To investigate the influence of the threshold hyperparameter of the GISC module on model performance, we conduct a sensitivity analysis by varying its value. Table 6 presents the quantitative results of this internal ablation, conducted on the IRSTD-1K dataset. The threshold is varied from its largest tested value down to 0, and performance is evaluated with three metrics: $\mathrm{IoU}$, $P_d$, and $F_a$. Notably, as the threshold decreases over the first part of this range, a favorable trend in detection performance is observed, characterized by a steady increase in both $\mathrm{IoU}$ and $P_d$ alongside a concurrent reduction in $F_a$.
However, decreasing the threshold further toward 0 results in performance degradation, characterized by a reduction in both $\mathrm{IoU}$ and $P_d$, accompanied by a deterioration in $F_a$. Consequently, the intermediate setting is identified as the optimal threshold, yielding the best results across all three metrics (Table 6). Based on these empirical findings, we select it as the final value for our proposed framework.