1. Introduction
Infrared Search and Tracking (IRST) systems play a crucial role in various civilian and military applications [1]. They are widely utilized for tasks such as pinpointing heat sources during firefighting operations and detecting abnormalities in medical applications [2,3]. These systems make use of emitted or reflected infrared radiation from objects to accomplish target detection, tracking, and identification [4,5]. Their capacity to identify targets becomes particularly valuable in situations where visual identification is hindered or impractical, such as scenarios involving camouflaged targets, distant targets, or challenging weather conditions [6].
Detecting infrared small targets presents a significant challenge in IRST applications. These targets are characterized by their small size (typically less than 9 × 9 pixels, or less than 0.15% of the field of view) and by their lack of detailed texture and shape information [7]. Consequently, distinguishing small infrared targets from complex backgrounds is difficult, as these backgrounds often contain elements (such as complex terrain, man-made structures, and clouds) that reflect solar radiation in patterns resembling the targets [8,9,10]. As shown in Figure 1, as scene complexity increases, more false alarm sources emerge and the saliency of the targets decreases.
Numerous algorithms have been developed for infrared small target detection (ISTD). These algorithms can be broadly grouped into two categories: multi-frame and single-frame methods [11].
Multi-frame methods detect targets by leveraging the relative motion between targets and background across the frames of a sequence [12,13,14]. They assume a relatively static background and require the accumulation of information from multiple frames to determine the targets' locations. Their advantage is that temporal information is used to enhance detection; however, they have several limitations: (1) they naturally do not apply to single-frame scenarios; (2) their effectiveness is hindered when the relative-motion assumption is not met; and (3) in practical applications, there is a strong demand for fast detection using as few frames as possible. Moreover, multi-frame detection can be approached by detecting targets in each frame with a single-frame method and then analyzing the trajectories of the detected targets [15]. Therefore, single-frame detection methods are widely studied, and we focus on them in this paper.
Single-frame detection methods can be grouped into two categories: non-deep learning and deep learning methods. Non-deep learning methods can be further divided into background-suppression, target-enhancement, image-structure-based, and classifier-based methods. Background-suppression methods segment target regions by subtracting an estimated background from the input image [16,17,18]. Target-enhancement methods employ local contrast and saliency calculations to search for or amplify the target regions [19,20,21,22,23,24]. Image-structure-based methods assume a mathematical model of a low-rank background and sparse targets and solve for the target regions through optimization techniques [25,26,27,28]. Classifier-based methods typically combine candidate extraction, feature extraction, and feature classification [29,30]. These non-deep learning methods describe the target and background with interpretable mathematical models; however, their modeling relies heavily on prior knowledge, so their effectiveness is often compromised in complex scenes, where manually designing constraints and features that accurately discriminate between targets and false alarms becomes problematic.
Deep learning methods automate feature extraction for targets using deep neural networks (DNNs). Among them, target-detection networks estimate bounding boxes to approximate the locations of small targets [31,32,33,34], while target-segmentation networks address target detection as a semantic segmentation problem [35,36,37,38,39], aiming to identify the locations of small targets at the pixel level. The target-segmentation approach is gaining popularity because it more precisely identifies the regions containing small targets [7], indicating a promising direction for future progress. Although deep learning methods have achieved impressive progress in ISTD [40,41,42,43,44,45], they have notable drawbacks. First, existing methods concentrate mainly on modeling the targets and disregard the modeling of false alarm sources, which also carry useful information; in practical scenarios, detecting infrared small targets against densely cluttered backgrounds requires a combined effort to eliminate false alarm sources and detect the targets in order to minimize the false alarm rate. Second, the opaque nature of DNNs is a significant limitation, because interpretability is crucial for a risk-sensitive task such as ISTD.
Taking into account the practical needs and the advantages and drawbacks of the aforementioned methods, the objectives of this research are to employ DNN techniques that make use of information from both targets and false alarm sources to achieve accurate and robust target segmentation, while maintaining interpretability.
To achieve these objectives, we present the Target and False Alarm Collaborative Detection Network for Infrared Imagery (TFCD-Net). Our network addresses the need to incorporate information from false alarm sources by utilizing specialized False Alarm Source Estimation Blocks (FEBs). To enhance interpretability, we designed the overall framework of our network to resemble the background suppression process employed in non-deep learning methods. Specifically, the network first suppresses false alarm sources by subtracting the results of multiple FEBs from the input image and then segments the targets using a Target Segment Block (TSB).
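To make the data flow concrete, the following PyTorch-style sketch illustrates the serial suppress-then-segment forward pass described above. The FEB and TSB internals are stand-in placeholders (simple convolutional stacks), and the class name, channel width, and number of FEBs are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TFCDNetSketch(nn.Module):
    """Illustrative sketch: serial false alarm suppression followed by target segmentation.
    The FEB/TSB bodies below are placeholder convolutional stacks, not the paper's blocks."""

    def __init__(self, num_febs: int = 2, channels: int = 16):
        super().__init__()
        # Each FEB estimates a false-alarm-source map that is subtracted from its input.
        self.febs = nn.ModuleList([self._make_block(channels) for _ in range(num_febs)])
        # The TSB segments targets from the suppressed image.
        self.tsb = self._make_block(channels)

    @staticmethod
    def _make_block(channels: int) -> nn.Sequential:
        # Placeholder block: single-channel in, single-channel out.
        return nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1),
        )

    def forward(self, x: torch.Tensor):
        fa_estimates = []
        for feb in self.febs:
            fa = feb(x)              # estimated false alarm sources at this stage
            fa_estimates.append(fa)
            x = x - fa               # background-suppression-style subtraction
        target_mask = torch.sigmoid(self.tsb(x))  # pixel-level target probability
        return target_mask, fa_estimates

# Example: one forward pass on a dummy 256 x 256 single-channel image.
if __name__ == "__main__":
    net = TFCDNetSketch(num_febs=2)
    mask, fa_maps = net(torch.randn(1, 1, 256, 256))
    print(mask.shape, [m.shape for m in fa_maps])
```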
The major contributions of this research can be summarized as follows:
We propose a framework that effectively models both targets with TSB and false alarm sources with FEBs. This approach aims to address the challenges posed by complex and cluttered backgrounds while maintaining interpretability.
We introduce a dedicated FEB to estimate potential false alarm sources. By integrating multiple FEBs into the framework, false alarm sources are estimated and eliminated on multiple scales and in a blockwise manner. This block not only enhances the accuracy of our method, but can also serve as a preprocessing module that suppresses false alarm sources for other ISTD algorithms.
Extensive experiments on public datasets validated the effectiveness of our model compared to other state-of-the-art approaches. In addition to accurately detecting targets, our model can produce multi-scale false alarm source estimation results. These estimations are not just incidental outcomes; they can be used to generate false alarm source datasets that can contribute to further studies in the field.
4. Experiments
4.1. Settings
In the experiments, two public datasets were used: the NUAA-SIRST dataset [35] and the NUDT-SIRST dataset [44]. The NUAA-SIRST dataset consists of 427 images, and the NUDT-SIRST dataset contains 1327 images. These images cover various common infrared scenes, such as clouds, sea surfaces, urban environments, and ground scenes, and are relevant for both terrestrial and aerial ISTD tasks. All images in the datasets share the same resolution. We evenly divided each dataset into training and testing sets with a 50% split ratio.
The performance of the algorithms and networks was evaluated using several metrics: the probability of detection ($P_d$), false alarm rate ($F_a$), intersection over union ($IoU$), F1-score, and receiver operating characteristic (ROC). These metrics are defined as follows.
Probability of detection ($P_d$): measures the ability to detect true targets at the target level. It is defined as the ratio of the number of correctly detected targets to the number of true targets:
$$P_d = \frac{N_{\mathrm{det}}}{N_{\mathrm{true}}},$$
where $N_{\mathrm{det}}$ is the number of correctly detected targets and $N_{\mathrm{true}}$ is the number of true targets. A higher score indicates better detection capability.
False alarm rate ($F_a$): measures the rate of falsely detected target pixels at the pixel level, defined as
$$F_a = \frac{FP}{FP + TN},$$
where $FP$ and $TN$ are the numbers of pixel-level false positive and true negative detections, respectively. A lower score indicates fewer false positive detections.
Intersection over union ($IoU$): measures the accuracy of detection at the pixel level, defined as the area of overlap between the predicted and ground truth targets divided by the area of their union:
$$IoU = \frac{A_{\mathrm{inter}}}{A_{\mathrm{union}}},$$
where $A_{\mathrm{inter}}$ is the area of the intersection and $A_{\mathrm{union}}$ is the area of the union, both measured in pixels. A higher score indicates more accurate segmentation of the targets.
F1-score: measures the balance between precision and recall, defined as
$$F1 = \frac{2\,TP}{2\,TP + FP + FN},$$
where $TP$, $FP$, and $FN$ are the numbers of pixel-level true positive, false positive, and false negative detections, respectively. A higher score indicates better overall detection performance.
Receiver operating characteristic (ROC): measures the robustness of detection. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR). To improve the effectiveness of the evaluation, a series of evaluation metrics derived from 3D ROC curves, as proposed in [60,61], were used. Among these metrics, higher area under the curve (AUC) scores indicate better robustness, except for $AUC_{(F,\tau)}$, for which a lower score is better.
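For concreteness, the following NumPy/SciPy sketch shows one way the pixel-level metrics above ($F_a$, $IoU$, F1) and the target-level $P_d$ can be computed from binary prediction and ground truth masks. The centroid-distance matching rule and its 3-pixel threshold for $P_d$ are illustrative assumptions, not necessarily the exact evaluation protocol used here.

```python
import numpy as np
from scipy import ndimage

def pixel_metrics(pred: np.ndarray, gt: np.ndarray):
    """Pixel-level Fa, IoU, and F1 from boolean prediction/ground-truth masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    fa = fp / (fp + tn)                      # false alarm rate
    iou = tp / (tp + fp + fn)                # intersection over union
    f1 = 2 * tp / (2 * tp + fp + fn)         # F1-score
    return fa, iou, f1

def probability_of_detection(pred: np.ndarray, gt: np.ndarray, max_dist: float = 3.0):
    """Target-level Pd: a true target counts as detected if the centroid of some
    predicted component lies within max_dist pixels of its centroid
    (this matching rule is an illustrative assumption)."""
    gt_labels, n_true = ndimage.label(gt.astype(bool))
    if n_true == 0:
        return 1.0
    pred_labels, n_pred = ndimage.label(pred.astype(bool))
    gt_centroids = ndimage.center_of_mass(gt.astype(float), gt_labels, range(1, n_true + 1))
    pred_centroids = ndimage.center_of_mass(pred.astype(float), pred_labels, range(1, n_pred + 1))
    detected = 0
    for gc in gt_centroids:
        if any(np.hypot(gc[0] - pc[0], gc[1] - pc[1]) <= max_dist for pc in pred_centroids):
            detected += 1
    return detected / n_true
```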
For the model setup, the network structures displayed in Figure 4 were employed in the ablation study, and structure A3 in Figure 4 was employed in the comparative experiments. Regarding the loss function, the weighting coefficient was set to 0 during the initial 50 training epochs and adjusted to 0.2 thereafter. The optimizer was Adaptive Moment Estimation (ADAM), with an initial learning rate of 0.001 scheduled to decay by a factor of 0.1 every 50 epochs. Training spanned 200 epochs with a batch size of 8.
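A hedged PyTorch sketch of this training configuration (ADAM, initial learning rate 0.001, decay by 0.1 every 50 epochs, 200 epochs, and a loss coefficient switched from 0 to 0.2 after epoch 50) is given below; the model, data loader, and the two loss terms are placeholders standing in for the paper's actual components.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

def train(model, train_loader, segmentation_loss, false_alarm_loss, device="cuda"):
    # model, train_loader, and the two loss functions are placeholders.
    optimizer = Adam(model.parameters(), lr=1e-3)
    scheduler = StepLR(optimizer, step_size=50, gamma=0.1)  # decay LR by 0.1 every 50 epochs

    for epoch in range(200):
        coeff = 0.0 if epoch < 50 else 0.2  # auxiliary-term weight enabled after 50 epochs
        model.train()
        for image, gt_mask in train_loader:  # batch size 8 is set in the DataLoader
            image, gt_mask = image.to(device), gt_mask.to(device)
            pred_mask, fa_estimates = model(image)
            loss = segmentation_loss(pred_mask, gt_mask) \
                 + coeff * false_alarm_loss(fa_estimates, image, gt_mask)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```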
Illustrative samples from the datasets are shown in Figure 5 and Figure 6, providing visual context for the types of infrared images used in the experiments.
4.2. Ablation Experiments
The overall framework of the network was subjected to ablation testing covering six structural configurations, as shown in Figure 4. Type-A networks use a serial connection composed of 0, 1, 2, or 3 FEBs coupled with a single TSB. Type-B networks use a stair-step connection consisting of either 2 or 3 FEBs and a single TSB. The intent of these experiments was to identify the optimal number of background-suppression stages and to validate the effectiveness of the stair-step connection.
The six network structures were trained and tested on the NUAA-SIRST and NUDT-SIRST datasets. The outcomes of these trials are presented in Table 1.
The results in Table 1 indicate that employing a stair-step connection (Type-B networks) does not enhance network performance. On the contrary, increasing the number of stages correlated with a decline in the performance metrics, a trend particularly pronounced on the NUDT-SIRST dataset. This dataset typically features smaller targets, and we hypothesize that the Type-B networks' upsampling and downsampling procedures introduce disturbances to the edge characteristics of these targets. Although the Type-B structures proved less effective, we retain these records as a reference for future studies.
For the Type-A networks, incorporating two FEBs resulted in the best performance; increasing the number of blocks beyond that did not further enhance network capacity. Based on these findings, we chose structure A3 for the comparative experiments.
4.3. False Alarm Source Estimation Capability
Experiments were carried out to validate the ability of the proposed model to estimate false alarm sources. For a detailed analysis, the A4 framework was selected based on its performance in the ablation experiments. The inputs and outputs of the model's successive stages, depicted in Figure 7, were examined; these range from the network input, which corresponds to the original image (S1in), to the network output (S4out).
The experiments were performed on multiple scenes, with results depicted in Figure 8, Figure 9, Figure 10 and Figure 11. Figure 8 and Figure 10 show two dense-clutter scenes from the NUDT-SIRST dataset. In these figures, the first row shows the inputs of each stage, while the second row shows the corresponding outputs, with red circles indicating the targets. The 3D visualizations of these scenes are presented in Figure 9 and Figure 11, respectively.
From Figure 8 and Figure 10, it is evident that the false alarm sources were progressively suppressed from S1in to S4in while the target regions were preserved. This was achieved by the gradual suppression of non-target edge areas, making the targets more salient and easier for the final TSB to process. The outputs from S1out to S3out correspond to the outputs of the FEBs. It can be observed that the suppression of false alarm sources progresses from finer to coarser details and from higher to lower frequencies, suggesting that each stage of the network estimates and suppresses the most salient remaining non-target edge regions. By stage S3, where most high-frequency regions have already been suppressed, the FEB begins estimating the low-frequency fluctuations of the background. Notably, the FEBs do not suppress the target area at any stage, as is evident from S1out, S2out, and S3out, thus ensuring the preservation and increasing saliency of the target. This characteristic is particularly evident in the images of S2in, S3in, and S4in depicted in Figure 9.
4.4. Comparative Experiments
The proposed TFCD-Net was evaluated and compared with state-of-the-art methods, including the optimization-based methods NRAM [46], PSTNN [47], and SRWS [48], as well as the deep learning methods ACM [35], ALCNet [43], RDIAN [40], UNet [54], ISTDUNet [45], and DNANet [44].
For the non-deep learning methods, the original parameters from their respective publications were employed. The DNN models were trained using the ADAM optimizer with a batch size of 8 and an initial learning rate of 0.001 for 200 epochs, with the learning rate decayed by a factor of 0.1 every 50 epochs. The soft IoU loss was used for training the DNN models.
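As a reference, the soft IoU loss mentioned here can be written as a differentiable one-minus-IoU over the predicted probability map; the following minimal sketch is one common formulation and may differ in detail from the variant actually used for the comparison models.

```python
import torch

def soft_iou_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft (differentiable) IoU loss: 1 minus the soft intersection-over-union
    between the predicted probability map and the binary ground truth mask.
    Expects 4D tensors of shape (N, C, H, W)."""
    prob = torch.sigmoid(logits)                       # predicted probabilities in [0, 1]
    intersection = (prob * target).sum(dim=(1, 2, 3))  # soft intersection per sample
    union = (prob + target - prob * target).sum(dim=(1, 2, 3))
    return (1.0 - (intersection + eps) / (union + eps)).mean()
```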
The performance of the proposed TFCD-Net and the comparison methods is demonstrated on six representative scenes from the NUAA-SIRST and NUDT-SIRST datasets. The output results and 3D visualizations for these scenes are depicted in Figure 12, Figure 13, Figure 14, Figure 15, Figure 16 and Figure 17. It can be observed that, in scenes 1 to 5, the three non-deep learning methods failed to detect some targets, highlighting the limited stability of manual modeling in complex environments. This was particularly evident when the assumptions of sparse targets and low-rank backgrounds were not met, such as in heavily cluttered scenes and scenes with larger targets.
Figure 14 shows a complex scene where most comparison methods failed to detect the target. While ISTDUNet and DNANet detected the target, they lacked precision in segmentation; in contrast, the proposed model segmented the target accurately. In Figure 16, larger targets posed a challenge for models such as ACM and ALCNet, which produced imprecise contours. In Figure 17, RDIAN, UNet, ISTDUNet, and DNANet segmented the single target into multiple parts, potentially affecting precise localization and subsequent identification in practical applications; for instance, UNet's detection was more than three pixels away from the true center of the target, which is significant given the small size of the targets. The proposed TFCD-Net performed well across all six scenes, especially on the more complex NUDT-SIRST dataset, as shown in Figure 12, Figure 13 and Figure 14.
For a quantitative assessment, Table 2 and Table 3 present the comparative results of the proposed TFCD-Net and the other methods on the NUAA-SIRST and NUDT-SIRST datasets.
From Table 2 and Table 3, it can be observed that, for the $P_d$ value, the proposed TFCD-Net achieved the highest score on the NUDT-SIRST dataset and the second-best on the NUAA-SIRST dataset, comparable to the performance of DNANet, confirming its effectiveness in detecting small targets alongside the leading methods. For the $IoU$ and F1 scores, the proposed method achieved the highest values on the NUDT-SIRST dataset and the second-best on the NUAA-SIRST dataset, second to RDIAN, which demonstrated slightly more precise target segmentation there. The RDIAN model, with its MPCM-inspired convolutional kernel design, showed limited generalizability on the NUDT-SIRST dataset, where the proposed TFCD-Net maintained its high performance.
For the $F_a$ score, the proposed method outperformed all other deep learning methods on both datasets, demonstrating its superior capability in suppressing false alarms. It is important to note that the SRWS algorithm, while not a DNN-based approach, showed a significantly lower $F_a$ score, which may be attributed to its lower $P_d$ and $IoU$ values and a tendency to output smaller target areas, as can also be inferred from the results in Figure 12, Figure 13, Figure 14, Figure 15, Figure 16 and Figure 17.
The robustness of the methods was analyzed by plotting the 3D ROC-derived curves shown in Figure 18 and Figure 19 and calculating the AUC values presented in Table 4 and Table 5. The proposed TFCD-Net achieved the best scores on the NUDT-SIRST dataset and competitive AUC scores among the DNN approaches on the NUAA-SIRST dataset. A low $AUC_{(F,\tau)}$ score combined with high scores on the other AUC metrics suggests effective background suppression and strong target responses, indicating the robustness of the proposed TFCD-Net.
Further analysis of Table 2, Table 3, Table 4 and Table 5 and Figure 18 and Figure 19 indicates that the deep learning methods significantly exceed the non-deep learning methods in all metrics except the Fa value. Non-deep learning, model-driven algorithms are constrained by the need to manually model small target features, applying constraints such as shape or sparsity and extracting specific components from images. While such designs do not rely on large datasets, their scope is limited by their modeling bias. In contrast, DNN models learn features that minimize the loss function through backpropagation under appropriate loss settings. The proposed TFCD-Net, which uses FEBs for progressive false alarm suppression combined with a TSB for target segmentation, achieves correct target detection and precise segmentation across various conditions.
For the evaluation of complexity, Table 6 compares the DNN models on the two datasets. The experiments were performed on an RTX 3090 GPU using Python. Our model has a medium number of parameters compared to the other models. Notably, its inference speed was 4.203 ms per image, and its training durations were 3.9463 s per epoch on the NUAA-SIRST dataset and 12.4757 s per epoch on the NUDT-SIRST dataset. Although our approach did not surpass the speed of ACM, ALCNet, and RDIAN, it was significantly faster than UNet, ISTDUNet, and DNANet in both training and inference. These results show that our model achieves high performance with a comparatively modest parameter count, and its fast inference time makes it suitable for real-time applications.
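Per-image inference latency of the kind reported here can be measured, for example, with a simple GPU timing loop such as the sketch below; the warm-up count, iteration count, and the 256 × 256 input size are illustrative choices rather than the exact benchmarking setup used in the experiments.

```python
import time
import torch

@torch.no_grad()
def measure_inference_ms(model, image_size=(1, 1, 256, 256), warmup=20, iters=200, device="cuda"):
    """Average per-image inference latency in milliseconds on a CUDA device."""
    model = model.to(device).eval()
    dummy = torch.randn(*image_size, device=device)
    for _ in range(warmup):        # warm-up runs to exclude kernel setup and caching effects
        model(dummy)
    torch.cuda.synchronize()       # make sure warm-up work has finished before timing
    start = time.perf_counter()
    for _ in range(iters):
        model(dummy)
    torch.cuda.synchronize()       # wait for all queued GPU work before stopping the clock
    return (time.perf_counter() - start) / iters * 1000.0
```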
5. Conclusions
In this paper, we introduced TFCD-Net for detecting small targets in infrared imagery. To reduce the false alarm rate, we utilized FEBs to model and estimate false alarm sources. For enhanced interpretability, we proposed a framework that resembles the background suppression process used in non-deep learning approaches. The experimental results demonstrate that our model outperforms other state-of-the-art methods, achieving the highest and second-highest scores in $P_d$, $IoU$, and AUC across the two datasets, while attaining the lowest $F_a$ among the DNN methods. This performance is achieved through the collaboration of multi-stage progressive suppression of false alarm sources by the FEBs and target segmentation by the TSB. The multi-stage framework of TFCD-Net, which remains end-to-end, not only provides a path for improving the performance of existing algorithms, but also yields multi-scale false alarm source estimates that provide valuable data for subsequent studies. For example, the FEBs can act as preprocessing modules to suppress complex backgrounds in current and future algorithms for detecting small targets in infrared imagery. Furthermore, the estimated false alarm sources can serve as samples for generating datasets to train models specialized in false alarm source detection, filling a current gap in the field, and they can be used to augment existing datasets to increase the diversity of object types.
Regarding limitations, the multi-stage structure gives our model more parameters than lighter models, leading to longer training times. In addition, while the FEBs are intended to suppress false alarm sources, there may be instances where true targets are suppressed as well.
Future efforts could focus on enhancing the architecture to improve robustness and interpretability, and on applying the framework to datasets acquired from additional sensor types or containing multiple target categories.