Article

Infrared Dim and Small Target Detection Based on Background Prediction

1
Underwater Vehicle Laboratory, School of Information Science and Engineering, Ocean University of China, Qingdao 266000, China
2
State Key Laboratory of Safety and Control for Chemicals, SINOPEC Research Institute of Safety Engineering Co., Ltd., 339 Songling Road, Qingdao 266100, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(15), 3749; https://doi.org/10.3390/rs15153749
Submission received: 10 May 2023 / Revised: 12 July 2023 / Accepted: 25 July 2023 / Published: 27 July 2023

Abstract

Infrared dim and small target detection is a key technology for various detection tasks. However, because such targets lack shape, texture, and other information, detecting them is challenging. Moreover, since many traditional algorithms ignore the global information of infrared images, they generate false alarms in complicated environments. To address this problem, in this paper, a coarse-to-fine deep learning-based method was proposed to detect dim and small targets. First, a coarse-to-fine detection framework integrating deep learning and background prediction was applied to detect targets. The framework contains a coarse detection module and a fine detection module. In the coarse detection stage, a Region Proposal Network (RPN) is employed to generate masks in target candidate regions. Then, to further optimize the result, inpainting is utilized to predict the background using the global semantics of the image. Specifically, an inpainting algorithm with a mask-aware dynamic filtering module was incorporated into the fine detection stage to estimate the background at the candidate targets. Finally, experimental comparisons with existing algorithms indicate that the proposed framework has effective detection capability and robustness in complex surroundings.


1. Introduction

Infrared dim and small target detection technology plays an essential role in target detection fields [1]. Nevertheless, due to long imaging distances and immature infrared imaging technology, the target is usually tiny (fewer than 80 pixels [2]) with a low signal-to-clutter ratio (SCR) and lacks shape and texture features. In addition, background and noise in complex environments can easily overwhelm targets. Therefore, infrared dim and small target detection is a difficult and challenging task.
In order to extract targets from complex and variable backgrounds, many techniques have been applied to infrared dim and small target detection. Early methods based on background prediction [3,4,5] detect small targets from the difference between the infrared image and a background predicted from local information of the image. They may perform poorly on complex backgrounds, since background predictions obtained in the local domain do not match the real background. To improve detection quality, some methods focus on features of targets to suppress the background and enhance the target [6,7,8,9,10]. Owing to the salience of local features (local contrast, local gradient, etc.), these approaches effectively improve the SCR of the target and suppress clutter. However, targets may not be detected in images with buildings and highlights. Sparse matrix methods [11,12,13] for small target detection exploit non-local correlation to separate small targets. They transform the problem of target detection into one of sparse matrix separation but, owing to the limitations of the sparse model, many false alarms are retained. In addition, some methods identify small targets using local structure information [14,15]. The infrared patch-tensor model (IPT) was designed to extract more effective local information from the background's non-local self-similarity. Recently, deep learning-based methods [16,17,18] segment the target and background based on image features extracted by deep neural networks. Because the deep network extracts both deep and shallow features of the target, detection accuracy is significantly improved. Nevertheless, the above methods each have disadvantages, and many false alarms may remain in their results on account of the lack of global or local information.
In this paper, a background prediction-based method with coarse and fine detection was studied. Traditionally, background prediction-based methods for infrared dim and small target detection estimate the background by using the local correlation of pixels through filters and transformations. Although these methods perform well on smooth and simple backgrounds, they ignore the global information of the infrared image and may fail to detect the target correctly in complex scenes containing buildings and highlighting noise. In contrast, deep learning can adapt to varying scenarios by learning from a large number of training samples. Additionally, deep learning can accomplish various tasks by understanding the global semantic information of images [19], such as small target detection [20], image inpainting [21], and face recognition [22]. Inspired by deep learning, a background prediction method utilizing image inpainting was proposed to detect dim and small targets in complex scenarios.
In this paper, the global information of the image was utilized to estimate the local scene in an IR image. Here, two prior assumptions of small targets were proposed to appropriately illustrate the background predictions of the paper. The assumptions are demonstrated as follows.
(1)
Because of the size of the target, the remaining image after eliminating the target has a negligible impact on the image background semantics. Meanwhile, the semantics are able to predict the background in small target areas.
(2)
The predicted background at the false target is similar to that of the original image. This paper assumed that the background clutter (building edges, highlighting noise, etc.) detected as the target is small. The background at these false targets is estimated by the method through information outside of these targets, which theoretically should be similar to the pixel values of the original image.
From the above assumptions, the coarse-to-fine detection framework requires a technique that predicts the background of suspicious target areas by understanding the image. Image inpainting, a classical computer vision problem, has exactly this ability: it recovers the whole image by analyzing deep and shallow semantic features with a neural network, adding reasonable inferences in the missing areas of the image [23]. However, many inpainting algorithms are designed to handle large contiguous areas and ignore the inner connection between small isolated areas and the whole image. In this paper, an inpainting method with a mask-aware dynamic filtering (MADF) [24] module was adapted to estimate the background of candidate targets. Owing to the small size of the target, the candidate target areas are correspondingly tiny; under the two prior assumptions, the inpainting method must therefore be able to repair tiny areas. The method with MADF addresses the problem of repairing small and isolated areas in addition to large and continuous ones.
In this paper, the detection framework was a structure containing coarse and fine detection. Deep learning was utilized to generate the candidate target areas in the coarse detection module, where an existing algorithm was employed for detection; this module is not the focus of this paper. Fine detection includes inpainting, which predicts the background of candidate target areas, and image fusion, which detects small targets. Comparative experiments on test data demonstrate that the proposed framework is capable of detecting infrared small targets. Overall, the technical contributions of the paper are summarized as follows.
  • A coarse-to-fine infrared dim and small target detection framework was proposed to adapt to complex infrared image scenes. In the coarse and fine detection modules, deep learning is utilized to detect candidate target areas and refined targets, respectively.
  • An image inpainting method with MADF was first employed to predict the background using global semantic information in the stage of fine detection.
The remaining sections of this paper are organized as follows. In Section 2, a brief review of related works for infrared dim and small target detection is presented. Section 3 introduces the architecture of the proposed framework in detail. Section 4 describes the experimental results and other terms relevant to the experiment. Section 5 includes a discussion of the results obtained qualitatively and quantitatively. Conclusions are drawn in Section 6.

2. Related Works

As an important technique for infrared detection fields, infrared dim small target recognition has been widely developed and applied in various detection tasks in recent years. Current methods for infrared dim small target detection are broadly classified into conventional methods and deep learning-based methods.
Traditionally, infrared dim and small target detection methods are generally divided into background prediction-based, local feature-based, sparse matrix-based, and infrared tensor model-based methods. In the initial stage, background prediction was employed to detect infrared dim and small targets, for example, the mean filter [25] and the top-hat transform [26]. More recently, high-order statistics [3], an effective background model in the Fourier domain [4], and improved bilateral filtering [5] have been adopted into background prediction to detect small targets. These methods may produce many false alarms in complex scenarios, since the background predicted via local correlation (the information correlation between the pixels around the target and the pixels in the nearby neighborhood) may deviate from the real background. Some researchers utilize a wide variety of features (local contrast, local gradient information, etc.) as prior information to detect targets. Chen et al. [6] proposed a method based on the local contrast measure (LCM) to enhance small targets and suppress the background using local statistics. The multiscale patch-based contrast measure (MPCM) [27] develops the LCM to further increase the contrast between target and background. The weighted strengthened local contrast measure (WSLCM) [28] considers the characteristics of the target and the background and the difference between them. Many methods originate from the LCM, such as the multiscale tri-layer local contrast measure (TLLCM) [9], the multidirectional derivative-based weighted contrast measure (MDWCM) [29], and the strengthened robust local contrast measure (SRLCM) [30]. Moreover, other local characteristics have been found to describe small targets. Local intensity and gradient properties (LIG) [31] characterize two local properties of small targets, intensity and gradient, to obtain targets. The average absolute gray difference (AAGD), weighted by cumulative directional derivatives, detects small infrared targets [32]. Local features can enhance small targets well; nevertheless, since local features lack target-background association, buildings and highlighted backgrounds are identified as targets in complex images. Some methods regard the infrared image as a matrix, transforming the detection problem into one of sparse matrix separation. The infrared patch-image model (IPI) [13] utilizing local patch construction was proposed to detect targets. Total variation weighted low-rank constraint (TVWLR) [11] and non-convex rank approximation minimization joint l2,1 norm (NRAM) [33] developed the IPI model, proposing different low-rank matrix theories to detect small targets. The sparse matrix model is established from the non-local correlation (the information association between pixels or blocks in the whole image) of infrared images; therefore, it is challenging for these methods to accurately detect small targets in complex backgrounds. Some methods find local structural information helpful for small target detection. Dai et al. [14] designed a reweighted infrared patch-tensor (RIPT) model to detect small targets using local structural information. The RIPT transforms the IPI into an infrared patch tensor (IPT) to extract more useful local information from the background's non-local self-similarity. The partial sum of the tensor nuclear norm (PSTNN) [15] was proposed to obtain targets by constraining the low-rank background tensor.
However, although local spatial information and local tensor structure priors are considered, the structural information of the image is insufficient to accurately extract targets in complex backgrounds. Overall, because they ignore either global semantic information or local features, the above methods generally cannot adapt to complex and changeable environments.
Considering that deep learning can adapt to different scenarios by learning from training samples, it is rapidly being utilized in a variety of fields, and many researchers have proposed deep learning networks for infrared small target detection. Ref. [34] decomposes the detection task into two sub-tasks to reduce either miss detection (MD) or false alarm (FA). Attentional local contrast networks (ALCNet) [16] fuse attention mechanisms and local contrast into neural networks to detect targets. The Dense Nested Attention Network (DNANet) [35] detects small targets through contextual information enhanced by channel-spatial attention. Attention-Guided Pyramid Context Networks (AGPCNet) [36] enhance small targets and suppress context by combining global attention and local semantics. The Interior Attention-Aware Network (IAANet) [17] increases the interior relation between target and background pixels using a Transformer and a Region Proposal Network (RPN) to detect small targets. The Local Patch Network (LPNet) [37] is an infrared small target detection network fusing local information and global attention. The robust infrared small target detection network (RISTDNet) [20] increases the effectiveness and robustness of detection by incorporating handcrafted features and neural networks. The above methods utilize neural networks to extract target features, automatically learning to classify target features to detect dim and small targets. Although they achieve superior performance, many of them disregard some contextual information and the correlation between targets and background, which can lead to the loss of small targets and the existence of false alarms. Since the down-sampling operation discards the features of small targets, deep networks may fail to detect them; nevertheless, the influence of down-sampling on the background information is virtually negligible, so deep learning can still understand global semantic information. Hence, in this paper, a coarse-to-fine detection framework based on deep learning was studied to compensate for deep networks missing the features of small targets. Moreover, image inpainting repairs target candidate areas using the global information (the whole features of the image) of the whole image in the fine detection stage.
Inpainting is a very important image processing technology in the computer vision field and a typical application of deep learning, restoring missing areas via image understanding [38]. Pathak et al. [39] proposed an image inpainting algorithm combining a CNN and a GAN, which reproduces the missing areas using the information around them. Yang et al. [40] produced high-frequency details using the joint optimization of image content and texture constraints, addressing the task of filling large holes in high-resolution images. Liu et al. [41] presented a partial convolution method that masks and renormalizes the convolution, restricting it to valid regions to address image inpainting with arbitrary masks. DeFLOCNet [42] generates the user-intended editing results via deep features guided by low-level controls (sparse sketch lines, color dots, etc.). Probabilistic Diverse GAN (PD-GAN) [21] modulates the deep features of random noise to restore images from coarse to fine. There are other deep learning-based image inpainting methods, such as the image inpainting detection network (IID-Net) [43] and the GAN for pluralistic image inpainting (PiiGAN) [44]. The above algorithms reconstruct the missing regions by understanding the semantic information of the image with neural networks, and they repair high-resolution RGB images with or without constraints. In this paper, an inpainting method with a mask-aware dynamic filtering (MADF) module [24] was employed to estimate the background of infrared images in local areas. Following the two prior assumptions in Section 1, the target candidate areas are small and their background is predicted by inpainting. Compared with other methods, the method with MADF is able to generate convolution kernels for areas of any size, which makes it suitable for background prediction in infrared small target images.

3. Proposed Method

In this section, the structure of the coarse-to-fine infrared detection framework is described in detail. Firstly, the architecture of the proposed method for infrared dim and small target detection is introduced, which is composed of a coarse detection module and a fine detection module. Then, the coarse detection module with Region Proposal Network (RPN) is elaborated to roughly detect small infrared targets. Target candidate areas are generated by RPN, where the candidate targets are obtained by Threshold Segmentation (TS). Ultimately, the design of inpainting to estimate background in the fine detection stage is demonstrated. The target is the difference between the image and the background prediction.

3.1. Model Architecture

Theoretically, an infrared image containing dim and small targets is modeled as Equation (1), where the image is regarded as the sum of the target and the background with noise:

I = T + B(N),      (1)

where I, T, and B(N) denote the infrared image, the target, and the background with noise, respectively. This paper investigates how to predict the background B(N); once B(N) is obtained by the proposed method, the target T follows directly from Equation (1). The overall architecture of the proposed method is shown in Figure 1.
In order to accurately predict the background, the process of obtaining B(N) is divided into two parts: predicting the background over a large area by acquiring candidate targets, and estimating the background at the candidate targets by inference from the background outside them. Correspondingly, the detection framework consists of two components: the coarse detection module and the fine detection module. The coarse detection module takes the infrared image I and generates the mask that represents the candidate targets. The local regions are obtained by RPN, and there may be many candidate target regions. The infrared image is extracted on each of the target candidate areas, in which the mask is obtained by threshold segmentation. In the fine detection stage, inpainting serves as a background prediction technique that estimates the background within these masks via global semantic information. The infrared image and the mask are fed to inpainting to predict the background at the candidate targets. Following the two assumptions in Section 1, since the mask corresponding to the candidate targets is tiny, inpainting can learn the global information of the background outside the mask and reconstruct the background within it. Therefore, the key to detecting dim and small targets is that the image inpainting algorithm accurately estimates the background at the candidate targets. The masked image I_m is calculated by multiplying the infrared image I with the mask. I_m and the mask are both fed to the fine detection module to acquire the repaired image I_rd, which in this paper was taken as the predicted background at the candidate targets. In the Image Fusion module, the detection result is obtained as the difference between the infrared image I and the repaired image I_rd.
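To make the data flow concrete, a minimal sketch of this pipeline under Equation (1) is given below; the three injected callables (run_rpn, threshold_segment, inpaint_madf) are hypothetical placeholders for the modules detailed in Sections 3.2 and 3.3, not the paper's implementation.

```python
import numpy as np

def detect_small_targets(image, run_rpn, threshold_segment, inpaint_madf):
    """image: float32 array in [0, 1] of shape (H, W); the three callables
    stand in for the coarse detection, segmentation, and inpainting modules."""
    boxes = run_rpn(image)                   # coarse detection: candidate boxes
    mask = threshold_segment(image, boxes)   # 0 at candidate targets, 1 elsewhere
    i_m = image * mask                       # I_m: image with candidates removed
    i_rd = inpaint_madf(i_m, mask)           # I_rd: predicted background B(N)
    return np.clip(image - i_rd, 0.0, None)  # T = I - B(N) (Image Fusion)
```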

3.2. Coarse Detection Module

Conventionally, many methods utilize certain manual features to detect small targets, which can obtain small targets with a pure background. However, as they lack distinction and connection between target and background, many algorithms for infrared small and dim target detection often suffer from high false alarm rates and poor accuracy in complex images. In order to eliminate the detection of false alarm regions, a coarse-to-fine infrared detection framework was proposed. In this part, the coarse detection module is introduced to roughly detect infrared dim and small targets.
The core detection method of the module is the Region Proposal Network (RPN), which classifies small targets and backgrounds from deep semantic features via proposal boxes. The Region Proposal Network was initially proposed as part of Faster R-CNN [45]. In this paper, it was applied to generate target candidate regions that roughly indicate possible targets. The architecture of the module with RPN is displayed in Figure 2. To better adapt to the detection of small infrared targets, the RPN optimized in [17] was utilized in our module. The down-sampling operation removes small image features during feature extraction by neural networks; to address this problem, a residual network was employed to extract the features of infrared images for the RPN. The infrared image within each candidate target region is segmented to obtain the mask that represents the candidate targets. The mask is then used for estimating the background of the infrared image.
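For concreteness, a minimal PyTorch sketch of such a coarse-detection network is given below: one 7 × 7 convolution, six residual blocks, and a five-channel detection head, matching the description in the next paragraph. The channel width and the absence of down-sampling are assumptions chosen to preserve small-target features, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels))

    def forward(self, x):
        return torch.relu(x + self.body(x))  # identity shortcut keeps small-target features

class CoarseRPN(nn.Module):
    def __init__(self, width: int = 64):
        super().__init__()
        self.stem = nn.Conv2d(1, width, kernel_size=7, padding=3)          # 7 x 7 first conv
        self.blocks = nn.Sequential(*[ResidualBlock(width) for _ in range(6)])
        self.head = nn.Conv2d(width, 5, kernel_size=1)                     # score + 4 box offsets

    def forward(self, x):                     # x: (B, 1, 256, 256) infrared image
        return self.head(self.blocks(torch.relu(self.stem(x))))

out = CoarseRPN()(torch.zeros(1, 1, 256, 256))  # -> torch.Size([1, 5, 256, 256])
```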
As the input of RPN, the infrared image I is processed to acquire target candidate regions I_p. The features of the infrared image are extracted by a convolutional layer and six residual blocks. Finally, a convolutional layer converts the features into detection scores and anchor boxes. These anchor boxes describe the approximate ranges of the candidate targets. In I_p, the region corresponding to each detected rectangle is filled with black to indicate the domain of the candidate target. The coarse detection module uses ResNet18 as the backbone network to extract feature information from infrared images, where the first convolutional layer is replaced with a convolutional layer with a 7 × 7 kernel. The final detection layer is a convolutional layer with five output channels. Subsequent processing of the coarse detection module is performed within these boxes. The original image I is chunked by the candidate regions to extract the original pixels as candidate blocks I_block. In every block of I_block, the mask is calculated by simple threshold segmentation: a mask value whose corresponding block pixel is higher than the threshold is set to 0; otherwise, it is set to 1. After threshold segmentation, these blocks are merged into the mask. Since the candidate targets represent areas that may belong to real targets, the areas outside the candidate targets are considered background without any targets. Thus, in the mask, the values of the areas outside the candidate targets are set to 1 to indicate the background.
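A sketch of the mask construction just described might look as follows; boxes are (x0, y0, x1, y1) pixel coordinates, and the mean-plus-k-standard-deviations threshold inside each block is an illustrative assumption, since the paper only specifies "a simple threshold".

```python
import numpy as np

def build_mask(image: np.ndarray, boxes, k: float = 1.5) -> np.ndarray:
    """image: float32 array (H, W); boxes: iterable of (x0, y0, x1, y1)."""
    mask = np.ones_like(image, dtype=np.float32)   # 1 = background everywhere
    for (x0, y0, x1, y1) in boxes:
        block = image[y0:y1, x0:x1]                # candidate block I_block
        thresh = block.mean() + k * block.std()    # assumed mean + k*std rule
        mask[y0:y1, x0:x1] = np.where(block > thresh, 0.0, 1.0)
    return mask
```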

3.3. Fine Detection Module

Traditionally, background prediction methods for infrared dim and small target detection estimate the background by filtering and transforming, lacking an understanding of the global information of the image, which leads to detection results containing false alarm interference from the background. Hence, in this paper, aiming to obtain a better background prediction, a deep learning-based inpainting algorithm was adopted to predict the background in the fine detection stage. Image inpainting is a technique that restores missing areas from the semantic features of images. The features are extracted outside the masked regions, i.e., the mask is 0 at the target candidate areas and 1 elsewhere. Since infrared dim and small targets occupy fewer than 80 pixels, the masked candidate targets are very small. Therefore, the results after image inpainting can theoretically be considered as the background in these areas. Generally, the background is reasonably inferred by image inpainting via understanding the original image outside the mask.
Tracing back to the coarse detection module, there are many target candidate areas where direct segmentation is not sufficient for accurate detection due to the interference of noise and bright backgrounds. There are two circumstances in these areas: (1) An area totally belongs to the background. Under this circumstance, the background is incorrectly considered as the target, and the subsequent task is to re-judge the area as background by inpainting. (2) An area includes a small target. The key to detecting it is that inpainting must follow global semantic information to generate the background prediction for the area.
The masks, obtained from the target candidate regions using threshold segmentation, have small and isolated shapes. However, many existing image inpainting algorithms repair RGB images using large and continuous masks and lack an understanding of the semantic information of infrared dim and small target images. Thus, these algorithms are not suitable for repairing infrared dim and small target images. In this paper, an image inpainting method with MADF was utilized to better predict the background of infrared images. The algorithm is able to generate the convolutional kernels following the shape of the masks and thus achieves a good inpainting effect on RGB images with small and isolated masks.
Figure 3 shows the architecture of the fine detection module. In the fine detection stage, inpainting employs the cascaded neural network structure with mask awareness proposed in [24]. The structure is an encoder-decoder network. The whole network has seven layers with the same structure and different network parameters, and every layer includes an encoder module and two decoder modules. The encoder has two functions: (1) extracting the shallow and deep features for understanding the global features and semantics of the images by convolution layers; (2) integrating mask awareness into the feature maps using the MADF module. The decoder is divided into a recovery decoder and a refinement decoder. The recovery decoder is designed to roughly restore the missing areas of images from deep to shallow; the network is a process in which images with masks are gradually repaired. Each layer takes as input the output of the previous layer and the feature map of the previous stage at the same depth. After the recovery decoder, the feature map is optimized by the refinement decoder to generate delicate inpainting results. The image with the mask is fed to the encoder with MADF to generate feature maps, which the decoder uses to recover the image. Meanwhile, these feature maps are loaded into the deeper encoder layers to generate deep features. The final repaired result is obtained by the recovery decoder and the two refinement decoders. To ensure that the image inpainting algorithm adapts to infrared dim and small target images, the cloud dataset shown in Figure 4 was used to train the inpainting algorithm with MADF. Finally, the detection result is obtained as the difference between the infrared image and the final repaired result in the Image Fusion module.
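The MADF encoder itself generates filter kernels conditioned on mask features [24]; for intuition only, the simpler partial-convolution-style masked convolution below (in the spirit of [41]) illustrates how a convolution can be restricted to valid (mask = 1) pixels, renormalized, and the mask updated layer by layer. It is a simplification, not the MADF module.

```python
import torch
import torch.nn.functional as F

def masked_conv(x, mask, weight, eps=1e-8):
    """x: (B, C, H, W) features; mask: (B, 1, H, W) with 1 = valid pixel;
    weight: (C_out, C, k, k) convolution kernel."""
    k = weight.shape[-1]
    out = F.conv2d(x * mask, weight, padding=k // 2)    # convolve valid pixels only
    window = torch.ones(1, 1, k, k, dtype=mask.dtype)
    valid = F.conv2d(mask, window, padding=k // 2)      # valid-pixel count per window
    out = out * (k * k) / (valid + eps)                 # renormalize by coverage
    new_mask = (valid > 0).to(mask.dtype)               # holes shrink layer by layer
    return out, new_mask

# Toy usage: an 8-channel feature map with a 10 x 10 hole at a candidate target.
x = torch.randn(1, 8, 64, 64)
m = torch.ones(1, 1, 64, 64)
m[..., 20:30, 20:30] = 0
y, m2 = masked_conv(x, m, torch.randn(16, 8, 3, 3))
```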

4. Results

In this section, the sky cloud dataset for training the image inpainting method is introduced, and the contents of the datasets and the evaluation metrics used in the experiments are described. Then, several existing infrared small target detection methods are compared with the proposed one. Finally, qualitative and quantitative analyses and a discussion of the experimental results are presented.

4.1. Datasets

Most existing image inpainting methods specialize in processing RGB images and are therefore not directly applicable to infrared images. To obtain high-quality repaired results for infrared images, a dataset of pure infrared cloud images was built for the training phase of image inpainting. The dataset is composed of various cloud images, acquired by cropping and segmenting original images in MATLAB 2019b, and contains 43,500 infrared background images including cirrus, stratus, and cirrocumulus clouds, etc. Part of the dataset is displayed in Figure 4.
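The cropping step was performed in MATLAB; a hedged Python equivalent is sketched below, tiling each source image into non-overlapping 256 × 256 single-channel patches. The directory layout, file pattern, and patch size are illustrative assumptions, not the authors' exact procedure.

```python
from pathlib import Path
from PIL import Image

def crop_patches(src_dir: str, dst_dir: str, size: int = 256) -> None:
    """Tile every PNG in src_dir into size x size grayscale background patches."""
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for i, path in enumerate(sorted(Path(src_dir).glob("*.png"))):
        img = Image.open(path).convert("L")            # single-channel infrared
        w, h = img.size
        for y in range(0, h - size + 1, size):
            for x in range(0, w - size + 1, size):
                patch = img.crop((x, y, x + size, y + size))
                patch.save(Path(dst_dir) / f"cloud_{i:05d}_{y}_{x}.png")
```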
Moreover, to evaluate the performance of the proposed framework, three public datasets were utilized: the SIRST dataset [46], the IRSTD-1k dataset [47], and the NUDT-SIRST dataset [18]. The SIRST dataset consists of 427 typical images with various scenes isolated from hundreds of real videos and is widely applied to detect infrared dim and small targets [46]. The IRSTD-1k dataset is a publicly available dataset consisting of many infrared dim and small target images with complex backgrounds; it includes 1000 images with targets of varying shapes, low-contrast and low-SCR backgrounds with clutter and noise, and the corresponding ground-truth images [47]. The NUDT-SIRST dataset is composed of 1024 composite images of five main background scenes and a few real IR images, proposed in [48]. The MFIRST dataset [34] includes 9956 training images and 100 test images drawn from realistic infrared sequences and synthetic infrared images [46]. MFIRST was utilized to train the model of the coarse detection module.

4.2. Implementation Details

The size of the original image was set to 256 × 256. To ensure that the Region Proposal Network acquired target candidate areas, the size of the anchor box was fixed at 10 × 10. In the proposed framework, the RPN and the image inpainting algorithm were trained separately. The RPN was trained with a loss function similar to that of [17] on the MFIRST dataset [34]. Our image inpainting model was constructed on the basis of the network in [24], which has a favorable repair effect on high-resolution color images. Therefore, for convenience, one of the pre-trained models from [24] (the model for Places2 [41]) was employed as the starting point of our model. The irregular masks for training were taken from the mask datasets of [41]. The detailed parameters of the model were similar to those of the training process in [24]. The learning rate was set to 0.0002, and we trained our model for about 300 K iterations on a GeForce RTX 2080 GPU (8 GB) with a batch size of 8.
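For reference, the stated hyper-parameters can be gathered into one configuration, as in the sketch below; the Adam optimizer and the stand-in module are assumptions, since the paper specifies only the learning rate, batch size, and iteration count.

```python
import torch

# Hyper-parameters stated in this subsection; the optimizer choice is assumed.
config = {
    "image_size": 256,       # input images are 256 x 256
    "anchor_size": 10,       # fixed 10 x 10 anchor boxes in the RPN
    "batch_size": 8,
    "learning_rate": 2e-4,
    "iterations": 300_000,
}

model = torch.nn.Conv2d(1, 1, 3, padding=1)   # stand-in for the inpainting network
optimizer = torch.optim.Adam(model.parameters(), lr=config["learning_rate"])
```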

4.3. Evaluation Metrics

In this paper, in order to examine the performance of the proposed framework and compare the effectiveness of different methods, as with many segmentation-based dim small target detection methods, several evaluation metrics were adopted, such as precision rate, recall rate, F-measure, and the 3D receiver operating characteristic curve (3D-ROC).
F-measure: The F-measure is a primary evaluation metric in object segmentation methods [33]. It balances precision rate (Prec) and recall rate (Rec) to favorably represent the ability to precisely detect targets with fewer false alarms. The Prec and Rec are defined as follows.
Prec = TP / (TP + FP),      (2)
Rec = TP / (TP + FN),      (3)
where TP and FP denote the number of successfully detected target pixels and the number of background pixels incorrectly detected as targets, respectively, and FN denotes the number of ground-truth target pixels mistakenly recognized as background. The F-measure is defined from Prec and Rec as [34]:
F-measure = (1 + β²) · Prec · Rec / (β² · Prec + Rec),      (4)
where β is a constant; in this paper, it was set to 1, so the F-measure is referred to as the F1-measure (F1).
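A pixel-level sketch of Equations (2)-(4) on binary detection and ground-truth masks is given below, with β = 1 so that the returned value is the F1-measure; the function name and array conventions are illustrative.

```python
import numpy as np

def f_measure(pred, gt, beta=1.0):
    """pred, gt: boolean arrays of the same shape (True = target pixel)."""
    tp = float(np.sum(pred & gt))    # correctly detected target pixels
    fp = float(np.sum(pred & ~gt))   # background detected as target
    fn = float(np.sum(~pred & gt))   # target pixels missed
    prec = tp / max(tp + fp, 1.0)    # Equation (2)
    rec = tp / max(tp + fn, 1.0)     # Equation (3)
    f1 = (1 + beta**2) * prec * rec / max(beta**2 * prec + rec, 1e-12)  # Equation (4)
    return prec, rec, f1
```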
3D Receiver Operating Characteristic: The 3D receiver operating characteristic (3D-ROC) is a target-level metric used to record and represent the relationship between the detection probability (Pd), the false alarm rate (Fa), and the threshold (τ). The detection probability and false alarm rate are defined as [35]:
Pd = T_true / T_gt,      (5)
Fa = P_false / P_image,      (6)
where T_true and T_gt denote the number of truly detected targets and the number of ground-truth targets, respectively, and P_false and P_image represent the number of falsely detected pixels and the total number of pixels in the images. Here, if the distance between the centers of the ground truth and the predicted result is less than four pixels, the predicted result is regarded as a correct detection; otherwise, it is regarded as false. The threshold τ is varied from 0 to 255 at intervals of 255/50.
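A sketch of the target-level Pd/Fa computation described above is given below, using scipy.ndimage connected components and the four-pixel centroid rule; the helper name and the matching strategy details are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def pd_fa(score, gt, tau, dist=4.0):
    """score: grayscale detection map (0..255); gt: boolean target mask."""
    pred = score > tau
    pred_lab, n_pred = ndimage.label(pred)
    gt_lab, n_gt = ndimage.label(gt)
    pred_c = ndimage.center_of_mass(pred, pred_lab, list(range(1, n_pred + 1)))
    gt_c = ndimage.center_of_mass(gt, gt_lab, list(range(1, n_gt + 1)))
    # A ground-truth target counts as detected if any predicted centroid
    # lies within `dist` pixels of its centroid.
    hits = sum(any(np.hypot(pc[0] - gc[0], pc[1] - gc[1]) <= dist for pc in pred_c)
               for gc in gt_c)
    pd = hits / max(n_gt, 1)               # Pd = T_true / T_gt
    fa = np.sum(pred & ~gt) / gt.size      # Fa = P_false / P_image
    return pd, fa

taus = np.arange(0, 255 + 1e-9, 255 / 50)  # threshold sweep for the 3D-ROC
```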

4.4. Contrast Methods and Parameter Setting

To assess the detection performance of the proposed framework in different scenarios, comparison experiments were conducted with methods based on different principles. The conventional methods include background prediction-based, local feature-based, and sparse matrix-based approaches: Top-Hat [26], Local Intensity and Gradient properties (LIG) [31], Average Absolute Gray Difference (AAGD) [32], Multiscale Tri-layer Local Contrast Measure (TLLCM) [9], Non-convex Rank Approximation Minimization (NRAM) [33], and Partial Sum of the Tensor Nuclear Norm (PSTNN) [15]. In recent years, many deep learning-based infrared small target detection methods have been proposed, such as the Attentional Local Contrast Network (ALCNet) [16], Dense Nested Attention Network (DNANet) [35], and Attention-Guided Pyramid Context Network (AGPCNet) [36]. The parameters of the traditional methods were set according to the values suggested in their papers, as listed in Table 1. For the deep learning-based methods, the optimal models from the original papers were adopted as the detection network models.

4.5. Contrast Experiment Results

4.5.1. Qualitative Comparison

In this section, many infrared images were used to evaluate the performance of the comparison algorithms. Among these data, several images with various backgrounds were selected as representative images to display the results. Figure 5 illustrates these original images, where targets are marked by red boxes.
In Figure 5, images (1), (2), and (5) consist of natural backgrounds (trees, mountains, etc.) and weak targets. Images (3), (4), and (8) are composed of irregular clouds. Building noise makes target detection extremely challenging, as illustrated in images (6) and (7). All of the above images contain bright and complex backgrounds whose pixel values are higher than the target, so the small target is easily overwhelmed. Figure 6 demonstrates the 3D results of the comparative methods; (a1)–(a8) represent the 3D surfaces corresponding to Figure 5(1)–(8). The background of each image is complicated, and some targets are so weak that they are barely noticeable. To conveniently compare the detection performance of the various methods, the results of all algorithms were normalized to the range 0–1. Moreover, small targets are marked by red boxes, while blue circles indicate the clutter in the results.
Figure 6h is the 3D representation of the ground truth. Intuitively, TopHat has a weak ability to detect small targets: because it morphologically filters backgrounds in local regions, a large amount of background clutter remains in the results, even drowning out targets. Local feature-based methods such as LIG, AAGD, and TLLCM are visually better than TopHat. In relatively smooth backgrounds, LIG and AAGD are able to detect small targets, although background noise remains in the results; in contrast, they achieve poor detection performance on complicated backgrounds with building edges, complex ground, strong cloud clutter, etc. TLLCM separates small targets from the background well, except for the scenes in Figure 5(2), (3) and (6). In particular, when TLLCM processes multi-target images, the second target is severely weakened, as shown in Figure 6(k3),(k6). NRAM and PSTNN convert the problem of target detection into one of sparse matrix separation; nevertheless, in Figure 6(f1–f7),(g1–g7), background clutter is separated as targets with a high false alarm rate, and both even fail on Figure 5(4). ALCNet, DNANet, and AGPCNet are state-of-the-art deep learning-based methods with higher accuracy and lower false alarm rates than the traditional methods. Unfortunately, the results of ALCNet contain considerable noise on complex backgrounds, and in the scenarios of Figure 5(3) and (4) it cannot even successfully detect the targets. DNANet misses the real target and detects the background as the target in Figure 5(4). In the multi-target cases of Figure 5(3), (4), and (8), all three deep learning-based methods suffer from detection misses. Our method performs local background prediction on the target candidate regions generated in the coarse detection module to detect dim and small targets. In comparison, as demonstrated in Figure 6(n1),(n4),(n6–n8), our proposed algorithm achieves better target detection, although the results contain individual clutter in some scenarios.

4.5.2. Quantitative Comparison

In this part, 3D-ROC curves and F1-measure-threshold curves are compared to quantify the performance of the different algorithms. The 3D-ROC curves and three 2D-ROC curves on the three datasets are demonstrated in Figure 7, Figure 8 and Figure 9; in each figure, panels (b)–(d) show the corresponding 2D-ROC curves. The 3D-ROC curve indicates the relationship between the detection probability (Pd), false alarm rate (Fa), and threshold (τ). In this paper, the detection probability and false alarm rate, which suit small infrared targets, were chosen to robustly evaluate detection performance. The detection probability is the ratio of the number of correctly detected targets to the number of real targets; the false alarm rate is the ratio of the number of incorrectly detected pixels to the total number of pixels in the image. The ROC is a metric that exhibits the potency of a method over different thresholds. In Figure 7 and Figure 8, the ROC of (Pd, Fa) shows that the Pd of the proposed method increases rapidly with Fa. The Pd reaches a maximum of about 87% at Fa = 7.5 × 10⁻⁵, as displayed in Figure 7b. The proposed method clearly outperforms the other comparison algorithms but is lower than DNANet at roughly Fa > 7 × 10⁻⁵. In Figure 8b, our method rapidly converges to the highest detection probability when Fa < 2 × 10⁻⁵, even reaching 95%. In the ROC of (Pd, τ) of Figure 7 and Figure 8, our algorithm generally surpasses the other algorithms. However, the method in this paper is not outstanding in Figure 7d and Figure 8d. Unfortunately, as demonstrated in Figure 9b, although the proposed algorithm stabilizes faster, its detection probability is lower than that of ALCNet, AGPCNet, and DNANet. Figure 9c,d indicate that our algorithm is superior to the traditional approaches, whereas the deep learning-based methods perform better than the proposed framework.
The F1-measure-threshold curves of the contrast methods on the three public datasets are shown in Figure 10, where the vertical and horizontal axes are the F1-measure (F1) and the threshold, respectively. F1 is an evaluation metric that balances the precision rate and recall rate. The F1 curves of our framework on the IRSTD-1k and NUDT-SIRST datasets are better than those of the other comparison algorithms, as illustrated in Figure 10i,ii. Nonetheless, Figure 10iii indicates that the F1 of ALCNet, AGPCNet, DNANet, and PSTNN is higher than that of our framework on the SIRST dataset when the threshold is in the range of 0 to 0.15. As the threshold gradually increases, the F1 of our algorithm becomes higher than that of the other algorithms, second only to LIG.
To further compare the detection capability of each algorithm, the data in Table 2 were obtained by selecting fixed thresholds from Figure 10. The Prec, Rec, and F-measure in Table 2 were calculated by the method described in [34], as displayed in Equations (2)-(4); here, the threshold was set to 0.15. Prec reflects the proportion of correct results among the results detected as targets, as shown in Equation (2). Rec denotes the ratio of the number of true targets to the total number of targets, as defined in Equation (3). The F1-measure balances the precision rate and recall rate; ideally, Prec, Rec, and the F1-measure are all 1. In the table, the largest value is bolded, while the second largest value is marked with an underscore. For IRSTD-1k and NUDT-SIRST, the Rec and F1-measure of our proposed method are both better than those of the other methods, but its precision rate is worse than that of DNANet. Unfortunately, our framework is less advantageous on the SIRST dataset, where most comparison methods achieve metrics similar to those of the proposed algorithm, as demonstrated in Table 2. Though our detection framework is flawed in some respects, its capability to detect dim and small targets is an improvement over most algorithms. In conclusion, the quantitative evaluation shows that our framework performs better than some existing methods. Nevertheless, the method still detected a number of false alarms when small targets on complex backgrounds were detected.

4.5.3. Ablation Experiments

In this section, ablation experiments were designed to verify the rationality of each module and the settings of key operations. In the proposed framework, the coarse detection module roughly detects small infrared targets; the candidate targets generated by this module are marked by the mask. The original image and the mask are fed to the fine detection module to predict the background at the candidate targets. Finally, the target is obtained by subtracting the predicted background from the original image. To prove the effectiveness of the framework, ablation experiments were designed to verify the influence of the threshold segmentation in the coarse detection module on the experimental results and the influence of the fine detection module on the coarse detection results. For convenience, the algorithm without threshold segmentation in the coarse detection module, the coarse detection module alone, and the complete algorithm are referred to as Ours-nTS, CDM, and Ours, respectively. To evaluate the performance of the modules in the framework, the 3D results of Ours-nTS, CDM, and Ours are indicated in Figure 6: Figure 6(l–n) illustrate the results of Ours-nTS, CDM, and Ours, respectively. The ROC and F1-threshold curves of Ours-nTS, CDM, and Ours are exhibited in Figure 7, Figure 8, Figure 9 and Figure 10.
Comparison of the detection results before and after adding threshold segmentation in the coarse detection module. Ours-nTS detects a clutter band around the target, as shown in Figure 6(l3)–(l8). In the scenarios of Figure 5(1),(8), Ours-nTS detects the background clutter, whereas the results of Ours effectively enhance the target and suppress the background clutter. The results of Ours also have finer targets than those of Ours-nTS. Unfortunately, as Figure 5(2),(5) demonstrate, the results of Ours retain background clutter similar to Ours-nTS. As Figure 7 and Figure 8 show, the ROC curves of (Pd, Fa) and (Pd, τ) of Ours both lie above those of Ours-nTS. In Table 2, on the IRSTD-1k and NUDT-SIRST datasets, the Rec and F1-measure of Ours are superior to those of Ours-nTS, although Ours is slightly lower than Ours-nTS in precision rate. On the SIRST dataset, the Prec, Rec, and F1-measure of Ours are all higher than those of Ours-nTS. In conclusion, compared to Ours-nTS, Ours achieves superior detection.
Comparison of the results of CDM and the final results. CDM provides approximate ranges of candidate targets that are not sufficient as a final detection result, because they often contain background clutter in addition to the target, as displayed in Figure 6(m1),(m3),(m5) and (m7). As illustrated in Figure 6, the results of Ours are finer than those of the coarse detection module. The ROC curves of (Pd, Fa), (Pd, τ), and (Fa, τ) of Ours lie above those of CDM, as demonstrated in Figure 7, Figure 8 and Figure 9. Table 2 shows that the Prec, Rec, and F1-measure of CDM cannot all outperform those of Ours on any of the three datasets.
Overall, in some scenarios, the detection results after adding threshold segmentation are superior to the detection results without threshold segmentation. The target is enhanced and the background clutter is filtered out in the results of Ours. However, in some backgrounds, the enhancement of detection results by threshold segmentation is not very obvious due to the extremely weak target and the complexity and variability of the background. The fine detection module has the effect of optimizing the target detection accuracy for the results of coarse detection.

5. Discussion

As mentioned earlier, we proposed a coarse-to-fine infrared dim and small target detection method, which detects small targets by estimating the background of infrared images. In this section, the roles of the two detection stages in infrared dim and small target detection are discussed based on the results.
The proposed method is composed of a coarse detection module and a fine detection module. On the one hand, the coarse detection module determines the background outside the candidate target regions by detecting the candidate targets; on the other hand, it generates the mask for the next stage. The fine detection module estimates the background inside the candidate targets from the background outside these areas through an image inpainting algorithm. The experimental results on the three datasets demonstrate that the framework achieves better accuracy than the traditional methods on some images. As Figure 6(n1)–(n3),(n5)–(n7) indicate, background clutter (noise) is also detected in the results; however, the value of the clutter is negligible in some results, such as (n1), (n3), and (n7). As can be seen in Table 2, the Prec, Rec, and F1 on the NUDT-SIRST dataset are 71.41%, 76.82%, and 69.86%, respectively, which can meet the detection requirements of many infrared scenes. The ROC curves and F1-measure-threshold curves on the IRSTD-1k and NUDT-SIRST datasets indicate that the framework outperforms most methods.
In this paper, to conveniently illustrate the principle, two prior assumptions were made in Section 1: (1) because the pixel value of the target is higher than that of the background, the background in the small target neighborhood can be estimated from the remaining image after eliminating the targets; (2) since the pixel structures of background clutter (noise) differ from those of the target, the background at candidate targets (building edges, highlighting noise, etc.) can be predicted and is similar to that of the original image. The coarse detection module obtains the target boxes from deep features, within which the candidate targets are generated by simple threshold segmentation. These candidate targets (real targets, suspicious targets, etc.) are fed to the next stage in the form of masks. In this stage, the mask is directly related to the performance of the detection result. As demonstrated in Figure 6(n5),(n6), the results contain a number of obvious background noises. The experiments adequately demonstrate the ability to generate masks; following the assumptions, mask generation is the key to the whole algorithm. However, mask generation still has weaknesses due to the limitations of the training datasets and the networks' neglect of small-target features. In the fine detection module, the background within these areas is predicted from the infrared image outside the candidate targets. Ideally, the estimated background within non-target regions does not differ from the infrared image. Unfortunately, the results in Figure 6 indicate that the algorithm has minor flaws in the prediction of non-target regions, and in some cases the target is even missed, as shown in Figure 6(n2). Therefore, subsequent research is still needed to optimize the background generation to improve target detection performance. Although the proposed algorithm detects some background clutter, the experimental results demonstrate that its detection ability is stronger than that of many traditional methods and some deep learning-based methods.

6. Conclusions

Generally, we proposed a coarse-to-fine deep learning-based framework for infrared dim and small target detection. In the coarse detection stage, an RPN was utilized to generate target candidate areas, where the mask labeling the candidate targets was obtained by threshold segmentation. Inpainting, as a technology for predicting the background, was employed to restore the background of images from global semantic information within these masks, which further optimizes the target detection performance of the coarse detection. In the experiments, the proposed method was compared with other existing methods on several publicly available datasets, verifying that it outperforms the currently existing methods in both subjective visual quality and objective quantitative measurements. However, since the dataset for training the image inpainting algorithm consists mostly of sky cloud backgrounds, the proposed method lacks robustness for infrared images with extremely complex backgrounds. The code and pre-trained model of the proposed framework will be made available at https://github.com/mjkcv/Infrared-Small-Target-Detection.

Author Contributions

J.M. and S.R. proposed the original idea. B.H. and J.F. provided the spaces and equipment. J.M. and H.G. collected data on infrared dim and small targets and performed the experiments. J.M. wrote the manuscript. J.M. and H.G. reviewed and edited the manuscript. S.R. contributed to the direction and content, and revised the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant No. 62001443, the Natural Science Foundation of Shandong Province under Grant No. ZR2020QE294, the Natural Science Foundation of Jiangsu Province under Grant No. BK20210064, and the Wuxi Innovation and Entrepreneurship Fund "Taihu Light" Science and Technology (Fundamental Research) Project under Grant No. K20221046.

Data Availability Statement

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Rawat, S.S.; Alghamdi, S.; Kumar, G.; Alotaibi, Y.; Khalaf, O.I.; Verma, L.P. Infrared small target detection based on partial sum minimization and total variation. Mathematics 2022, 10, 671.
  2. Zhang, W.; Cong, M.; Wang, L. Algorithms for optical weak small targets detection and tracking. In Proceedings of the International Conference on Neural Networks and Signal Processing, Nanjing, China, 14–17 December 2003; Volume 1, pp. 643–647.
  3. Jiao, J.; Lingda, W. Infrared dim small target detection method based on background prediction and high-order statistics. In Proceedings of the International Conference on Image, Vision and Computing (ICIVC), Chengdu, China, 2–4 June 2017; pp. 53–57.
  4. Zhou, A.; Xie, W.; Pei, J. Background Modeling in the Fourier Domain for Maritime Infrared Target Detection. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 2634–2649.
  5. Hu, Z.; Su, Y. An infrared dim and small target image preprocessing algorithm based on improved bilateral filtering. In Proceedings of the International Conference on Computer, Blockchain and Financial Development (CBFD), Nanjing, China, 23–25 April 2021; pp. 74–77.
  6. Chen, C.L.P.; Li, H.; Wei, Y.; Xia, T.; Tang, Y.Y. A Local Contrast Method for Small Infrared Target Detection. IEEE Trans. Geosci. Remote Sens. 2014, 52, 574–581.
  7. Han, J.; Ma, Y.; Zhou, B.; Fan, F.; Liang, K.; Fang, Y. A Robust Infrared Small Target Detection Algorithm Based on Human Visual System. IEEE Geosci. Remote Sens. Lett. 2014, 11, 2168–2172.
  8. Han, J.; Liang, K.; Zhou, B.; Zhu, X.; Zhao, J.; Zhao, L. Infrared Small Target Detection Utilizing the Multiscale Relative Local Contrast Measure. IEEE Geosci. Remote Sens. Lett. 2018, 15, 612–616.
  9. Han, J.; Moradi, S.; Faramarzi, I.; Liu, C.; Zhang, H.; Zhao, Q. A Local Contrast Method for Infrared Small-Target Detection Utilizing a Tri-Layer Window. IEEE Geosci. Remote Sens. Lett. 2020, 17, 1822–1826.
  10. Xiong, B.; Huang, X.; Wang, M. Local Gradient Field Feature Contrast Measure for Infrared Small Target Detection. IEEE Geosci. Remote Sens. Lett. 2021, 18, 553–557.
  11. Chen, X.; Xu, W.; Tao, S.; Gao, T.; Feng, Q.; Piao, Y. Total Variation Weighted Low-Rank Constraint for Infrared Dim Small Target Detection. Remote Sens. 2022, 14, 4615.
  12. Zhang, P.; Zhang, L.; Wang, X.; Shen, F.; Pu, T.; Fei, C. Edge and Corner Awareness-Based Spatial-Temporal Tensor Model for Infrared Small-Target Detection. IEEE Trans. Geosci. Remote Sens. 2020, 59, 10708–10724.
  13. Gao, C.; Meng, D.; Yang, Y.; Wang, Y.; Zhou, X.; Hauptmann, A.G. Infrared patch-image model for small target detection in a single image. IEEE Trans. Image Process. 2013, 22, 4996–5009.
  14. Dai, Y.; Wu, Y. Reweighted Infrared Patch-Tensor Model With Both Nonlocal and Local Priors for Single-Frame Small Target Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 3752–3767.
  15. Zhang, L.; Peng, Z. Infrared Small Target Detection Based on Partial Sum of the Tensor Nuclear Norm. Remote Sens. 2019, 11, 382.
  16. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Attentional Local Contrast Networks for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9813–9824.
  17. Wang, K.; Du, S.; Liu, C.; Cao, Z. Interior Attention-Aware Network for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13.
  18. Chen, G.; Wang, W.; Tan, S. IRSTFormer: A Hierarchical Vision Transformer for Infrared Small Target Detection. Remote Sens. 2022, 14, 3258.
  19. Hariharan, B.; Arbelaez, P.; Girshick, R.; Malik, J. Object Instance Segmentation and Fine-Grained Localization Using Hypercolumns. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 627–639.
  20. Hou, Q.; Wang, Z.; Tan, F.; Zhao, Y.; Zheng, H.; Zhang, W. RISTDnet: Robust Infrared Small Target Detection Network. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5.
  21. Liu, H.; Wan, Z.; Huang, W.; Song, Y.; Han, X.; Liao, J. PD-GAN: Probabilistic diverse GAN for image inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9371–9381.
  22. Lin, Y.; Xie, H. Face gender recognition based on face recognition feature vectors. In Proceedings of the IEEE International Conference on Information Systems and Computer Aided Education (ICISCAE), Dalian, China, 27–29 September 2020; pp. 162–166.
  23. Wu, H.; Zhou, J.; Li, Y. Deep generative model for image inpainting with local binary pattern learning and spatial attention. IEEE Trans. Multimed. 2021, 24, 4016–4027.
  24. Zhu, M.; He, D.; Li, X.; Li, C.; Li, F.; Liu, X.; Ding, E.; Zhang, Z. Image inpainting by end-to-end cascaded refinement with mask awareness. IEEE Trans. Image Process. 2021, 30, 4855–4866.
  25. Deshpande, S.D.; Er, M.H.; Venkateswarlu, R.; Chan, P. Max-mean and max-median filters for detection of small targets. In Proceedings of the Signal and Data Processing of Small Targets, Denver, CO, USA, 20–22 July 1999; Volume 3809, pp. 74–83.
  26. Bai, X.; Zhou, F. Analysis of new top-hat transformation and the application for infrared dim small target detection. Pattern Recognit. 2010, 43, 2145–2156.
  27. Wei, Y.; You, X.; Li, H. Multiscale patch-based contrast measure for small infrared target detection. Pattern Recognit. 2016, 58, 216–226.
  28. Han, J.; Moradi, S.; Faramarzi, I.; Zhang, H.; Zhao, Q.; Zhang, X.; Li, N. Infrared Small Target Detection Based on the Weighted Strengthened Local Contrast Measure. IEEE Geosci. Remote Sens. Lett. 2020, 18, 1670–1674.
  29. Lu, R.; Yang, X.; Li, W.; Fan, J.; Li, D.; Jing, X. Robust Infrared Small Target Detection via Multidirectional Derivative-Based Weighted Contrast Measure. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5.
  30. Li, Z.; Liao, S.; Zhao, T. Infrared Dim and Small Target Detection Based on Strengthened Robust Local Contrast Measure. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5.
  31. Zhang, H.; Zhang, L.; Yuan, D.; Chen, H. Infrared small target detection based on local intensity and gradient properties. Infrared Phys. Technol. 2018, 89, 88–96.
  32. Aghaziyarati, S.; Moradi, S.; Talebi, H. Small infrared target detection using absolute average difference weighted by cumulative directional derivatives. Infrared Phys. Technol. 2019, 101, 78–87.
  33. Zhang, L.; Peng, L.; Zhang, T.; Cao, S.; Peng, Z. Infrared Small Target Detection via Non-Convex Rank Approximation Minimization Joint l2,1 Norm. Remote Sens. 2018, 10, 1821.
  34. Wang, H.; Zhou, L.; Wang, L. Miss detection vs. false alarm: Adversarial learning for small object segmentation in infrared images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8509–8518.
  35. Li, B.; Xiao, C.; Wang, L.; Wang, Y.; Lin, Z.; Li, M.; An, W.; Guo, Y. Dense Nested Attention Network for Infrared Small Target Detection. IEEE Trans. Image Process. 2022, 32, 1745–1758.
  36. Zhang, T.; Li, L.; Cao, S.; Pu, T.; Peng, Z. Attention-Guided Pyramid Context Networks for Detecting Infrared Small Target Under Complex Background. IEEE Trans. Aerosp. Electron. Syst. 2023, 1–13.
  37. Chen, F.; Gao, C.; Liu, F.; Zhao, Y.; Zhou, Y.; Meng, D.; Zuo, W. Local Patch Network with Global Attention for Infrared Small Target Detection. IEEE Trans. Aerosp. Electron. Syst. 2022, 58, 3979–3991.
  38. Wang, N.; Zhang, Y.; Zhang, L. Dynamic selection network for image inpainting. IEEE Trans. Image Process. 2021, 30, 1784–1798.
  39. Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2536–2544. [Google Scholar]
  40. Yang, C.; Lu, X.; Lin, Z.; Shechtman, E.; Wang, O.; Li, H. High-resolution image inpainting using multi-scale neural patch synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6721–6729. [Google Scholar]
  41. Liu, G.; Reda, F.A.; Shih, K.J.; Wang, T.C.; Tao, A.; Catanzaro, B. Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 85–100. [Google Scholar]
  42. Liu, H.; Wan, Z.; Huang, W.; Song, Y.; Han, X.; Liao, J.; Jiang, B.; Liu, W. Deflocnet: Deep image editing via flexible low-level controls. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10765–10774. [Google Scholar]
  43. Wu, H.; Zhou, J. IID-Net: Image inpainting detection network via neural architecture search and attention. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 1172–1185. [Google Scholar] [CrossRef]
  44. Cai, W.; Wei, Z. PiiGAN: Generative adversarial networks for pluralistic image inpainting. IEEE Access 2020, 8, 48451–48463. [Google Scholar] [CrossRef]
  45. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the Advances in Neural Information Processing Systems; Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2015; Volume 28. [Google Scholar]
  46. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Asymmetric contextual modulation for infrared small target detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 950–959. [Google Scholar]
  47. Zhang, M.; Zhang, R.; Yang, Y.; Bai, H.; Zhang, J.; Guo, J. ISNet: Shape Matters for Infrared Small Target Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 877–886. [Google Scholar]
  48. Hui, B.; Song, Z.; Fan, H.; Zhong, P.; Hu, W.; Zhang, X.; Ling, J.; Su, H.; Jin, W.; Zhang, Y. A dataset for infrared detection and tracking of dim-small aircraft targets under ground/air background. China Sci. Data 2020, 5, 291–302. [Google Scholar]
Figure 1. The architecture of the proposed coarse-to-fine framework based on deep learning. I denotes the input IR image, and result is the framework's detection output for infrared small targets. The coarse detection module and the fine detection module are the two parts of the framework. The local region (I_p) denotes the candidate target regions, and the mask covering the candidate targets is computed by the coarse detection module. I_m denotes the original image marked by the masks; I_rd is the repaired image generated by the inpainting algorithm from the features of I_m. In the image fusion module, the result is computed as the difference between I and the repaired image I_rd.
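To make the data flow of Figure 1 concrete, the following Python sketch strings the two modules together. It is illustrative only: coarse_detector, inpainter, and the fixed threshold are placeholders for the networks described in the paper, not the authors' released code.

```python
import numpy as np

def detect_small_targets(I, coarse_detector, inpainter, threshold=0.5):
    """Illustrative coarse-to-fine pipeline following Figure 1.

    I : float32 infrared image in [0, 1], shape (H, W).
    coarse_detector, inpainter : placeholder callables standing in for
    the RPN-based coarse module and the inpainting-based fine module.
    """
    # Coarse stage: propose candidate target regions as a binary mask.
    mask = coarse_detector(I)                  # (H, W), 1 = candidate pixel

    # I_m: the input image with candidate regions masked out.
    I_m = I * (1 - mask)

    # Fine stage: predict the background I_rd under the mask from the
    # global semantics of the image.
    I_rd = inpainter(I_m, mask)                # (H, W)

    # Image fusion: the residual between the input and the predicted
    # background highlights true targets; thresholding gives the result.
    residual = np.clip(I - I_rd, 0.0, None)
    return (residual > threshold).astype(np.uint8)
```

The subtraction-based fusion step is why accurate background prediction directly suppresses false alarms: any candidate pixel the inpainter can explain as background cancels out of the residual.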
Figure 2. The architecture of the coarse detection module with the Region Proposal Network (RPN).
Figure 3. The architecture of the fine detection module. E, D, and Refine_D denote the encoder layer, the recovery decoder layer, and the refinement decoder layer, respectively. output1, output2, and the repaired image are the three outputs of the decoders; the repaired image is the final inpainting result.
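A minimal PyTorch stand-in for this encoder-decoder layout is sketched below. It keeps the E / D / Refine_D structure and the mask-compositing step, but replaces the paper's mask-aware dynamic filtering module with plain convolutions, and all layer widths and depths are illustrative assumptions rather than the authors' configuration.

```python
import torch
import torch.nn as nn

class FineDetectionSketch(nn.Module):
    """Simplified stand-in for the inpainting network of Figure 3."""

    def __init__(self, ch=32):
        super().__init__()
        # E: the image and the mask are concatenated on the channel axis.
        self.encoder = nn.Sequential(
            nn.Conv2d(2, ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 2 * ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        # D: coarse recovery of the masked background (output1/output2
        # would be intermediate recoveries in the full architecture).
        self.recovery = nn.Sequential(
            nn.ConvTranspose2d(2 * ch, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(ch, 1, 4, stride=2, padding=1),
        )
        # Refine_D: refines the coarse recovery into the repaired image.
        self.refine = nn.Sequential(
            nn.Conv2d(2, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 1, 3, padding=1),
        )

    def forward(self, I_m, mask):
        # I_m, mask: (B, 1, H, W), with H and W divisible by 4.
        feat = self.encoder(torch.cat([I_m, mask], dim=1))
        coarse = self.recovery(feat)
        repaired = self.refine(torch.cat([coarse, mask], dim=1))
        # Keep known pixels; use the prediction only under the mask.
        return I_m * (1 - mask) + repaired * mask
```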
Figure 4. Part of the training dataset, which includes infrared images of various cloud formations.
Figure 5. Illustration of representative images containing a variety of backgrounds. (1), (2), and (5) are images from the IRSTD-1k dataset [47]; (3), (6), and (8) are images from the NUDT-SIRST dataset [18]; (4) and (7) are two samples from the SIRST dataset [46].
Figure 6. Illustration of the results of the different methods. (1–8) represent the results corresponding to the eight images in Figure 5. (a–n) denote the 3D result representations of the original image, Top-Hat, LIG, AAGD, TLLCM, NRAM, PSTNN, the ground truth, ALCNet, AGPCNet, DNANet, ours without the threshold segmentation of the coarse detection module (Ours-nTS), the coarse detection module alone (CDM), and the proposed method (Ours), respectively. Red boxes and blue circles denote targets and false alarms, respectively.
Figure 7. 3D ROC curves with the corresponding 2D ROC curves of the comparative methods on the IRSTD-1k dataset [47]. (a) 3D ROC curves. (b) ROC curves of (Pd, Fa). (c) ROC curves of (Pd, τ). (d) ROC curves of (Fa, τ).
Figure 8. 3D ROC curves with the corresponding 2D ROC curves of the comparative methods on the NUDT-SIRST dataset [18]. (a) 3D ROC curves. (b) ROC curves of (Pd, Fa). (c) ROC curves of (Pd, τ). (d) ROC curves of (Fa, τ).
Figure 9. 3D ROC curves with the corresponding 2D ROC curves of the comparative methods on the SIRST dataset [46]. (a) 3D ROC curves. (b) ROC curves of (Pd, Fa). (c) ROC curves of (Pd, τ). (d) ROC curves of (Fa, τ).
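For reference, the Pd and Fa axes in Figures 7–9 can be computed per threshold τ as in the sketch below. This is a simplified version under stated assumptions: a ground-truth target counts as detected if any of its pixels is hit (rather than the usual centroid-distance rule), and Fa is measured over all image pixels.

```python
import numpy as np
from scipy import ndimage

def pd_fa(score_map, gt_mask, tau):
    """Probability of detection (Pd) and false-alarm rate (Fa) at
    threshold tau:
    Pd = detected true targets / total true targets,
    Fa = false-alarm pixels / total image pixels."""
    pred = score_map > tau
    # Label connected components of the ground truth as targets.
    labels, n_targets = ndimage.label(gt_mask)
    detected = sum(pred[labels == i].any() for i in range(1, n_targets + 1))
    false_pixels = np.logical_and(pred, gt_mask == 0).sum()
    pd = detected / max(n_targets, 1)
    fa = false_pixels / pred.size
    return pd, fa
```

Sweeping tau over the score map's range traces out the 2D ROC curves; plotting (Pd, Fa) against τ jointly yields the 3D ROC curve.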
Figure 10. F1-measure versus threshold curves on the (i) IRSTD-1k dataset [47], (ii) NUDT-SIRST dataset [18], and (iii) SIRST dataset [46].
Table 1. The parameters of the conventional methods.

Method    Parameters
Top-Hat   structure shape: square, size: 5 × 5
LIG       window size: 11 × 11, k = 0.2
AAGD      internal window scales: [3, 5, 7, 9]; external window size: 19 × 19
TLLCM     Gaussian kernel size: 3 × 3, scales: [3, 5, 7, 9]
NRAM      patch size: 50 × 50, sliding step: 10, λ = 1/√min(m, n)
PSTNN     patch size: 40 × 40, sliding step: 40, λ = 0.7
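As an example of how such a baseline is run, the Top-Hat row of Table 1 corresponds to the following OpenCV snippet; the input path and the mean-plus-k-sigma segmentation rule are our own placeholder choices, not part of the table.

```python
import cv2
import numpy as np

# Top-Hat baseline with the Table 1 settings: a 5 x 5 square
# structuring element. "infrared.png" is a placeholder path.
I = cv2.imread("infrared.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
tophat = cv2.morphologyEx(I, cv2.MORPH_TOPHAT, kernel)

# Illustrative adaptive threshold (mean + k * std) to segment candidates.
k = 4.0
detection = (tophat > tophat.mean() + k * tophat.std()).astype(np.uint8)
```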
Table 2. The average Prec, Rec, and F1-measure on the IRSTD-1k, NUDT-SIRST, and SIRST datasets.

          IRSTD-1k                      NUDT-SIRST                    SIRST
Method    Prec (%)  Rec (%)  F1 (%)     Prec (%)  Rec (%)  F1 (%)     Prec (%)  Rec (%)  F1 (%)
Top-Hat   42.19     52.66    34.53      41.04     18.16    16.18      65.83     29.63    34.88
LIG       53.43     59.41    47.03      30.63     46.27    29.43      85.80     66.06    69.58
AAGD      25.63     56.27    25.31      1.83      25.94    2.69       61.05     66.14    53.56
NRAM      58.99     30.11    34.35      38.22     8.43     12.27      87.05     37.88    50.10
TLLCM     60.70     56.21    51.63      39.62     63.70    41.44      74.20     23.27    32.55
PSTNN     45.52     59.17    44.82      24.83     36.97    25.16      84.84     61.70    67.70
ALCNet    60.85     38.59    44.37      15.67     4.10     6.00       87.56     55.08    65.18
AGPCNet   55.37     50.58    49.85      28.56     11.63    14.88      83.03     66.59    70.91
DNANet    81.08     42.41    52.99      86.59     22.66    34.59      89.17     43.93    57.02
Ours-nTS  69.85     52.33    54.16      74.59     17.26    24.80      84.33     46.91    56.50
CDM       59.87     53.96    50.74      60.32     71.69    60.90      77.05     52.35    57.92
Ours      67.16     65.83    61.20      71.41     76.82    69.86      85.67     54.86    63.32
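The pixel-level metrics in Table 2 follow the standard definitions Prec = TP/(TP + FP), Rec = TP/(TP + FN), and F1 = 2·Prec·Rec/(Prec + Rec). A minimal helper is sketched below; dataset-level averaging, and the threshold sweep behind the F1 curves of Figure 10, are assumed to happen outside it.

```python
import numpy as np

def precision_recall_f1(pred, gt):
    """Pixel-level Prec, Rec, and F1 for a binary detection mask
    against the binary ground truth, as reported in Table 2."""
    tp = np.logical_and(pred == 1, gt == 1).sum()
    fp = np.logical_and(pred == 1, gt == 0).sum()
    fn = np.logical_and(pred == 0, gt == 1).sum()
    prec = tp / max(tp + fp, 1)
    rec = tp / max(tp + fn, 1)
    f1 = 2 * prec * rec / max(prec + rec, 1e-12)
    return prec, rec, f1
```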
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
