IRSDet: Infrared Small-Object Detection Network Based on Sparse-Skip Connection and Guide Maps

Detecting small objects in infrared images remains a challenge because most of them lack shape and texture. In this study, we propose an infrared small-object detection method to improve the capacity for detecting thermal objects in complex scenarios. First, a sparse-skip connection block is proposed to enhance the response of small infrared objects and suppress the background response. This block is used to construct the detection model backbone. Second, a region attention module is designed to emphasize the features of infrared small objects and suppress background regions. Finally, a batch-averaged biased classification loss function is designed to improve the accuracy of the detection model. The experimental results show that the proposed small-object detection framework significantly increases precision, recall, and F1 score and that, compared with current advanced detection models for small-object detection, it performs better in infrared small-object detection under complex backgrounds. The insights gained from this study may provide new ideas for infrared small-object detection and tracking.


Introduction
With the development of infrared image-sensor technology, infrared spectral imaging technology has provided new information for object-detection tasks [1,2]. Currently, the object detection method based on infrared images is one of the best methods for detecting remote thermal objects because the infrared features of objects are more noticeable than their visible features [3]. In remote detection tasks, most infrared objects are considered small objects because they occupy few pixels and have a low signal-to-clutter ratio (SCR), unclear contours, and sparse texture features. Because of these characteristics, infrared small-object detection remains a significant challenge.
Convolutional neural networks (CNNs) provide a broader perspective on object detection. Compared to traditional methods, CNN-based object detection methods can adaptively learn object locations and semantic information in sample images, resulting in higher accuracy and robustness. Object detection models based on CNNs include two-stage and one-stage models. The former, such as RCNN [4], divide positioning and classification into two steps, and the resulting slow inference speed makes them unsuitable for real-time detection. The latter, such as YOLO [5] and SSD [6], have a fast inference speed and good accuracy. Some optimized CNN-based models have a good detection capacity for small objects. ResNet [7], DenseNet [8], and ResNeXt [9] propose shortcut connections that can transfer information by skipping one or more layers to address the degradation problem. This is helpful in reducing the feature loss of small objects during information transmission. In DetNet [10], downsampling blocks in deep layers are eliminated to preserve the resolution of high-level feature maps, which can improve the positioning accuracy of small objects. DSSD [11], RSSD [12], and FSSD [13] propose specific multiscale feature fusion methods to suppress the static noise in low-level feature maps. In RFBNet [14], multiple branches with different kernels and dilated convolution layers are concatenated to expand the receptive field and enhance the deep features of lightweight CNNs. Extensive studies have shown that the above methods can improve the accuracy of small-object detection but still do not achieve satisfactory results. One of the most important reasons is that these methods do not optimize the model structure specifically for the characteristics of small objects, such as size and texture.
Inspired by these small-object detection methods, many researchers have proposed detection models suited to small infrared objects. The optimization methods of these models can be categorized into spatial-temporal information fusion [15-17], residual/background information prediction [18,19], optimized region proposal [20,21], and multiscale information fusion [22-25]. The spatial-temporal information fusion method reduces static noise by combining adjacent frames in an infrared image sequence. The residual/background information prediction method is an indirect method that first predicts the background information and then subtracts it from the original image to obtain the object's position. In optimized region proposal methods, traditional or CNN-based methods are used to filter the potential regions of the object. Subsequently, a classifier is designed to process the potential region image.
In summary, these studies support the notion that there are many essential differences between visible and infrared objects, such as the number of image channels, the image signal-to-noise ratio, and the number of hard negative samples. Therefore, infrared small-object detection methods differ from visible small-object detection algorithms; the former focus on reducing false alarms, while the latter aim to reduce missed detections.
An infrared small-object detection framework called IRSDet is proposed to address these issues. The main contributions of this study are as follows.

1. A sparse-skip connection module is proposed to construct the backbone that can reduce the feature loss of infrared small objects in information transmission.

2. A feature map enhancement method based on the region attention mechanism is proposed to reduce background noise interference and emphasize the objects' potential region.

3. A batch-averaged biased classification loss method under limited memory usage is proposed to alleviate the drastic fluctuation of classification loss under the small-batch configuration and avoid the gradient explosion of the focal loss function in the initial training process.

Experimental results showed that the proposed method has high precision and recall. The insights gained from this study may provide new ideas for infrared small-object detection and tracking.

Small-Object Detection Methods Based on CNN
In recent years, the optimization methods of CNN-based small-object detection models have been divided into the following aspects:
Receptive field and attention mechanism: Sun et al. [26] proposed a mask-guided SSD. The method enhances features with contextual information and introduces segmentation masks to eliminate the background regions. However, segmentation masks, including the object region, require pre-labeling. FD-SSD [27] adopts deformable convolutional layers that can optimize the position of the receptive field to better adapt to the geometric and shape changes of small objects, but they increase the computational cost. Lim et al. [28] proposed FA-SSD, which uses a residual attention module and context information to enhance the feature representation of low-level feature maps, thus improving the accuracy of small-object detection.
Multiscale information fusion: Cui et al. [29] proposed a multiscale deconvolutional SSD (MDSSD). The method can simultaneously upsample high-level feature maps of different layers and fuse them with non-adjacent low-level feature maps to form a clearer feature for small objects. Zhai et al. [30] proposed DF-SSD, which uses a backbone network based on dense connections. To enhance the representation of features in the output image, adjacent feature maps are fused to supplement semantic information and details. Although dense connections can suppress feature loss, they retain a large amount of static background noise, which is a severe problem for infrared small-object detection. Pan et al. [31] proposed a top-down feature fusion module that iteratively fuses high-level features containing semantic information with low-level features containing boundary information.
Additionally, data augmentation methods were considered to preprocess small-object samples to improve the training effect of the detection models. Kisantal et al. [32] proposed a sample replication method to increase the number of small objects in each image to address the issue of a small number of positively matched anchors. Bai et al. [33] proposed a super-resolution small-object generation method, SOD-MTGAN, which can upsample small, blurred objects to recover more details.

Infrared Small-Object Detection Methods Based on CNN
Spatial-temporal information fusion: Park et al. [15] proposed an infrared small-object detection method for pedestrian image sequences that manually introduces spatial-temporal information and potential object regions. To avoid position errors caused by residual and mask images, adjacent similar pixels are merged into a single object using the connected-component algorithm. To eliminate the influence of static noise in an infrared image sequence on the detection of small infrared objects, Yao et al. [16] proposed an optimized FCOS network model that uses traditional filtering methods and spatial-temporal feature fusion to preprocess sample images. Du et al. [17] proposed an interframe energy accumulation (IFEA) enhancement mechanism to effectively extract spatial-temporal information in the infrared sequence. The method is specially designed to suppress strong spatially nonstationary clutter, enhance the object, and improve accuracy.
Residual/background information: Shi et al. [18] proposed a convolutional and denoising autoencoder network (CDAE) that uses residual images as output images. Additionally, perceptual loss is employed to solve the problem of background texture feature loss in the encoding process, and structural loss is proposed to compensate for the defect of the perceptual loss in regions where small objects appear. This method was supported by Fang et al. [19], who stated that too many details are lost during the pooling operation in the downsampling of the encoding process; thus, it is difficult to reconstruct the high-frequency details well in the decoding stage. To address this issue, they proposed a multiscale U-Net. The constructed image-to-image network integrates the global and local dilated residual convolution blocks into the U-Net, predicting the residual information between the input and output images for small infrared UAV object detection.
Optimized region proposal: Fan et al. [20] proposed an infrared small-object detection method based on region proposal and a CNN module to separate real objects from the background and significantly reduce the false alarm rate caused by complex background clutter. First, the small-object intensity is enhanced based on the local intensity characteristics. Potential object regions are then proposed by corner detection to ensure a high detection rate. The approach used by Ren et al. [21] is similar. They designed a simply structured region context network (RCN) to extract possible regions. Then, an optimized GAN is used to process the region images to generate super-resolution results with more detailed features.

Multiscale information fusion: The downsampling of CNN-based models may cause information loss, decreasing the accuracy of detecting small infrared objects. Ding et al. [22] and Du et al. [23] used high-resolution low-level images as feature maps to address this issue. Moreover, multilevel feature-fusion methods are used to suppress false alarms in low-level feature maps. Ju et al. [24] went one step further. They used an hourglass image-filtering module to obtain a fusion image to substitute the original input image, aiming to enhance the response of small infrared objects and suppress the background response. However, this module directly processes the original image without distinguishing background noise from objects. Hou et al. [25] adopted a more efficient method for replacing the input image with a fusion image. The framework in their research used parallel convolutional layers to extract the contrast information of small objects and neighborhoods. The kernels of the parallel convolutional layers have different sizes for extracting different-scale spatial information.
Training strategies: Some studies have optimized training strategies for small-object detection. For instance, Du et al. [23] specially designed an IoU threshold and anchor size for small objects. Bai et al. [34] proposed a regular constraint loss (RCL) to restrict multiscale feature fusion learning and obtain more accurate object location information.

Proposed Method
Small infrared objects exhibit three characteristics. First, the features of small objects are easily lost during information transmission. Second, limited by the performance of current infrared sensors, the signal-to-noise ratio of infrared images is low, and there are numerous false alarms. Finally, most infrared objects lack texture and detailed features.
This study, therefore, proposes an infrared small-object detection model (IRSDet) to address these issues. The structure of IRSDet is shown in Figure 1. The SSD is used as the detection head in the proposed method. In this section, we first discuss the structure and function of the backbone and then describe a feature enhancement method based on the proposed guide block. Finally, we describe the batch-averaged biased classification loss function.
Small infrared objects are more challenging to spot than small visible objects. First, there is only one information channel in infrared images, and many false alarms have features similar to real objects. Second, although serial convolutional layers can enhance the information extraction of the surrounding receptive field, if the output of one of the serial convolutional layers does not sufficiently capture the features of the real objects, the subsequent convolutional layers cannot obtain accurate information. The bias accumulates over multiple convolutional layers, resulting in a severe loss of object features in deep layers.
A sparse residual block (SRes block) is proposed to address these issues. The SRes block is an alternative parallel structure that can transmit signals on parallel branches. The signals of the two branches are combined after each convolutional layer. The structure of an SRes block is shown in Figure 2. The n-th CBA × m can be defined as a function F_n(·), and its output is X_n. Therefore, the output of the SRes block is: The output of the typical residual block is: The output of the SRes block is related only to the outputs of the two adjacent convolutional layers. However, the output of typical residual blocks is related to all the previous layers. We consider that the excessive use of low-level information introduces considerable background noise and therefore reduces the detection accuracy.
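The contrast between the two kinds of skip connection can be written out as a short sketch. The residual recursion below is the standard one from ResNet [7]. The sparse-skip form is only an assumed reading of the statement that the output depends on the two adjacent convolutional layers; it is not taken from the source and may differ from the authors' equation.

```latex
% Typical residual block (standard ResNet recursion), unrolled over n layers:
X_n = F_n(X_{n-1}) + X_{n-1}
    = X_0 + \sum_{k=1}^{n} F_k(X_{k-1})

% Hypothetical sparse-skip form (assumption, for illustration only):
X_n = F_n(X_{n-1}) + F_{n-1}(X_{n-2})
```

In the unrolled residual form, every earlier output contributes to X_n, which is the route by which low-level background noise can reach deep layers. Under the assumed sparse form, only the two most recent layer outputs contribute.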

Adaptive Receptive Field Block
An adaptive receptive field (ARF) block is designed as the first convolutional layer to better adapt to the size change in the objects. The ARF block adopts parallel convolution kernels with different atrous rates (Figure 3a). These convolution kernels can collect features from regions of different sizes in an input image. They are then concatenated to create a large kernel with sparse receptive fields (Figure 3b). Thus, more features can be extracted to adapt to the geometric and shape changes of an object. We added a 1 × 1 kernel convolutional layer in the parallel stage to counteract the overlap in the center of the large convolution kernels. Convolutional layers with different atrous rates can adjust the weights of different regions.
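As a rough illustration of how the atrous rate changes the region that a kernel samples, the sketch below computes the spatial extent k + (k − 1)(r − 1) of a k × k kernel with atrous rate r, together with the offsets it samples along one axis. The rates 1, 2, and 3 and the helper names are hypothetical; the text does not state which rates the ARF block uses.

```python
def effective_kernel_size(kernel_size, atrous_rate):
    """Spatial extent covered by a dilated (atrous) convolution kernel."""
    return kernel_size + (kernel_size - 1) * (atrous_rate - 1)


def sampled_offsets(kernel_size, atrous_rate):
    """Offsets, relative to the kernel center, sampled along one axis."""
    half = kernel_size // 2
    return [atrous_rate * i for i in range(-half, half + 1)]


# Hypothetical atrous rates for three parallel 3 x 3 branches.
for rate in (1, 2, 3):
    print(rate, effective_kernel_size(3, rate), sampled_offsets(3, rate))
```

Every branch samples offset 0, so the center of the combined sparse kernel is covered several times. This is the central overlap that the text says the 1 × 1 convolutional layer is meant to counteract.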
ReLU can suppress negative signal transmission, resulting in the feature loss of small objects.Therefore, an extra branch is appended to transmit negative signals.The positive and negative signals are concatenated and then transmitted to a 1 × 1 convolutional layer with two functions: (1) to optimize the internal weight allocation of the sparse large convolution kernel to increase its sensitivity to object features; and (2) to reduce the dimensions of the output maps and suppress less-valuable channels.
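The effect of the extra branch can be illustrated with a small sketch. It assumes that the negative branch rectifies the negated input, in the style of a concatenated ReLU; the text only says that negative signals are transmitted on an extra branch and concatenated, so this exact form is an assumption.

```python
def relu(values):
    """Standard ReLU: negative responses are set to zero."""
    return [max(v, 0.0) for v in values]


def signed_branches(values):
    """Concatenate the positive part with the rectified negative part.

    ASSUMPTION: the negative branch is relu(-x); the source does not
    specify how the extra branch is implemented.
    """
    positive = relu(values)
    negative = relu([-v for v in values])
    return positive + negative


signal = [0.8, -0.5, 0.0, -1.2]
print(relu(signal))             # the two negative responses are lost
print(signed_branches(signal))  # their magnitudes survive in the second half
```

The concatenated output has twice as many channels as the input, which is consistent with the 1 × 1 convolutional layer that follows being used to reduce the dimensions of the output maps.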

Extremum Pooling
Strided convolution and pooling methods are currently the most popular downsampling methods. However, with strided convolution and Avg-pooling, it is hard to avoid decreasing the contrast between small objects and their neighborhoods. Conversely, Max-pooling can adaptively select the maximum grayscale from the region and exhibits outstanding performance in transferring semantic information. However, a specific drawback associated with Max-pooling is that it might ignore the critical details of small objects in shallow convolutional layers. The results of the downsampling are shown in Figure 4a.
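The trade-off between the two pooling methods can be seen on a toy patch. The sketch below applies non-overlapping 2 × 2 Avg-pooling and Max-pooling to a hypothetical 4 × 4 grayscale patch containing one bright pixel and one dark pixel on a uniform background; the patch values are invented for illustration.

```python
def pool2x2(patch, reduce_fn):
    """Non-overlapping 2 x 2 pooling over a 2D list with even side lengths."""
    pooled = []
    for r in range(0, len(patch), 2):
        row = []
        for c in range(0, len(patch[0]), 2):
            window = [patch[r][c], patch[r][c + 1],
                      patch[r + 1][c], patch[r + 1][c + 1]]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled


def mean(window):
    return sum(window) / len(window)


# Toy patch: background 100, one bright pixel (200), one dark pixel (10).
patch = [
    [100, 100, 100, 100],
    [100, 200, 100, 100],
    [100, 100, 100,  10],
    [100, 100, 100, 100],
]

print(pool2x2(patch, mean))  # [[125.0, 100.0], [100.0, 77.5]]
print(pool2x2(patch, max))   # [[200, 100], [100, 100]]
```

Avg-pooling reduces the bright pixel's margin over the background from 100 to 25 gray levels, whereas Max-pooling keeps the bright pixel but discards the dark detail entirely.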
We propose an optimized downsampling method called extremum pooling (Ext-pooling) to address this issue. Ext-pooling (Figure 4b

Feature Enhancement Based on Attention Mechanism
We utilized low-level images as feature maps to improve the recall of small objects. However, low-level feature maps have undesirable noise because of the complex clutter backgrounds. Multilevel feature fusion methods were used in [22,23] to suppress background noise; however, computational costs were proportionally increased. To reduce the interference of false alarms and noise, this study proposes a region attention mechanism block, namely, the guide block.
The structure of the guide block is shown in Figure 5. Max-pooling and Avg-pooling were used to process the branch feature maps. The former was used to recover the contour of an object's potential region, which may be lost during information transmission, and the latter was used to suppress noise and smooth the image. The two processed images were then combined by multiplication. Finally, a CBA module was appended to eliminate redundant information from the image, thereby generating a guide map. Potential object regions have high weights in the guide map, whereas the background region has low weights.

The guide map is used to activate the corresponding original feature map by elementwise multiplication, aiming to enhance the response of infrared small objects and suppress the response of the background. The output image was processed through an additional 3 × 3 convolutional layer to adjust the grayscale information distribution. Note that the feature map of L3 has more background noise, making it difficult to generate an accurate guide map. To address this problem, we processed the feature map of the L3 layer using the guide map of its adjacent L4 layer. We adopted a bilinear interpolation algorithm and a 1 × 1 convolutional layer to adjust the feature map resolution and number of channels, respectively.
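The data flow of the guide block can be sketched on a single-channel toy feature map. This is a simplified illustration under stated assumptions, not the authors' implementation: the 3 × 3 stride-1 pooling window is hypothetical, and the learned CBA module is replaced by a plain rescaling to [0, 1] so that the example runs without a deep learning framework.

```python
def pool3x3(fmap, reduce_fn):
    """Stride-1 3 x 3 pooling; windows are clipped at the borders."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            window = [fmap[i][j]
                      for i in range(max(r - 1, 0), min(r + 2, h))
                      for j in range(max(c - 1, 0), min(c + 2, w))]
            row.append(reduce_fn(window))
        out.append(row)
    return out


def mean(window):
    return sum(window) / len(window)


def guide_map(fmap):
    """Multiply the max- and avg-pooled maps, then rescale to [0, 1].

    ASSUMPTION: the rescaling stands in for the learned CBA module.
    """
    max_map = pool3x3(fmap, max)
    avg_map = pool3x3(fmap, mean)
    product = [[m * a for m, a in zip(mr, ar)]
               for mr, ar in zip(max_map, avg_map)]
    peak = max(max(row) for row in product) or 1.0
    return [[v / peak for v in row] for row in product]


def apply_guide(fmap, guide):
    """Activate the feature map by elementwise multiplication."""
    return [[f * g for f, g in zip(fr, gr)] for fr, gr in zip(fmap, guide)]


# Toy 5 x 5 feature map: weak uniform background, one strong response.
fmap = [[0.1] * 5 for _ in range(5)]
fmap[2][2] = 0.9

guide = guide_map(fmap)
enhanced = apply_guide(fmap, guide)
print(fmap[2][2] / fmap[0][0], enhanced[2][2] / enhanced[0][0])
```

In this toy case the guide map is close to 1 around the object and small elsewhere, so the ratio between the object response and the background response grows after the multiplication.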

Batch-Averaged Biased Classification Loss
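The confidence loss of SSD, $L_{conf}$, is the softmax loss over multiple class confidences ($c$). It is defined as

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_{i}^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_{i}^{0}\right), \qquad \hat{c}_{i}^{p} = \frac{\exp\left(c_{i}^{p}\right)}{\sum_{p} \exp\left(c_{i}^{p}\right)}$$

where $x_{ij}^{p} = \{1, 0\}$ is the indicator for matching the $i$-th default box to the $j$-th ground-truth box of category $p$, and $N$ is the number of prior boxes matching ground-truth boxes. The ground truth is the category of each object in the image and its real bounding box.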
There are many hard samples in infrared small-object images, and how to distinguish them is a critical issue. To improve the capacity to detect hard samples, Lin et al. [35] proposed an adaptive weight classification loss called focal loss. The focal loss is defined as

$$L_{fl} = -(1 - p_t)^{\gamma} \log(p_t)$$

where $p_t \in [0, 1]$ is the model's estimated probability for the class with label $y = 1$, and $\gamma \geq 0$ is a tunable focusing parameter.
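A short numeric example shows how focal loss, which has the form −(1 − p_t)^γ log(p_t) in Lin et al. [35], reweights easy and hard samples relative to cross-entropy. The value γ = 2 is used only for illustration; the text does not state the γ used in this study.

```python
import math


def cross_entropy(p_t):
    """Cross-entropy for one sample with true-class probability p_t."""
    return -math.log(p_t)


def focal_loss(p_t, gamma=2.0):
    """Focal loss -(1 - p_t)^gamma * log(p_t) for one sample."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Easy (0.9), ambiguous (0.5), and hard (0.1) samples.
for p_t in (0.9, 0.5, 0.1):
    print(p_t, round(cross_entropy(p_t), 4), round(focal_loss(p_t), 4))
```

The easy sample's loss shrinks by a factor of 100, while the hard sample keeps most of its loss. Note also that the logarithm is unbounded as p_t approaches 0, which is relevant to the stability problem early in training discussed later in this section.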
However, the critical issue is that the characteristics of infrared and visible images are different, which makes the typical focal loss perform poorly in the training of infrared small-object detection models.To solve this problem, we propose a batch-averaged biased classification loss (Ba loss) based on focal loss.
First, the extreme class imbalance of positive and hard-negative signals encountered during the training of detectors is a central issue. In response, we set a scale factor β to adjust the proportion of positive and negative samples involved in calculating the final classification loss function, thereby suppressing the excessive interference of hard-negative samples in the model training process. For instance, if there are N positive examples after classification, we sort the negative examples by the confidence loss of each anchor box, from highest to lowest, and pick the top β · N examples. These positives and negatives are used to compute the final classification loss.
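The selection step described above can be sketched as follows. The function name and the toy loss values are hypothetical, and β = 2 is used only to keep the example small.

```python
def select_training_losses(positive_losses, negative_losses, beta):
    """Keep all positives and the beta * N hardest negatives (N = #positives)."""
    num_negatives = min(int(beta * len(positive_losses)), len(negative_losses))
    hardest = sorted(negative_losses, reverse=True)[:num_negatives]
    return positive_losses, hardest


positives = [0.7, 1.1]                              # N = 2 matched anchors
negatives = [0.05, 2.3, 0.4, 0.01, 1.6, 0.9, 0.2]   # per-anchor confidence losses
kept_pos, kept_neg = select_training_losses(positives, negatives, beta=2)
print(kept_neg)  # [2.3, 1.6, 0.9, 0.4]
```

Only the four negatives with the highest confidence loss enter the final classification loss, so the large number of easy background anchors does not dominate the gradient.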
Second, at the beginning of the training process, limited by the performance of the initial detection model and the characteristics of the sample images, it is difficult to avoid some classification errors. These classification errors enormously increase the classification loss value and significantly affect or even terminate the training of the detection model. To address this issue, we added a small bias factor to $L_{fl}$ to avoid gradient explosion. In this study, the bias was set to 1 × 10⁻³. The optimized $L_{fl}$ is:
Finally, the batch size per iteration was not sufficiently large because of the model and hardware memory size limitations. Thus, the classification loss in successive iterations is volatile, particularly in infrared datasets with complex scenes. It is unreliable to evaluate the detection accuracy of the model using the classification loss of a single batch in the later training period. To solve this problem, a smoothing method was adopted in this study to adjust the weights of the classification loss of multiple batch samples. The latest confidence loss has a large weight because it reflects the current situation of the model; early confidence losses have low weights. The modified confidence loss is
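The two ideas in this passage, a small bias that keeps the loss finite and a smoothing over successive batches, can be sketched as follows. Both function bodies are assumptions made for illustration: the sketch places the bias inside the logarithm and uses exponential smoothing with a hypothetical weight of 0.6 for the latest batch. Only the bias value 1 × 10⁻³ comes from the text.

```python
import math

BIAS = 1e-3  # value stated in the text


def biased_focal_loss(p_t, gamma=2.0, bias=BIAS):
    """ASSUMPTION: the bias is added inside the logarithm to bound the loss."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t + bias)


def smoothed_losses(batch_losses, latest_weight=0.6):
    """ASSUMPTION: exponential smoothing; the newest batch weighs the most."""
    smoothed, running = [], None
    for loss in batch_losses:
        if running is None:
            running = loss
        else:
            running = latest_weight * loss + (1.0 - latest_weight) * running
        smoothed.append(running)
    return smoothed


# A completely wrong prediction (p_t = 0) still gives a finite loss.
print(biased_focal_loss(0.0))

# Hypothetical per-batch classification losses and their smoothed values.
raw = [1.0, 3.0, 0.5, 2.5]
print(smoothed_losses(raw))
```

With these assumptions, the loss for p_t = 0 is bounded by −log(bias), and the smoothed sequence fluctuates over a narrower range than the raw per-batch losses while still following the most recent batch most closely.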

Dataset
According to the definition of SPIE, an object with fewer than 80 pixels in an image of 256 × 256 pixels is a small object. The dataset [36] selected in this study contained 15,546 images in which the objects were small fixed-wing UAVs. The dataset acquisition scenes covered the sky, the ground, and a variety of complex scenes. Some of the images in the dataset are shown in Figure 6. The size distribution of objects in the experimental dataset is shown in Figure 7. A total of 82.2% are below 20 pixels, 10.8% are 20 to 40 pixels, 4.3% are 40 to 60 pixels, and 2.7% are 60 to 80 pixels. The dataset contained 21 scenes, and the training and test sets were divided based on the serial number of scenes to ensure that the sample ratio was 4:1. The training and test sets contained data from different backgrounds, and the details are listed in Table 1.

Experimental Settings
The experiments in this study were run on Ubuntu 20.04, and the deep learning framework was PyTorch 1.8.1. The GPU was 11 GB RTX3080Ti. We used the cosine decay method to adjust the learning rate in the training process. The initial learning rate is 1 × 10⁻³, which finally decreases to 1 × 10⁻⁷. The number of training iterations was set as 160,000. The batch size was set as 8. β in the loss function was set as 14. We used the k-means method to cluster the sizes of the ground-truth boxes of the dataset and then preset the anchor box parameters to accelerate the reduction of the regression loss.
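The learning-rate schedule can be sketched with the values given above. The sketch assumes a single cosine cycle with no warm-up or restarts, which the text does not specify.

```python
import math

LR_START = 1e-3        # initial learning rate from the text
LR_END = 1e-7          # final learning rate from the text
TOTAL_ITERS = 160_000  # number of training iterations from the text


def cosine_lr(iteration, lr_start=LR_START, lr_end=LR_END, total=TOTAL_ITERS):
    """Single-cycle cosine decay from lr_start to lr_end (assumed form)."""
    progress = min(max(iteration / total, 0.0), 1.0)
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * progress))


for it in (0, 40_000, 80_000, 120_000, 160_000):
    print(it, cosine_lr(it))
```

The rate decays slowly at first, fastest around the midpoint, and flattens out as it approaches the final value.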

Evaluation Criteria
Infrared images tend to have more false alarms than visible images. Thus, whereas visible small-object detection tasks focus on FN, infrared small-object detection tasks should also consider FP. To address this issue, precision, recall, and F1 score were used as the evaluation criteria in our experiments for infrared small-object detection.
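These criteria are defined as

$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

where TP is the number of true positives, FN is the number of false negatives, and FP is the number of false positives. The loss curve is a time series, and we use the moving standard deviation (MSD) and moving average (MA) as the evaluation criteria for the Ba loss and typical focal loss curves. MSD is used to evaluate curve smoothness, and MA is used to evaluate curve trends. The formulas for MA and MSD are

$$MA = \frac{1}{N} \sum_{i \in L} A_i, \qquad MSD = \sqrt{\frac{1}{N} \sum_{i \in L} \left(A_i - MA\right)^2}$$

where $L$ is the moving window, $N$ is the length of $L$, and $A_i$ is point $i$ on curve $A$.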
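The evaluation criteria can be computed as in the sketch below. Precision, recall, and F1 score follow their standard definitions in terms of TP, FP, and FN. The moving statistics use the population form of the standard deviation (division by the window length), which is an assumption. All counts and loss values are hypothetical and are not taken from the tables.

```python
import math


def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 score from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def moving_stats(curve, window):
    """Moving average and moving (population) standard deviation."""
    averages, deviations = [], []
    for end in range(window, len(curve) + 1):
        segment = curve[end - window:end]
        mean = sum(segment) / window
        variance = sum((a - mean) ** 2 for a in segment) / window
        averages.append(mean)
        deviations.append(math.sqrt(variance))
    return averages, deviations


# Hypothetical detection counts.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.9 0.75 0.818

# Hypothetical loss curves: one oscillating, one decreasing smoothly.
noisy = [1.0, 3.0, 1.0, 3.0, 1.0, 3.0]
smooth = [2.2, 2.1, 2.0, 1.9, 1.8, 1.7]
print(moving_stats(noisy, 4))
print(moving_stats(smooth, 4))
```

The oscillating curve has a flat moving average but a large moving standard deviation, whereas the smoothly decreasing curve has a falling moving average and a small moving standard deviation.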

Ablation Studies
This section assesses the functions of the proposed blocks. We used the SSD as the baseline and modified the backbone, feature enhancement method, and loss function according to the method proposed in this study. A comparison of the detection methods with different configurations is presented in Table 2. First, to assess whether the proposed evaluation criteria were rational, we plotted the precision, recall, mAP, and F1 score in Table 2, as shown in Figure 8. Further analysis showed that the trend of mAP was consistent with the recall trend but was not sensitive to changes in precision. The F1 score was sensitive to changes in both recall and precision. Therefore, this study used the F1 score instead of mAP as the evaluation criterion.
It is apparent from Table 2 that IRS16 can transmit more details about small objects. Compared to No. 1, No. 5 showed an 18.1% decrease in FN, thus resulting in a 2.4% increase in recall. The results also show that the guide maps improved the precision and recall of the detection model. FN and FP decreased significantly when guide blocks were used as the feature enhancement method, with either VGG16 or IRS16 as the backbone. Some of the detection results and corresponding guide maps are shown in Figure 9. Moreover, the batch-averaged biased classification loss function was more conducive to the detection of small infrared objects and effectively improved the recall rate of the detection model. The results of Nos. 3 and 7 show that the recall rates of VGG16 and IRS16 increased by 4.0% and 4.6%, respectively. Overall, these results indicate that each block proposed in this study can improve the accuracy of the detection model. A comparison of SSD512 (No. 1 in Table 2) and IRSDet (No. 8 in Table 2) shows that the latter is more suitable for detecting small infrared objects. Remarkably, the F1 score of IRSDet increased by 5.4% compared with that of SSD512, reaching 96.6%.

Different Configurations of the Proposed Model
In this section, we changed the IRS16 configuration and feature enhancement method. The experimental results are listed in Table 3. The models in Table 3 adopted the classification loss proposed in this study.
Feature extraction: Comparison of the results for 1, 2, and 3. The detection model using the serial block has many FPs, which means that the serial block loses the texture of the objects, weakening the difference between the noise and objects. In contrast, the residual and SRes blocks have lower FPs and can improve the accuracy of the detection model. Notably, the excessive use of low-level information in the residual blocks introduced background noise and, therefore, did not substantially decrease the FPs and FNs.
Down-sampling: Comparison of the detection results of 3, 4, and 5. Max-pooling and Ext-pooling could improve the precision of the detection models.However, the number of FPs in the latter was 45% lower than that in the former.The detection model using convolutional downsampling is inferior to the other detection models.This indicates that convolutional downsampling is inappropriate for infrared small-object detection.
Feature enhancement method: It is apparent from Table 3 that the detection model using FPN has the least number of FNs compared to the other detection models, which reveals that multiscale feature fusion can combine the object information in multiple feature maps to enhance the features of real objects.However, it also stresses the characteristics of static noise, resulting in an undesired increase in the FPs.

Convergence Analysis of Gradient Descent
This study compared the focal loss with the proposed classification loss function. We used the default classification loss function of SSD to pretrain the initial detection model to avoid gradient explosion owing to focal loss at the beginning of the training process. The number of iterations of pretraining was 40,000, and the batch size was set to 8. Subsequently, the focal and Ba loss functions were employed in the model. The number of iterations was 40,000, and the batch size was set to 8. The learning rate was 1 × 10⁻³. The results are shown in Figure 10. The curve of the classification loss function proposed in this study is smoother than the focal loss curve and converges faster.
The experimental results show that in the training process of the infrared small-object detection model, the batch-averaged method can effectively solve the loss fluctuation problem caused by the limitation of GPU memory.Moreover, the scale factor of positive to negative helps the detection model eliminate the learning dilemma owing to the extreme class imbalance between positive and negative samples.Using these methods, the model can focus on distinguishing between hard samples, thereby reducing the loss value for the detection model.

Comparison of Advanced Detection Models
Table 4 shows that the proposed method performs well in infrared small-object detection. The TPs, FNs, FPs, and precision of the proposed model were second best among the compared models, and its recall, mAP, and F1 score were the best. Using high-resolution feature maps inevitably reduces the inference speed of the model; however, it decreases the FNs and FPs. Some of the detection results are shown in Figure 11.

Conclusions
This study proposed an infrared small-object detection framework based on deep learning to improve the detection capacity for small objects such as drones and vehicles in complex backgrounds. First, we proposed a backbone that uses sparse-skip connections and an optimized downsampling method to enhance the feature representation of small objects. Then, we proposed a feature enhancement module based on the attention mechanism to filter potential object regions. Finally, the classification loss function was modified to improve the detection accuracy for infrared hard samples. A small public infrared dataset was used to evaluate the detection model. The experimental results show that the IRSDet proposed in this study performed better than the other advanced small-object detection methods. The precision and recall rates were 98.8% and 94.6%, respectively, and the F1 score reached 96.6%.
This paper provides deeper insight into research in the field of infrared object detection and tracking. The limitations of this study are that we did not optimize the location loss function, and the inference speed of the current detection model was not sufficiently fast. Therefore, our future research direction is to explore a location loss function suitable for small infrared objects and determine an efficient combination of traditional and deep learning methods.

Figure 1. Proposed infrared small-object detection framework. The backbone has 16 weight layers, and the L2-L5 layers have 2, 3, 3, and 3 SRes blocks, respectively. The first layer of the backbone is a parallel convolutional layer with different atrous rates.

Figure 3. (a) ARF block. (b) Sparse convolutional kernel. The convolutional layers in the parallel stage have different atrous rates.

Figure 5. Structure of the guide block.

Figure 6. Some images in the experimental dataset. Red boxes mark the real locations of objects. The resolution of the images is 256 × 256.

Figure 7. Sizes of objects in the dataset.

Figure 8. Experimental results of different methods in Table 2. FE denotes the proposed feature enhancement method, and OL denotes the proposed classification loss.

Figure 9. (a) Detection results and (b) corresponding guide maps. The guide maps highlight the object regions.

Figure 10. Comparison of different classification losses. MA = Moving Average; MSD = Moving Standard Deviation.

Figure 11. Detection results of advanced detection models.

Table 1. Division of the dataset.

Table 2. Experimental results of detection methods with different configurations. Bold numbers indicate the optimal results.

Table 3. Different configurations of the proposed model.

Table 4. Comparison of different detection methods.