IRSTFormer: A Hierarchical Vision Transformer for Infrared Small Target Detection

Abstract: Infrared small target detection occupies an important position in the infrared search and track system. The most common size of infrared images has developed to 640 × 512. The field-of-view (FOV) also increases significantly. As a result, there is more interference that hinders the detection of small targets in the image. However, the traditional model-driven methods do not have the capability of feature learning, resulting in poor adaptability to various scenes. Owing to the locality of convolution kernels, recent convolutional neural networks (CNN) cannot model the long-range dependency in the image to suppress false alarms. In this paper, we propose a hierarchical vision transformer-based method for infrared small target detection in larger size and FOV images of 640 × 512. Specifically, we design a hierarchical overlapped small patch transformer (HOSPT), instead of the CNN, to encode multi-scale features from the single-frame image. For the decoder, a top-down feature aggregation module (TFAM) is adopted to fuse features from adjacent scales. Furthermore, after analyzing existing loss functions, a simple yet effective combination is exploited to optimize the network convergence. Compared to other state-of-the-art methods, the normalized intersection-over-union (nIoU) on our IRST640 dataset and public SIRST dataset reaches 0.856 and 0.758, respectively. Detailed ablation experiments are conducted to validate the effectiveness and reasonability of each component in the method.


Introduction
Infrared detectors can detect targets all day long. Even at night, infrared radiation varies depending on the temperature of objects, so the target will have a grayscale difference from its surroundings on the infrared image. In the field of view (FOV) of the infrared search and track (IRST) system, the long-range target contains a small number of pixels, which is generally considered to have an image area of no more than 9 × 9 pixels in the 256 × 256 image according to the definition of SPIE [1]. Detecting infrared small targets is important for military early-warning [2], maritime surveillance [3], etc. However, small scale causes the lack of inherent features of targets, such as shapes, edges, textures, etc. Improving detection rates while reducing false alarm rates has always been a challenging task.
IRST systems usually use multiple frames stitched together to cover a large FOV. The size and FOV of a single-frame image are continuously increased to save time for the wide-area search. Specifically, covering a horizontal 360-degree range, a 320 × 256 infrared detector with the FOV degree of 2.25 × 1.8 requires 200 frames for stitching. Keeping the resolution constant, a 640 × 512 detector with the FOV degree of 4.5 × 3.6 only needs 100 frames. With the same integration time per frame, a detector with the larger size and FOV can save half the total time. However, these improvements bring more interference that can easily be detected as false alarms, making the detection of small targets more challenging. As shown in Figure 1, in larger size and FOV infrared images, it is difficult to distinguish targets from false alarms from local features only. Depending on the images used, infrared small target detection can be classified into single-frame detection and multi-frame detection. In this paper, we focus on single-frame infrared small target (SIRST) detection.
Over the past decades, many methods for SIRST detection have been proposed. Using deep learning as a boundary, these methods can be divided into model-driven and data-driven. Model-driven methods can be further classified as background suppression-based, local contrast measurement (LCM)-based, and optimization-based. Early methods mainly use background suppression, sliding a special window over the image to enhance the target and suppress the background, such as top-hat filter [4], max-median filter [5], etc. However, these methods generate a large number of false alarms when dealing with sea-sky junction lines or heavy cloud clutter. The LCM-based method is inspired by the human visual system, assuming that the target is a local area with a significant grayscale difference from the background. These methods explicitly construct a discriminative measurement that can reflect the characteristics of small targets. The target is detected based on the difference or ratio between the grayscale of the central pixel and the neighboring pixels in the sliding window [6-8]. However, due to the long-range attenuation of infrared radiation and the weak radiation intensity of the target itself, small targets in infrared images often have low grayscale values and do not always satisfy the assumptions of the LCM-based methods. From a matrix perspective, the optimization-based approach models SIRST detection as a low-rank sparse matrix decomposition [9-11]. Ultimately, these model-driven methods are based on hand-designed features and need to follow some specific assumptions. These inherent shortcomings lead to the poor adaptability of these methods to various scenes in increasingly complex infrared images.
Unlike model-driven methods, recent convolutional neural network (CNN)-based methods have the capability of feature learning in a data-driven manner. Publicly available SIRST datasets have further contributed to the development of CNN-based methods [12,13]. Most of these networks consist of a contracting path that extracts high-level features from the input image and an expanding path that reconstructs the mask for pixel-wise segmentation by single or multilevel up-sampling procedures. In order to detect targets in larger size and FOV images, it becomes critical to learn features of targets and the background in a larger context area. CNN-based methods use the stacking of convolutional layers to increase the receptive field of the network layer by layer, but every value in the feature map only responds to values within the local receptive field in the previous feature map. This inherent locality of convolution makes it difficult to learn long-range dependencies in the image. In NLNet [14], the self-attention mechanism has demonstrated its powerful ability in non-local feature learning in various computer vision tasks and has been subsequently improved and expanded by other researchers [15,16]. However, these methods never remove the CNN architecture. Self-attention is only used as a plug-and-play module for feature refinement. The transformer was originally used in machine translation to learn the long-range dependency between different word tokens by self-attention layers [17]. ViT first applies it to image classification [18]. Dividing the image into different patches as tokens, the vision transformer for the first time abandons convolution layers to extract image features and instead makes full use of self-attention layers to explicitly learn the long-range dependencies of different patches in the image, opening up a promising direction in the field of computer vision.
Based on the above observation, we propose a hierarchical vision transformer-based method to detect infrared small targets in larger size and FOV images of 640 × 512, called IRSTFormer. The overall structure inherits the classic encoder-decoder design, learning the pixel-by-pixel segmentation mask in an end-to-end manner for each input image. We design a hierarchical overlapped small patch transformer (HOSPT) to extract multi-scale features from the input image. Image tokens are obtained by the overlapped small patch embedding (OSPE). It also performs down-sampling at different stages to obtain multi-scale features.
In the decoder, we present a top-down feature aggregation module (TFAM), consisting of the multilayer perceptron (MLP) and the channel-attention block. Adjacent feature maps are fused progressively to obtain the final target segmentation mask. Since the pixel share of the infrared small target is extremely small, after analyzing existing binary cross entropy (BCE) and softIoU loss functions, we exploit a simple yet effective combination of them to optimize the network convergence, called combined BCE and softIoU loss (CBS loss). In addition, we develop a publicly available SIRST dataset of 640 × 512, called IRST640, hoping to promote the further development of this field. In summary, the contributions of the paper are as follows:
• A hierarchical vision transformer is proposed to detect infrared small targets, which removes the intrinsic shortcomings of existing methods;
• A simple yet effective combination of existing loss functions is exploited to optimize the network convergence;
• Experiments on the public SIRST dataset and our developed IRST640 dataset demonstrate the superiority of our method over other state-of-the-art methods.
The rest of the paper is organized as follows. Section 2 reviews related works about infrared small target detection. In Section 3, we describe the proposed method. Sections 4 and 5 are the experiment part, including results and discussion. Section 6 provides the conclusion and our plan for future work.

Detection-Based Infrared Small Target Detection
The data-driven CNN is able to learn features adaptively from images and outperforms model-driven methods for the detection of infrared small targets. According to different processing paradigms, CNN-based methods for SIRST detection can be divided into detection-based [19-22] and segmentation-based methods [12,13,23-29]. The detection-based method outputs the position and scale information of targets directly for the input image, in the same way as generic target detection algorithms, such as Faster RCNN [30] and SSD [31]. ISDet [19] trains both the image filtering network and the target detection network in an end-to-end manner. Du et al. follow the two-stage paradigm of Faster RCNN and design the small-iou strategy for positive and negative sample partitioning to solve the problem of false convergence and sample misjudgment due to small target size [20]. SSD-ST [21] drops low-resolution layers and enhances high-resolution layers of SSD to adapt to the detection of infrared small targets. Chen et al. design a two-stage network for target detection in the linear scanning IRST system [22].

Segmentation-Based Infrared Small Target Detection
The processing paradigm of segmentation-based data-driven methods is the same as that of model-driven methods. They utilize networks to make binary prediction pixel by pixel on the input image to obtain the segmentation mask. After that, a threshold is utilized to output the target position and scale information. Most of the segmentation networks use the encoder-decoder structure, with the encoder condensing the image to extract features and the decoder stretching the features to obtain the segmentation mask. The differences between these methods are reflected in model design [12,23-25], feature optimization [26-29], and feature fusion [13]. Fang et al. convert target segmentation into residual prediction, and the network outputs the background image [23]. While training the segmentation network, TBCNet [24] adds a classification network as the semantic constraint to improve the learning ability of the network for image features. MDvsFA [12] and IRSTD-GAN [25] both use the generative adversarial network (GAN) to perform image translation. MDvsFA simultaneously uses two generators to balance missed detection and false alarms. Besides, it publishes a large-scale infrared small target dataset with the image size of 128 × 128. However, the infrared small targets in this dataset do not exactly meet the definition of SPIE. ACM [13] proposes an asymmetric contextual modulation module to fuse the high-level semantics and low-level details. It also publishes a high-quality dataset called SIRST. This dataset includes various backgrounds with the average size of 302 × 221, but only contains 427 images. ALCNet [26] modularizes MPCM [32] as special blocks in the network to achieve a learnable local contrast measurement, which improves the feature extraction capability of the network.
RISTDnet [27] uses the hand-crafted feature method as convolution layers with fixed parameters and places them at the beginning of the backbone network to form a segmentation network together with normal parameter learnable convolution layers. AGPCNet [28] and LSPM [29] add non-local modules to networks to perform feature refinement. Both of them use the self-attention mechanism to capture long-range dependency in images, but they are still limited by the locality of convolution layers.

Attention Mechanism
The attention mechanism can be regarded as a resource redistribution mechanism that finds correlations between input data and highlights important parts. It has been widely used in computer vision and natural language processing. SENet [33] proposes a two-step operation of squeeze and excitation to dynamically modulate the weights of each channel, thus recalibrating the features to improve the representational power of the network. SKNet [34] adds attention branches of different receptive fields on the basis of SENet, realizing adaptive adjustment of receptive fields to the scale of input information. In addition to one-dimension channel attention, CBAM [35] also extracts two-dimension spatial attention from feature maps to re-weight features. NLNet [14] borrows the self-attention mechanism to model the long-distance dependencies in images. To reduce the computational complexity of the non-local module, GCNet [16] proposes a global context modeling module that integrates spatial attention and channel attention into a single module, which can model the global context as efficiently as NLNet and as lightweight as SENet. NLNet has made a preliminary exploration of applying the self-attention mechanism in deep learning-based computer vision, but like other networks, it only adds plug-and-play modules to the network for feature refinement, without deeply exploring the potential of the attention mechanism. ATAC [36] and AFF [37] utilize attention modules as activation function modules and feature fusion modules in networks, respectively, which essentially suppress useless features and highlight useful features. They can be seen as the exploration of full attention networks.

Transformer for Computer Vision
With the great success of transformers in machine translation and natural language processing [17], the self-attention mechanism has gradually been applied in computer vision. As a pioneer, ViT [18] builds a transformer for image classification, which explicitly models the long-range dependencies between different tokens in an image using self-attention layers. For the first time, it abandons the convolution operation in computer vision tasks, thus avoiding the intrinsic locality. After that, the transformer is gradually used for target detection [38], semantic segmentation [39], and super-resolution [40]. In nnformer [41], the transformer is combined with the UNet for medical image processing. MAResU-Net [42] adds the self-attention module to the CNN for remote sensing image segmentation. After obtaining image features from the CNN, Liu et al. adopt the self-attention mechanism to learn the interaction information of image features in a larger range [43]. Unlike that approach, our network extracts features by a pure transformer structure and does not utilize the convolutional backbone network. For complex infrared images, it is able to explore the long-range dependencies of different regions more effectively and sufficiently.

Method
In this section, we introduce our method IRSTFormer in detail, a vision transformer-based method for infrared small target detection.

Network Architecture
As shown in Figure 2, our method belongs to segmentation-based infrared small target detection. Given an image of H × W × 1, the network classifies each pixel in the image into target or background, and finally outputs the corresponding segmentation mask. In order to reduce false alarms in complex infrared images more efficiently, we propose a hierarchical vision transformer HOSPT to extract multi-scale features. Unlike the convolution layers of CNNs, the self-attention layers in the transformer can learn the dependency relationship in the range of the whole image. This is essential to suppress background interference in complex images. The shallow features contain more target location features that help to locate the target in the image. Deeper features, on the other hand, contain richer semantic features that help to distinguish between false alarms and targets. Therefore, for the decoder, we present the TFAM. In each TFAM, adjacent features are first aggregated in top-down order. After that, we utilize the channel attention to refine the fused feature. After obtaining the predicted segmentation mask, we utilize the CBS loss to optimize the network.

Hierarchical Overlapped Small Patch Transformer
Among the existing deep learning methods, deepening CNNs are used to extract features from infrared small target images, but these methods are always limited by the locality of convolution, resulting in the poor ability of modeling long-range dependencies in the images. With the increase in size and FOV of infrared detectors, this deficiency is more likely to lead to detection errors. Therefore, we design a transformer-based encoder HOSPT for feature extraction.
At the beginning of each stage, we design the OSPE to divide the input feature map into different patches and conduct linear projection to obtain the two-dimension feature embedding. During this process, the OSPE also completes the down-sampling of feature maps to realize the multi-scale feature extraction. After that, the dot-product self-attention layer can explicitly model the dependencies between different image patches. The extracted features define the importance of how each patch is similar to other patches in the input feature map. Figure 3 shows the structure of every stage. Every stage consists of four parts: OSPE, self-attention layer, feed-forward network (FFN), and layer normalization (LN). One self-attention layer, one FFN, and two LN layers constitute one transformer block. Each stage consists of two transformer blocks.
After experimenting with different parameters of the OSPE, we set the patch to 3 × 3 and the stride to 2, which means there is an overlap of three pixels between adjacent patches. Compared with ViT [18], the overlap preserves the continuity between different patches. Specifically, an input three-dimension feature map of $H_i \times W_i \times C_i$ is first divided into $N_{i+1}$ overlapped patches. Then, each patch is flattened and projected linearly into $1 \times C_{i+1}$. Finally, the output two-dimension feature embedding has the size of $N_{i+1} \times C_{i+1}$.
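The down-sampling performed by the OSPE can be illustrated with ordinary strided-window arithmetic. The sketch below is only an illustration: the 3 × 3 patch and stride of 2 come from the text, while the padding of 1, the four stages, and the function names are assumptions introduced here.

```python
def ospe_output_size(size, kernel=3, stride=2, padding=1):
    """Spatial size after an overlapped patch embedding (strided-window arithmetic).

    kernel=3 and stride=2 follow the text; padding=1 is an assumption.
    """
    return (size + 2 * padding - kernel) // stride + 1


def token_counts(height, width, num_stages=4):
    """Return (H_i, W_i, N_i) per stage, where N_i is the number of patch tokens."""
    stages = []
    for _ in range(num_stages):
        height = ospe_output_size(height)
        width = ospe_output_size(width)
        stages.append((height, width, height * width))
    return stages


stages = token_counts(512, 512)  # 512 x 512 is the network input size used in the paper
for i, (h, w, n) in enumerate(stages, start=1):
    print(f"stage {i}: {h} x {w} feature map, {n} tokens")
```

Under these assumptions every stage halves the spatial resolution, so the token count shrinks by a factor of four per stage.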
The self-attention layer aims to capture the long-range dependency of every patch pair. As shown in Figure 3, given a feature map, the network learns three sets of parameters to project the features (F) of N × C to query (Q), key (K), and value (V). Then, the weight is obtained by similarity calculation of the query and the key. Common similarity functions include dot-product, concatenation, perceptron, etc. The softmax function is used to normalize the weight. Finally, we multiply the weight with the corresponding value to obtain the final attention features. The attention features define the importance of how each patch is similar to other patches in the feature map. The original standard multi-head self-attention makes Q, K, and V have the same size of N × C and calculates the self-attention in the form of dot-product with the following equation:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{head}}}\right)V$$

where $d_{head}$ is the dimension of each head. We can see that the computational complexity is quadratic in the size of the feature map, which is prohibitive for large size images. Therefore, the spatial reduction is applied to K and V, which can be formulated (for K, and likewise for V) as:

$$\hat{K} = \mathrm{Reshape}\left(\frac{N}{R},\, C \cdot R\right)(K)$$

Then, the linear projection is utilized to restore the number of channels from $C \cdot R$ to $C$. After these operations, we obtain K and V of $\frac{N}{R} \times C$. As a result, the computational complexity of self-attention is reduced from $O(N^{2})$ to $O(N^{2}/R)$.

In the FFN, the 3 × 3 convolution is utilized to replace the position encoding. Therefore, the encoder is robust to different sizes of input images as generally found in the segmentation task. The FFN can be formulated as:
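As a hedged illustration of the spatial-reduction self-attention described in this subsection (not of the FFN), the following minimal sketch uses plain Python lists. It assumes a single head, random projection weights, no bias terms, and a reshape that merges R consecutive tokens; all function names are hypothetical, and practical implementations typically work on two-dimensional feature maps instead.

```python
import math
import random


def matmul(a, b):
    """Multiply matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax_rows(m):
    """Row-wise softmax with max-subtraction for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def spatial_reduce(tokens, r, weight):
    """Reshape N x C tokens to (N/R) x (C*R), then project back to C channels."""
    assert len(tokens) % r == 0, "N must be divisible by R"
    merged = []
    for start in range(0, len(tokens), r):
        row = []
        for tok in tokens[start:start + r]:
            row.extend(tok)  # concatenate R consecutive tokens (assumed grouping)
        merged.append(row)
    return matmul(merged, weight)  # weight has shape (C*R) x C


def sr_attention(q, k, v, r, w_k, w_v):
    """Single-head dot-product attention with spatially reduced K and V."""
    d_head = len(q[0])
    k_red = spatial_reduce(k, r, w_k)             # (N/R) x C
    v_red = spatial_reduce(v, r, w_v)             # (N/R) x C
    k_t = [list(col) for col in zip(*k_red)]      # C x (N/R)
    scores = [[s / math.sqrt(d_head) for s in row] for row in matmul(q, k_t)]
    weights = softmax_rows(scores)                # N x (N/R) attention map
    return weights, matmul(weights, v_red)        # output is N x C


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
N, C, R = 8, 4, 4
q, k, v = rand_matrix(N, C), rand_matrix(N, C), rand_matrix(N, C)
weights, out = sr_attention(q, k, v, R, rand_matrix(C * R, C), rand_matrix(C * R, C))
print(len(weights), len(weights[0]))  # attention map shape: N x N/R
print(len(out), len(out[0]))          # output shape: N x C
```

The attention map has N × N/R entries instead of N × N, which is where the reduction in cost comes from, while the output keeps the N × C shape of the input tokens.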

Top-Down Feature Aggregation Module
After obtaining the features of four scales, we should aggregate them in a suitable way. In the U-Net, the transpose convolution and the shortcut are utilized to fuse adjacent scaled features. However, this design will double the number of parameters in the network. Considering the number of parameters in the transformer is already more than that of the CNN, we adopt the simple design of the feature pyramid network (FPN) [44]. In the original FPN, features at different scales are fused by linear addition. This unweighted fusion approach may lead to redundancy of information. Therefore, highlighting important features and suppressing useless features is a more appropriate way to aggregate.
We present the TFAM to form a progressive decoder. As shown in Figure 2, according to the top-down order, the features of adjacent stages are fused to obtain the final pixel segmentation mask. The structure of TFAM is shown in Figure 4. During the fusion, the MLP is first used to unify the dimensions of different scaled features. Then, upper-level features are up-sampled and concatenated with lower-level features along the channel dimension. After that, we utilize a convolution layer of 3 × 3 and a ReLU function to reduce the dimension and obtain fused features. Finally, channel attention is used to refine the fused features.
Taking features of C × H × W as the input, we first utilize the global pooling for shrinking the feature maps to obtain channel-wise statistics. Next, channel attention, which explicitly models the global information among channels, is obtained after two linear functions and two activation functions. Refined features can be obtained by multiplying the channel attention and the input features. In this way, useful features can be highlighted while useless features can be suppressed. The overall process can be formulated as:

$$CS = \mathrm{GlobalPool}(F)$$

$$CA = \sigma\left(\mathrm{Linear}_{\frac{C}{4} \to C}\left(\delta\left(\mathrm{Linear}_{C \to \frac{C}{4}}(CS)\right)\right)\right)$$

$$F_R = CA \otimes F$$

where F means the input features of C × H × W, CS means the channel-wise statistics, δ means the ReLU function, σ means the sigmoid function, CA means the channel attention, and $F_R$ means the refined features.
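The channel-attention step can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: global average pooling is assumed for the pooling step, the linear layers have no bias, the weights are random, and the function name `channel_attention` is hypothetical.

```python
import math
import random


def channel_attention(features, w1, w2):
    """Refine C x H x W features with squeeze-and-excitation style channel attention."""
    # Squeeze: global average pooling (an assumption) gives one statistic per channel (CS).
    cs = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features]
    # Excitation: Linear (C -> C/4) followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(w_row, cs))) for w_row in w1]
    # Linear (C/4 -> C) followed by sigmoid yields the channel attention (CA).
    ca = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(w_row, hidden))))
          for w_row in w2]
    # Re-weight every channel of the input by its attention value (F_R).
    refined = [[[ca[c] * v for v in row] for row in ch]
               for c, ch in enumerate(features)]
    return ca, refined


random.seed(0)
C, H, W = 8, 4, 4
features = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
w1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C // 4)]  # (C/4) x C
w2 = [[random.uniform(-1, 1) for _ in range(C // 4)] for _ in range(C)]  # C x (C/4)
ca, refined = channel_attention(features, w1, w2)
print([round(a, 3) for a in ca])
```

Each attention value lies strictly between 0 and 1, so a channel is attenuated in proportion to how unimportant the excitation branch judges it to be.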

Loss Function
Infrared small target detection can be seen as the binary classification of the input image, where each pixel is distinguished as the target or the background. LSPM [29] utilizes the binary cross-entropy (BCE) loss function when training:

$$L_{BCE} = -\frac{1}{n}\sum_{i=1}^{n}\left[G_i \log P_i + (1 - G_i)\log(1 - P_i)\right]$$

where n is the batch size, G is the ground truth, and P is the predicted segmentation mask. However, the pixel area of small infrared targets is extremely small. In our test images, the small target has a pixel share of less than 0.03% ($\frac{9 \times 9}{640 \times 512} \approx 0.00025$). Due to the severe imbalance between positive and negative samples, a network supervised by the BCE loss tends to output all zeros during training, because the loss value remains small even then. In other words, the target is overwhelmed by the background. Secondly, there is no prioritization between the target and background, and all pixels in the image are treated equally. Finally, the loss of each pixel is calculated independently, ignoring the global structure of the image.
To obtain a better model, we expect the network to focus more on the target region, rather than treating all pixels equally. Intersection over union (IoU) is usually used as the metric for image segmentation, so an intuitive idea is to directly use IoU as the loss function [45]. In ALCNet [26], AGPCNet [28], and DNANet [46], the softIoU loss function is utilized for infrared small target detection, which is defined as

$$L_{softIoU} = 1 - \frac{1}{n}\sum_{i=1}^{n}\frac{\sum G_i \cdot P_i}{\sum \left(G_i + P_i - G_i \cdot P_i\right)}$$

where n is the batch size, G is the ground truth, P is the predicted segmentation mask, and the inner sums run over all pixels. However, when supervised by the softIoU loss, our network cannot converge, so that no target can be detected. We analyze this phenomenon from the perspective of the gradient.
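To make the class imbalance concrete, the sketch below evaluates both losses for a prediction that is almost entirely background on a single 640 × 512 image containing one 9 × 9 target. It is an illustration only: the per-pixel averaging of the BCE loss, the constant prediction value of 0.01, and the function names are assumptions made here.

```python
import math


def bce_loss(pred, target):
    """Mean binary cross-entropy over all pixels of one flattened image."""
    total = sum(-(t * math.log(p) + (1 - t) * math.log(1 - p))
                for p, t in zip(pred, target))
    return total / len(pred)


def soft_iou_loss(pred, target):
    """One minus the soft intersection-over-union of one flattened image."""
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(p + t - p * t for p, t in zip(pred, target))
    return 1.0 - inter / union


num_pixels = 640 * 512
target = [1.0] * (9 * 9) + [0.0] * (num_pixels - 9 * 9)  # a single 9 x 9 target
pred = [0.01] * num_pixels                               # "everything is background"

bce = bce_loss(pred, target)
soft_iou = soft_iou_loss(pred, target)
print(f"target pixel share: {9 * 9 / num_pixels:.5f}")
print(f"BCE loss:           {bce:.4f}")
print(f"softIoU loss:       {soft_iou:.4f}")
```

Although the target is missed entirely, the BCE value stays close to zero, whereas the softIoU value stays close to one, which matches the observation that BCE alone lets the background overwhelm the target.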
For analysis, we assume that the network performs the single point output. Consequently, the following equation is used to calculate the loss value:

$$L_{softIoU} = 1 - \frac{y \cdot t + \varepsilon}{y + t - y \cdot t + \varepsilon}, \qquad y = \sigma(x) = \frac{1}{1 + e^{-x}}$$

where x is the output, y ∈ (0, 1) represents the probability value of a pixel being the target, t ∈ {0, 1} represents the ground truth of the pixel, among which 0 means the background and 1 means the target, and ε is the smoothing factor, which is a very small value.
Using the chain rule with $y = \sigma(x)$, the gradient of the softIoU loss is as follows and shown in Figure 5:

$$\frac{\partial L_{softIoU}}{\partial x} = \frac{\partial L_{softIoU}}{\partial y} \cdot \frac{\partial y}{\partial x} = \begin{cases} -\dfrac{y(1 - y)}{1 + \varepsilon}, & t = 1 \\[2mm] \dfrac{\varepsilon \, y(1 - y)}{(y + \varepsilon)^{2}}, & t = 0 \end{cases}$$

Figure 6 shows the network output values x at the first and middle epoch of the training. Because of the weight initialization, the network output values at the first epoch are concentrated around 0. As shown by the gradient diagrams, the absolute values of the gradient at this time are close to 0 for the negative background sample (t = 0) and the maximum value for the positive target sample (t = 1). This indicates that the contribution of the background region to the network is much smaller than that of the target region, which means that the network is more concerned with finding the target. This tendency will make the network segment more pixels and reduce the missed detection of target pixels. Entering the middle epoch, the network outputs negative and positive values for the predicted background and target regions, respectively. As can be seen from the gradient diagrams, the gradient values at this time both tend to be close to 0. Therefore, the network parameters tend to be updated slowly. Since it has entered the gradient saturation zone, if there are false alarms or missed targets at this time, it will be difficult for the network to overcome these errors, that is, the network is less sensitive to errors. On the other hand, during the training, the value of the loss function varies with the change of the segmentation mask. To ensure smooth training, large changes in the value need to be avoided. In the softIoU loss, a smaller target pixel area (denominator) leads to a larger change in the loss function values corresponding to the same prediction change (change of the numerator), further leading to a drastic gradient change. Once it enters the saturation zone in the early stage of training, it will make the network difficult to converge.
Therefore, compared to the generic instance segmentation, a single softIoU loss leads to an instability of the training when the network performs infrared small target segmentation.
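The saturation behaviour discussed above can be checked numerically. The sketch below is an illustration under explicit assumptions: the single-point loss is taken as 1 − (y·t + ε)/(y + t − y·t + ε) with y = sigmoid(x), ε = 1e-6 is an arbitrary small value, and the gradient is estimated by central finite differences; the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def soft_iou_point(x, t, eps=1e-6):
    """Single-point softIoU loss; the placement of eps is an assumption."""
    y = sigmoid(x)
    return 1.0 - (y * t + eps) / (y + t - y * t + eps)


def numeric_grad(loss_fn, x, t, h=1e-5):
    """Central finite-difference estimate of d(loss)/dx."""
    return (loss_fn(x + h, t) - loss_fn(x - h, t)) / (2 * h)


for x in (-6.0, 0.0, 6.0):
    g_target = numeric_grad(soft_iou_point, x, 1)
    g_background = numeric_grad(soft_iou_point, x, 0)
    print(f"x = {x:+.0f}   t = 1: {g_target:+.6f}   t = 0: {g_background:+.6f}")
```

Around x = 0 the target sample receives the largest gradient magnitude (about 0.25) while the background sample receives almost none; once the outputs move to clearly positive or negative values, both gradients become small.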
We also analyze the BCE loss function from the perspective of gradient. The following equation is used to calculate the loss value:

$$L_{BCE} = -\left[t \log y + (1 - t)\log(1 - y)\right]$$

where x is the output, y = σ(x) ∈ (0, 1) represents the probability value of a pixel being the target, t ∈ {0, 1} represents the ground truth of the pixel, among which 0 means the background and 1 means the target. According to the chain rule, the gradient of the BCE loss is as follows:

$$\frac{\partial L_{BCE}}{\partial x} = \frac{\partial L_{BCE}}{\partial y} \cdot \frac{\partial y}{\partial x} = y - t$$

We can observe that the gradient of the BCE loss is equal to the prediction error. The positive and negative samples contribute to the gradient equally. In the early training period, the prediction error is relatively large. Therefore, the gradient value is large. The network parameters can be updated quickly. Entering the latter period, as the prediction error decreases, parameters are updated more slowly and the network gradually converges to a stable state.
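The statement that the BCE gradient equals the prediction error can be verified numerically. This short check assumes that the probability is obtained from the output by a sigmoid, y = sigmoid(x); the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_point(x, t):
    """Single-point binary cross-entropy on the sigmoid of the raw output x."""
    y = sigmoid(x)
    return -(t * math.log(y) + (1 - t) * math.log(1 - y))


def numeric_grad(x, t, h=1e-6):
    """Central finite-difference estimate of d(loss)/dx."""
    return (bce_point(x + h, t) - bce_point(x - h, t)) / (2 * h)


for x in (-2.0, 0.0, 3.0):
    for t in (0, 1):
        analytic = sigmoid(x) - t  # prediction error y - t
        print(f"x = {x:+.0f}, t = {t}: numeric {numeric_grad(x, t):+.6f}, y - t {analytic:+.6f}")
```

The two columns agree to within the finite-difference error, for both positive and negative samples.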
Based on the analysis above, we propose the combined BCE and softIoU (CBS) loss, as formulated below:

where α = 1 and β = 100. It consists of the BCE loss and the softIoU loss. The former mitigates category imbalance, while the latter can provide smooth gradient values. Inspired by Libra RCNN [47], we utilize the natural logarithm to balance the values of the two parts. In Section 5.3, we explore different forms and parameters of the combination. Compared with the weighted addition, the natural logarithm has the ability to adaptively adjust the two loss functions at different epochs of the training. The gradient of the CBS loss is formulated as

At the early epoch of the training, the value of $L_{BCE}$ is relatively large. Due to adjustment coefficients α and β, the contribution of $L_{BCE}$ to the gradient is reduced, and the gradient mainly comes from $L_{softIoU}$. The network focuses more on the target area. When the training enters late epochs, the value of $L_{BCE}$ becomes small. Due to the effect of adjustment coefficients, the gradient contribution of $L_{BCE}$ is increased. Even if an error occurs, resulting in the saturation of $L_{softIoU}$, $L_{BCE}$ can still supply enough gradient for the network parameters to continue iterating.

Result
In this section, we first introduce the experimental setting including the dataset, evaluation metrics, and implementation details. Then, we compare our IRSTFormer with other state-of-the-art methods to demonstrate the effectiveness of the network. Finally, we show ablation studies on the encoder, decoder, loss function, and dataset of the network to verify the design of the network.

Dataset
Our study is motivated by the fact that infrared images with progressively larger size and FOV make small target detection more challenging. However, existing public datasets do not meet this requirement. Based on real scene images collected by an IRST system, we develop a synthesized infrared small target dataset of 640 × 512, called IRST640, which contains 1024 images. As shown in Figure 1, the background interference includes clouds, buildings, and trees. We generate one or more infrared small targets on each real scene image. Zhao et al. have demonstrated the potential of synthesized data for realistic detection tasks [25]. The IRST640 is available on our homepage: https://github.com/jzchenriver/IRST640 (accessed on 5 June 2022).
With a split ratio of 8:2, we obtain the training set of 819 images and the test set of 205 images. Since our dataset is collected by an IRST system at a fixed location, the images are of a single scene. The publicly available dataset SIRST has an average image size of 302 × 221. Although smaller than ours, it contains more kinds of scenes. After experimenting with the mixed and unmixed datasets, we mix the training set of our IRST640 with the training set of SIRST to avoid overfitting caused by the single scene.
The final training set consists of 1160 images, among which 819 images come from our IRST640 dataset and 341 images come from the public SIRST dataset. The two test sets are kept independent. To be specific, 205 images from the IRST640 dataset and 86 images from the SIRST dataset are used for evaluation separately.
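The image counts above follow from simple arithmetic, sketched below. Rounding the training share down is an assumption chosen because it reproduces the numbers stated in the text; it is not a description of how the actual splits were drawn.

```python
def split_80_20(total):
    """Split a dataset 8:2, rounding the training share down (an assumption)."""
    train = total * 8 // 10
    return train, total - train


irst640_train, irst640_test = split_80_20(1024)  # IRST640 contains 1024 images
sirst_train, sirst_test = split_80_20(427)       # SIRST contains 427 images
mixed_train = irst640_train + sirst_train

print(irst640_train, irst640_test)  # 819 205
print(sirst_train, sirst_test)      # 341 86
print(mixed_train)                  # 1160
```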

Evaluation Metrics
The network performs detection by segmenting targets from the background, so pixel-level metrics and target-level metrics are utilized to conduct the evaluation simultaneously.
In terms of pixel-level metrics, we use the normalized IoU (nIoU) and F-measure (FM) to perform the comprehensive evaluation. They are defined as:

$$nIoU = \frac{1}{N}\sum_{i=1}^{N}\frac{TP[i]}{T[i] + P[i] - TP[i]}$$

$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}$$

$$FM = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

where N is the number of test images, and TP, FP, FN, T, and P denote the true positive, false positive, false negative, true, and positive, respectively. The nIoU first calculates the IoU for each prediction and then calculates the mean value for the entire test set. The FM simultaneously considers Precision and Recall. A method is considered to be a good method when obtaining high values (↑) on all of these metrics.

The target-level metrics, probability of detection (PD) and false-alarm rate (FA), are also utilized. They are defined as:

$$PD = \frac{\#\,\text{number of correctly predicted targets}}{\#\,\text{number of all targets}}$$

$$FA = \frac{\#\,\text{number of falsely predicted pixels}}{\#\,\text{number of all pixels}}$$

We consider that the correctness of the prediction depends on whether the centroid distance between it and the ground truth is less than 3 pixels. A method is considered to be a good method when obtaining a high PD (↑) and a low FA (↓).
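The pixel-level metrics can be sketched on flattened binary masks as follows. This is a minimal illustration: accumulating TP, FP, and FN over the whole test set for the F-measure and returning an IoU of 1 for an image with an empty union are assumptions made here, and the function names are hypothetical.

```python
def confusion(pred, truth):
    """Count true positives, false positives, and false negatives for one image."""
    tp = sum(1 for p, g in zip(pred, truth) if p and g)
    fp = sum(1 for p, g in zip(pred, truth) if p and not g)
    fn = sum(1 for p, g in zip(pred, truth) if not p and g)
    return tp, fp, fn


def niou(preds, truths):
    """Mean of per-image IoU values."""
    ious = []
    for pred, truth in zip(preds, truths):
        tp, fp, fn = confusion(pred, truth)
        union = tp + fp + fn  # equals T + P - TP
        ious.append(tp / union if union else 1.0)  # empty union treated as perfect
    return sum(ious) / len(ious)


def f_measure(preds, truths):
    """Harmonic mean of Precision and Recall, accumulated over all images."""
    tp = fp = fn = 0
    for pred, truth in zip(preds, truths):
        a, b, c = confusion(pred, truth)
        tp, fp, fn = tp + a, fp + b, fn + c
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


truths = [[0, 1, 1, 0], [0, 0, 1, 0]]  # two tiny flattened ground-truth masks
preds = [[0, 1, 0, 0], [0, 1, 1, 0]]   # one missed pixel, one false-alarm pixel
print(niou(preds, truths), f_measure(preds, truths))
```

In this toy example each image has an IoU of 0.5, so the nIoU is 0.5, while Precision and Recall are both 2/3 and therefore so is the F-measure.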

Implementation Details
Our proposed method is implemented using PyTorch 1.7.0. We resize the image to the size of 512 × 512 as the input. The network is optimized by the Adagrad method, where the weight decay coefficient is set to 0.01. The weights pre-trained on ImageNet are used for network initialization. We train the network for 100 epochs with a batch size of 2. In the first 10 epochs, the learning rate increases linearly from 0 to 0.0005. After that, the initial

Qualitative Results
For an intuitive comparison, Figures 7 and 8 show the segmentation masks produced by all 14 methods on the same images.
For the complex cloud background shown in Figure 7a, neither the background suppression-based methods nor the optimization-based method detects the target. Although TLLCM and RLCM can detect the target, both are less effective in suppressing false alarms. All the deep learning-based methods are able to accurately detect targets from complex backgrounds, indicating that their feature extraction ability learned from training data is robust to heavy cloud layers.
In Figure 7b-d, the ground scene is more complex than the sky scene, and the deep learning-based methods produce false alarms. In particular, in Figure 7c, only ResNetFPN, DNANet, and our proposed IRSTFormer accurately detect the target without any false alarms.
In Figure 7e, the target luminance is weak. Furthermore, there are a large number of noisy pixels. The detection results of the deep learning-based methods are overall inferior to the traditional methods. MDvsFA detects both the target and the noisy pixels, so the FA value is high. Even so, our proposed method achieves target detection with zero false alarms.
In summary, benefiting from the feature learning capability of networks, the detection results of deep learning-based methods are much better than those of traditional methods. Among the three subclasses (CNN, GAN, and Transformer), the GAN-based MDvsFA has a higher false alarm rate, although it is more capable of detecting targets. For cloud scenes with relatively simple features, the CNN-based methods have a strong detection capability. However, when faced with more complex ground scenes, they cannot suppress false alarms well because of the locality of convolution. As the FOV and size of infrared images increase, this defect causes more detection errors. The transformer-based approaches explicitly model the dependencies between different image patches through the self-attention mechanism, and thus have a strong ability to distinguish between targets and false alarms. Compared with Segformer, our proposed IRSTFormer improves the encoder, the decoder, and the loss function, and therefore achieves both a higher detection rate and a lower false alarm rate.

Quantitative Results
The quantitative results of different methods on IRST640 and SIRST are shown in Table 1.
Table 1. Comparison results on the IRST640 and SIRST datasets with pixel-level and target-level metrics.
Our proposed IRSTFormer achieves the best results on all metrics, which demonstrates the effectiveness of the method. Specifically, deep learning-based methods significantly outperform traditional methods in both pixel-level and target-level metrics. This is because traditional methods cannot learn to extract features from the image. Moreover, many images in the test set contain complex scenes, such as buildings, trees, and clouds, that do not satisfy the prior assumptions of traditional methods. This leads to poor results for these methods, among which RLCM and PSTNN achieve relatively good results.
For all deep learning-based methods, the input size is 512 × 512. The PD of all methods exceeds 0.84, but the detection results of the GAN-based MDvsFA contain a large number of false alarms, which leads to low pixel-level metrics. For the CNN-based DNANet and AGPCNet, the softIoU loss used in the original papers does not enable the network to converge properly in our experiments. We analyze the reason for this phenomenon in Section 3.4. After changing the loss function to the proposed CBS loss, the training proceeds normally. It is worth noting that LSPM, which borrows the idea of self-attention, achieves the second-best detection results, but it only uses self-attention in feature optimization modules within the network. In contrast, we construct a feature extraction network that is entirely based on the self-attention mechanism, taking full advantage of its ability to model long-range dependencies. On the IRST640 dataset, the nIoU and FM are improved by 0.018 and 0.012, respectively. On the SIRST dataset, which contains more complex scenes, the most noticeable improvement is in the PD metric, which increases by 0.055. Compared with the transformer-based Segformer, the improvement achieved by our method in both nIoU and FM is close to 0.1. PD continues to improve while FA drops by an order of magnitude.
In Table 2, we show the computational complexity (GMac), the number of parameters, and the processing speed (FPS) of different networks for a 512 × 512 input image, measured on a computer with one NVIDIA RTX 2080 Ti GPU. Compared with the second-best LSPM, our method improves all detection metrics, while the computational complexity and the number of network parameters decrease significantly. In future work, we will focus on improving the processing speed of IRSTFormer.

Discussion
As mentioned earlier, our proposed IRSTFormer consists of three parts: HOSPT encoder, TFAM decoder, and CBS loss. In this section, we discuss our method in detail. Each experiment is trained for 50 epochs.

Ablation Study
Using Segformer with BCE loss as the baseline, we show the enhancement brought by each of these three parts. Tables 3 and 4 show the results on the two datasets.
On the IRST640 dataset, after changing the encoder to the HOSPT, nIoU is improved by 0.037. This indicates that the design of overlapped small patches is suitable for segmenting infrared targets at an extremely small scale. On top of this, the TFAM and the CBS loss continue to improve the detection performance. In particular, the CBS loss is highly effective for false alarm suppression. Compared with the baseline, our method achieves an increase of 0.095 in nIoU, 0.068 in FM, and 0.012 in PD. Furthermore, FA drops by a factor of more than 100.
The performance on the SIRST dataset is similar. The target-level metric PD is improved by nearly 10% and FA decreases by a factor of nearly 3. The three parts improve the nIoU by 0.064, 0.003, and 0.012, respectively; the largest gain comes from the encoder, which demonstrates the effectiveness of the OSPE in the transformer encoder. The other pixel-level metric, FM, shows the same trend. In the next section, we show the influence of different parameters in the OSPE.

Different Parameters in the OSPE
In this section, we experiment with different parameters in the OSPE. The results are shown in Tables 5 and 6. As mentioned above, the HOSPT has four stages. The OSPE is located at the beginning of each stage. It divides the feature maps into patches and projects them to a two-dimensional embedding. There are two parameters, the patch size and the stride, of which the stride determines the downsampling factor. Considering the target size, we set the stride to 2. Therefore, the total downsampling factor at the last stage is 16. It is worth noting that, compared with Segformer, which downsamples by a factor of 32, this small change results in a significant improvement in the detection metrics. The experiments on both datasets show the same trend. When the patch size drops from 7 to 3, the detection performance gradually improves. We suggest that this is related to the pixel size of infrared small targets in the images. When the patch size and the stride are both 2, there are no overlapping pixels between adjacent patches. The results show that the presence of overlapping pixels leads to better performance, which indicates the importance of preserving the continuity between patches.
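The relation between patch size, stride, overlap, and downsampling factor can be illustrated with simple arithmetic. The sketch below is not the authors' implementation: it assumes the patch embedding behaves like a strided convolution, that the default padding is `(patch - 1) // 2`, and that the same patch size is used in all four stages. All function names are hypothetical.

```python
def embed_output_size(size, patch, stride, padding=None):
    """Side length after one strided patch-embedding layer (convolution formula)."""
    if padding is None:
        # Assumption: symmetric padding of (patch - 1) // 2 pixels.
        padding = (patch - 1) // 2
    return (size + 2 * padding - patch) // stride + 1


def overlap(patch, stride):
    """Number of pixels shared by two adjacent patches along one axis."""
    return max(patch - stride, 0)


def stage_sizes(size, patch, stride, stages=4):
    """Feature-map side length after each of the encoder stages."""
    sizes = []
    for _ in range(stages):
        size = embed_output_size(size, patch, stride)
        sizes.append(size)
    return sizes


sizes = stage_sizes(512, patch=3, stride=2)
print(sizes)                         # side length after each stage
print(512 // sizes[-1])              # total downsampling factor
print(overlap(3, 2), overlap(2, 2))  # overlapping vs non-overlapping patches
```

With a stride of 2 in each of the four stages, a 512 × 512 input is reduced to 32 × 32, which corresponds to the factor of 16 stated above. A patch size of 3 shares one pixel between neighbouring patches, whereas a patch size of 2 shares none.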

Different Forms of Combination of the BCE and SoftIoU Loss
After analyzing the existing loss functions for infrared small target detection, we find that both the BCE loss and the softIoU loss have their own drawbacks. When the size and FOV of infrared images increase, these drawbacks lead to more detection errors. We therefore look for a suitable form in which to combine them. The simplest form is the weighted addition (WA) of the two losses with weights α and β. In addition, inspired by Libra R-CNN [47], we also use the natural logarithm (NL) to balance the values of the two parts. The results are shown in Tables 7 and 8. The trend of the experimental results on the two test sets is similar, with the best performance coming from the last setting in the tables, in which the two parts are combined by the natural logarithm with α = 1, β = 100. When the softIoU loss is used alone, the network cannot converge during training, and all metrics are zero. We analyze the cause of this phenomenon in Section 3.4. Comparing the weighted addition and natural logarithm groups, the latter gives better results. We suggest that the natural logarithm function can adaptively adjust the trend of the loss value at different epochs of training. Meanwhile, the nonlinearity of the function improves the ability of the network to fit nonlinear data. Because the values of the two functions differ by many orders of magnitude, the best result within each group comes from α = 1, β = 100.
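For readers unfamiliar with the two loss terms, the following is a minimal pure-Python sketch of the standard BCE and softIoU losses and of a weighted addition of the two. It is not the authors' CBS loss. Which of α and β weights which term is an assumption made here for illustration, and the natural-logarithm form is deliberately not reproduced because its exact formula is not given in this text. All function names are hypothetical.

```python
from math import log


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flattened per-pixel probabilities."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * log(p) + (1 - y) * log(1.0 - p)
    return total / len(probs)


def soft_iou_loss(probs, targets, eps=1e-7):
    """One minus the soft IoU, computed from probabilities instead of hard masks."""
    inter = sum(p * y for p, y in zip(probs, targets))
    union = sum(p + y - p * y for p, y in zip(probs, targets))
    return 1.0 - inter / (union + eps)


def weighted_addition(probs, targets, alpha, beta):
    """Weighted sum of the two losses.

    Assumption: alpha weights the BCE term and beta weights the softIoU term.
    """
    return alpha * bce_loss(probs, targets) + beta * soft_iou_loss(probs, targets)


# One target pixel predicted at 0.9 and one background pixel predicted at 0.1.
probs = [0.9, 0.1]
targets = [1, 0]
print(bce_loss(probs, targets))
print(soft_iou_loss(probs, targets))
print(weighted_addition(probs, targets, alpha=1.0, beta=1.0))
```

The BCE term is averaged per pixel, while the softIoU term is a ratio computed over the whole mask, so the two respond very differently to a few misclassified pixels in a large image. This difference is one motivation for weighting them when they are combined.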

Different Training Sets
We conduct experiments on two datasets: our proposed IRST640 and the publicly available SIRST. Table 9 shows the results with different training datasets. Since our IRST640 is collected by an IRST system at a fixed location, the images are of a single scene. As a result, if we use only the training set of IRST640, the detection performance is excellent on its own test set but poor on the SIRST test set. This suggests that the network tends to overfit. When trained with SIRST only, the network has better generalization capability. However, the segmentation performance still needs to be improved. After mixing, compared with training on IRST640 only, a significant increase is achieved on SIRST, although there is a slight decrease in performance on the IRST640 test set. Compared with training on SIRST only, the performance on both test sets improves noticeably, benefiting from the larger number of training images. This indicates that the mixed training set endows the network with stronger generalizability.

Conclusions
In this paper, we propose a novel vision transformer-based method for single-frame infrared small target detection, called IRSTFormer. Different from existing methods, this network adopts a pure transformer design for the encoder. Making full use of the self-attention mechanism, the proposed HOSPT can learn long-range dependencies in increasingly complex images. For the decoder, a compact module, TFAM, is presented to perform the feature aggregation progressively. Furthermore, after analyzing traditional loss functions in detail, we propose the CBS loss to supervise the optimization of the network parameters. This simple yet effective loss function brings a significant improvement in false alarm suppression. Compared with state-of-the-art methods, our IRSTFormer achieves the best pixel-level and target-level detection performance. The nIoU and detection rate reach 0.758 and 0.991 on the public SIRST dataset. On our IRST640 dataset, our method also achieves the best result. Exhaustive ablation studies demonstrate the effectiveness and reasonability of each component of the method.
In the future, we will focus on the deployment of IRSTFormer on edge devices and explore how to apply a transformer to multi-frame images for infrared small target recognition.
Funding: This research received no external funding.

Conflicts of Interest: The authors declare no conflicts of interest.