AVILNet: A New Pliable Network with a Novel Metric for Small-Object Segmentation and Detection in Infrared Images

: Infrared small-object segmentation (ISOS) has a persistent trade-off problem: which comes first, recall or precision? Striking a fine balance between them is of vital importance for obtaining the best performance in real applications, such as surveillance, tracking, and many fields related to infrared search and track. The F1-score may be a good evaluation metric for this problem. However, since the F1-score depends upon a specific threshold value, it cannot reflect the user's requirements across various application environments. Therefore, several metrics are commonly used together. We introduce F-area, a novel metric for a panoptic evaluation of average precision (AP) and F1-score. It can simultaneously consider performance in terms of real applications and the potential capability of a model. Furthermore, we propose a new network, called the Amorphous Variable Inter-located Network (AVILNet), which has a pliable structure based on GridNet and is an ensemble network consisting of a main network and its sub-network. Compared with the state-of-the-art ISOS methods, our model achieved an AP of 51.69%, an F1-score of 63.03%, and an F-area of 32.58% on the International Conference on Computer Vision 2019 ISOS Single dataset using one generator. In addition, it achieved an AP of 53.6%, an F1-score of 60.99%, and an F-area of 32.69% using dual generators, beating the existing best record (AP, 51.42%; F1-score, 57.04%; F-area, 29.33%).


Introduction
Infrared small-target applications are widely exploited in many fields such as maritime surveillance [1][2][3][4], early warning systems [5][6][7], tracking [8][9][10][11][12][13], and medicine [14][15][16][17][18], to name but a few. Heat reveals a critically salient characteristic against the local background of infrared images, and it can be a conclusive, distinct clue for effortless object segmentation. Nevertheless, infrared small-object segmentation (ISOS) is a challenging task because of heavy background clutter and sensor noise, as shown in Figure 1. Suppose local patch-based handcraft ISOS methods [19][20][21] perform the segmentation task in those scenarios, i.e., under heavy clutter or sensor noise. What happens next in the confidence map? Essentially, this gives rise to numerous false alarms, from employing too narrow a receptive field, perceiving only local visual-context information, and using too few filters learned from only a handful of cases. That condition makes it hard to distinguish the foreground from various complex background patterns.
Viewed in this light, performance in the ISOS task is contingent on learning ability, that is, on how well the model distinguishes the complex background patterns surrounding the objects of interest, rather than the objects themselves. Specifically, this tendency follows from three major characteristics of infrared small objects. First, owing to either a long distance from the sensor or its actual size, the object appears very small in the image and its inner pattern converges to one point. This phenomenon makes almost all category classification impossible. Second, following the first effect, if objects

We hypothesize that one generator is more efficient than two in terms of direct information interchange via perceptrons for one goal, i.e., ISOS. At the end of various analytical experiments, we succeeded in obtaining what we call the Amorphous Variable Inter-located Network (AVILNet). This successful network tunes the balance between recall and precision quite well.

Figure 2 [22]: On the left are two generators (G1 and G2), and on the right is one discriminator. In the two generators, the blue number within each layer is the dilation factor. For the discriminator, the height, width, and channel number of the output feature maps are marked beside each layer. The two generators compose the dual-learning system, concentrating on opposing objectives while sharing information (e.g., the L1-distance between G1 and G2 feature maps) with the discriminator to alleviate radically biased training.
AVILNet comprises various methods following the latest trends, each of which achieved the SOTA in its own field, e.g., GridNet [24], Cross-Stage-Partial-Net (CSP-Net) [25], atrous (dilated) convolution [26], and DenseNet [27]. Aside from these, other related methods are exploited in the various experiments in Section 5.2. The contributions of this paper can be summarized as follows.
(1) We introduce AVILNet, inspired by GridNet [24]. This deep CNN was designed to increase learning ability and to distinguish the foreground from a complicated background. In addition, the network finds the global optimum very quickly (within two epochs), which means AVILNet may provide better performance with more extensive training data. To alleviate the gradient vanishing problem, which is intensified by a complex deep structure, we adopted CSP-Net [25] with dilated convolution [26] in AVILNet.
(2) To obtain a suitable model and establish its validity, we performed various experiments step by step, building a hypothesis and confirming it. These numerous experiments show that our method was arrived at deliberately, not by coincidence.
(3) We introduce the novel metric F-area, which multiplies AP by the best F1-score. It is a more trustworthy metric than the area under the receiver operating characteristic (ROC) curve (AUC) when true negatives exceed 99% of the pixels in an image. In addition, it simultaneously considers performance in terms of real applications and potential capability.
(4) We compared AVILNet with relevant state-of-the-art small-object segmentation methods on public data from ICCV2019 ISOS. The results demonstrate the superiority of AVILNet, beating the existing best record.

Related Work
ISOS methods can generally be categorized into two groups, depending on whether or not they are CNN-based. Recently, fully convolutional network (FCN) [28] based methods have been applied to ISOS [7,22,[29][30][31]. In particular, DataLossGAN [22] and Asymmetric Contextual Modulation (ACM) [31] have obtained magnificent performance on their datasets. Meanwhile, handcraft-based methods [6,20,21,[32][33][34][35][36][37][38] have also been studied. Researchers have analyzed local properties in scenes to distinguish the foreground from the background, patch by patch, and have applied a few well-designed filters to whole images in a sliding-window manner. Both groups finally adopt a mono- or adaptive-threshold policy applied to a grayscale confidence map.
In this section, we briefly visit the main properties and problems of ISOS in terms of CNN structures, together with related successful studies.

Deeper Network vs. Vanishing Gradient
In a broad sense, the ISOS task divides into two objectives. The first is background suppression; the second is intensification of the objects of interest. Most traditional approaches [6,20,21,[32][33][34][35][36][37][38] execute the two objectives sequentially with only a few filters, which are derived through human analysis. CNN-based methods perform the whole task in a latent space comprising numerous parameters in the network, regardless of the sequence. Each kernel of a CNN, as a unique filter, aims to suppress background clutter or sensor noise, or to enhance the object of interest.
Consequently, owing to the limits of manual design, the handcraft-based methods construct only a few filters by observing several cases. This strategy gives rise to a phenomenon whereby generalization performance is decided by how well the selected cases represent the distribution of all the test data.
In comparison, the CNN-based methods [7,22,[29][30][31] have numerous filters, which are sufficient to cover the distribution of all the test data. This is why a CNN is superior to the handcraft-based methods in terms of absolute performance.

Attention and the Receptive Field
A deeper network, comprising multiple layers, could be a solution for increasing learning capability. However, it is hard to preserve the features of small infrared objects in deep layers, so re-awakening strategies, which remember prior or low-level layer feature information, have been studied [24,31,44,45].
Luong et al. [44] proposed global and local attention for neural machine translation: global attention takes all prior memories, and local attention takes several prior memories at the current prediction. In computer vision, these attention concepts have generally been utilized in a slightly different way: global attention refers to all prior feature maps, while local attention refers to several prior feature maps relative to the current layer. The receptive field is defined as the size of the region in the input that produces a given feature map [46].
In segmentation, deep CNNs should be able to access all the relevant parts of the object of interest to understand context. That is why multi-scale approaches are commonly taken to accelerate the performance [24,31,45,47].
GridNet [24] exploits a grid organization with a two-dimensional architecture for multi-scale attention. The grid organization has two architectural advantages. One is that GridNet can simultaneously control the attention and the receptive field with two parameters (i.e., width and height). The other is that GridNet follows the rule of the grid, so the produced model always retains a grid shape; this second advantage reduces the effort needed to set the detailed configuration of the layers. Furthermore, GridDehazeNet [45] improves on GridNet by replacing the residual block [40] with a dense block [27], and replacing simple addition with attention-based addition at each confluence in the network.
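To illustrate how the two parameters fix the whole architecture, the following sketch enumerates the feature-map shape at every node of a GridNet-style grid, assuming channels double and spatial size halves per row; the helper name `grid_shapes` and the bookkeeping convention are ours, not GridNet's code.

```python
def grid_shapes(height, width, st=34, in_hw=(128, 128)):
    """Feature-map shape at every node of a GridNet-style grid.

    Rows index scale (each row down halves spatial size and doubles
    channels); columns index successive processing blocks at one scale.
    st is the stream depth at the first row; in_hw is the input size.
    Hypothetical helper for illustration only.
    """
    shapes = {}
    for r in range(height):
        c = st * (2 ** r)                    # channels double per row
        h, w = in_hw[0] >> r, in_hw[1] >> r  # spatial size halves per row
        for col in range(width):
            shapes[(r, col)] = (c, h, w)
    return shapes

shapes = grid_shapes(height=6, width=4, st=34, in_hw=(128, 128))
```

With the configuration used later in the paper (st = 34, height = 6, width = 4, 128 × 128 patches), the deepest row carries 1088-channel maps of size 4 × 4.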
Chen et al. [48] successfully adopted dilated convolution from the semantic segmentation field [26]. Its benefit is that it spreads the receptive field without increasing the number of parameters, and concomitantly alleviates the lattice patterns caused by overlap accumulation. Dai et al. [31] proposed ACM for ISOS. This module takes a two-directional path (top-down and bottom-up) to preserve prior feature information.
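The parameter-free growth of the receptive field under dilation can be checked with a short calculation; `effective_kernel` and `stacked_receptive_field` are illustrative helpers, not from any cited implementation.

```python
def effective_kernel(k, d):
    """Effective extent of a dilated convolution: k taps spread d apart
    cover k + (k - 1) * (d - 1) input positions, with the same number of
    trainable weights as an ordinary k x k kernel."""
    return k + (k - 1) * (d - 1)

def stacked_receptive_field(layers):
    """Receptive field of a stack of stride-1 (kernel, dilation) layers."""
    rf = 1
    for k, d in layers:
        rf += effective_kernel(k, d) - 1
    return rf
```

Three stacked 3 × 3 layers with dilation factors 1, 2, and 4 already see a 15 × 15 region while using only 27 weights per channel.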
We analyzed GridNet [24] for diverse experiments because of its pliable structure. Furthermore, GridDehazeNet [45] was chosen as the baseline.

Data-driven Loss and Ensemble
DataLossGAN [22] proposed a data-driven loss to obtain a delicate balance between miss detection (MD) and false alarm (FA). DataLossGAN consists of two generators that concentrate on mutually incompatible objectives, and in the test phase, decision making is done in an ensemble manner. This study inspired us, so AVILNet adopts both the data loss and the ensemble manner. However, in our case, we construct a sub-network within the main network. A detailed explanation is given in Section 3.2.2.

Generative Adversarial Network
A generative adversarial network consists of two parts: a generative network (generator) and an adversarial network (discriminator) [49]. In the training step, the goal of the generator is to learn to generate fake data following the distribution of real data, while the aim of the discriminator is to learn to distinguish fake data from real data.
The auto-encoder is a generalized approach for image-to-image segmentation [47,50,51]. If an auto-encoder is considered as the generator, a generative adversarial network becomes available. In particular, in our case, an input vector Z drawn from a latent space is replaced with an input image Z ∈ R^{W,H,C}, so the whole model can be regarded as a conditional GAN (cGAN) [52].
DataLossGAN [22] also used this methodology and exploited a discriminator to mitigate the output difference between the two generators. Our proposed training strategy does not share the outputs via the discriminator, but instead has direct connections within the generative network.
Our generative network can be divided into two parts: the main network and the sub-network. The generative network therefore operates in an ensemble manner when making the final decision on the confidence map.

Resizing during Pre-Processing
Generally, most CNN-based semantic segmentation studies have focused on commercial usage [48,53]. Therefore, they do not need to consider small objects within a small image. This history leads to resizing the input image so it can go through a large-scale backbone in small-object applications [31,54,55]. In the field of ISOS, ACM [31] adjusts the size of the input image to 512 × 512, and then randomly crops one segment of size 480 × 480 from within the whole image area. On the other hand, DataLossGAN [22] adopted a patch training strategy in which input patches are sized at 128 × 128. We experimented with both of the above strategies to make an impartial observation of each effect.
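The two pre-processing strategies can be sketched as follows, with NumPy and nearest-neighbour resizing standing in for the real pipelines; the function names and details are ours, not those of ACM or DataLossGAN.

```python
import numpy as np

rng = np.random.default_rng(0)

def resize_then_crop(img, resize_to=512, crop=480):
    """ACM-style pre-processing: resize to resize_to x resize_to
    (nearest-neighbour here, for illustration), then take one random
    crop of crop x crop."""
    h, w = img.shape
    ys = np.arange(resize_to) * h // resize_to
    xs = np.arange(resize_to) * w // resize_to
    resized = img[np.ix_(ys, xs)]
    y0 = rng.integers(0, resize_to - crop + 1)
    x0 = rng.integers(0, resize_to - crop + 1)
    return resized[y0:y0 + crop, x0:x0 + crop]

def random_patch(img, patch=128):
    """DataLossGAN-style pre-processing: sample one patch x patch window
    directly from the raw image."""
    h, w = img.shape
    y0 = rng.integers(0, h - patch + 1)
    x0 = rng.integers(0, w - patch + 1)
    return img[y0:y0 + patch, x0:x0 + patch]
```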

The Proposed Method
This section discusses how our approach handles the above issues. After that, an overview of our proposed model is shown. Following that, details of the structure and the loss function are presented.

Our Approach
To find a solution for ISOS, we conducted experiments to determine what architecture is truly proper for ISOS. Most backbone research, unfortunately, was performed on the assumption that a large-scale object lies in a large image [40,56]. In addition, a laminated backbone structure is not pliable enough for executing experiments with various configurations. For this reason, our first objective was finding a flexible backbone for a varied receptive field or multi-scale attention, either way. The GridNet [24] structure perfectly corresponds to this objective, as shown in Figure 3. We set GridDehazeNet [45] with a modified loss function as our initial study because its authors published their clean code. GridDehazeNet is also based on GridNet, but replaces feature addition with attention-based addition, which boosts performance; the only difference between D8 and D17 in Table 1 is whether attention-based addition is exploited. Therefore, we set GridDehazeNet as the baseline. AVILNet is inspired by GridNet [24], whose 'grid' structure is extremely pliable. In terms of network perception, the horizontal stream (black arrow pointing right) enforces local properties, and the vertical stream (black arrow pointing downward) enforces global properties. To transform the grid structure, we only have to set two parameters ('width' and 'height').

The Amorphous Variable Inter-Located Network
In this subsection, we do not deal with the experiments in detail, but concentrate on AVILNet itself. As illustrated in Figure 4, AVILNet consists of one generator and one discriminator, like cGAN [52]. This is a common strategy, but the inner architecture of the generator is novel. Our most valuable choice was adopting a grid structure as the development backbone; its greatly pliable, organic architecture allowed us to run various experiments. As shown in Table 1, AVILNet (denoted as D8) was obtained after plenty of attempts. Increasing the width of the generator improves the representational ability for shape, local context, and semantic information. Conversely, increasing the height improves the receptive field, global context, and deep semantic information. As a consequence, AVILNet responds flexibly to the given task by changing height and width. In short, AVILNet is amorphous and variable.
(a) AVILNet: Generator. In this picture, height and width are 6 and 4, respectively. Note that the network is rotated 90 degrees to the left for easier viewing. The black whole numbers are the output feature map size of each layer. We set the depth of the stream channel (denoted as st) of our generator at 34.
(b) AVILNet: Discriminator. The black numbers to the right of each layer are kernel size, stride, and padding, and the other side is the output feature map size.

Cross-Stage Partial Dense Block and Dense Dilation Block
The Q1 baseline only uses a dense block [27]. Due to duplicate gradient information [25], it wanders in a local minimum. The details of Q1 are shown in Table 2. To overcome this problem, we apply the cross-stage partial (CSP) strategy to the dense block of Q1; the result is called the cross-stage dense block (CDB). The original CSP strategy, which records the best performance on a large-scale object dataset [57], showed rather poor performance compared to the last-fusion strategy in our case. This is because the secondary transition layer distorts the feature information of a small object. We tackle this issue by replacing the original CSP with last-fusion within the CDB. Furthermore, following successful applications of dilated convolution [22,48], we propose the cross-stage partial dense dilation block with last-fusion (CDDB-L), as shown in Figure 5. This boosts our task when used in a suitable proportion.

Figure 5. Unlike the original CSP [25], we take the last-fusion strategy where, in the final processing, addition is modified to concatenation. The dilation layer in CDDB-L improves the quality of the segmentation task through alternative sampling [48]. The blue number within each layer is the dilation factor.

The ablation study observing the effects of the diverse strategies is shown in Table 2. Table 2. Ablation study, step by step, in terms of the discriminator, the activation function, the extension of gw and nd, CSP strategies, dilation, and the blocks (Dense, Res, ResNext). The baseline Q1 sets gw and nd at 16 and 4, respectively. Bold denotes the best scores. The growth rate and the number of dense layers in CDB-L or CDDB-L are decided by the hyper-parameters gw and nd. The dilation factor at the θ-th layer in CDDB-L is calculated from the layer index θ.
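A minimal sketch of the channel bookkeeping in a last-fusion block, under our reading of the CSP split (one half passes through the dense layers, the other half bypasses them, and the two parts are fused once at the end by concatenation); the helper name and the even split are assumptions, not the authors' code.

```python
def cddb_l_channels(c_in, gw=24, nd=7):
    """Output channel count of a CSP dense block with last-fusion.

    The input is split channel-wise into two halves. One half runs
    through nd dense layers, each concatenating gw new channels; the
    other half bypasses them. Instead of a secondary transition layer
    (original CSP), the two parts are fused by a single concatenation
    at the end (last-fusion). Illustrative bookkeeping only.
    """
    dense_part = c_in // 2
    bypass = c_in - c_in // 2
    for _ in range(nd):
        dense_part += gw          # dense growth: gw channels per layer
    return dense_part + bypass    # one fusion at the very end
```

With the stream depth st = 34 and the paper's gw = 24, nd = 7, a block emits 34 + 7 × 24 = 202 channels before the following transition.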

Multi-Scale Attention-Based Ensemble Assistant Network and Feature-Highway Connections
One of our main contributions is that we successfully analyzed the feature-highway connections and formulated their structure. An overview of the feature-highway connections is shown in Figure 6. Following the last-fusion strategy, feature-highway connections are consequentially built within the generator so that information flows smoothly without congestion. In particular, they work to assist the regularization of object shape and context at the final decision in an ensemble manner, like ResNext [41]. The equations representing the above can be expressed as follows. Let ζ, x, and y be the group residual block, the input, and the output, respectively; in its standard form, one ResNext [41] block with cardinality C is then

y = x + ∑_{i=1}^{C} ζ_i(x).     (2)

Our generator denotes height and width as h and w, respectively. In addition, in a CDB-L or CDDB-L, the feature maps of the generator in a stage are split channel-wise into two parts, x = [x_0, x_1]. The x_0 part undergoes the feature-highway connections, and y_0 is the output from the connections. We let a_{i,j} denote a feature vector at the (i, j) coordinates of the down-stream, D_{i,j} the down-sampling block (DSB), b_{i,j} a feature vector at the (i, j) coordinates of the up-stream, U_{i,j} an up-sampling block (USB), and x and y the input and output, respectively. For simplicity, we leave out the weighted-sum operation, and we replace the symbols x_0 and y_0 with x and y. Note that the (i, j) coordinates indicate each addition confluence of the grid. Details on the DSB and the USB are shown in Figure 7. Our feature-highway connections can then be expressed recursively: on the down-stream, each a_{i,j} is obtained by applying D_{i,j} at the preceding confluence (Equation (3)); on the up-stream, each b_{i,j} is obtained by applying U_{i,j} (Equation (4)); composing the two streams finally yields Equation (5), which maps x to y. Instead of simple bilinear up-sampling, two convolutional processes operate in the USB. This strategy makes our feature-highway connections trainable.
Following Equation (5), we can illustrate the overview of our proposed method, i.e., a multi-scale attention-based ensemble assistant network (MEN), as shown in Figure 8. This latent network gives us profound insight into how our generator preserves the low-level information of small objects at a high level without any short connections. In comparison to ResNext [41], the MEN has attention-based flow, represented by the interconnected black lines in Figure 8, which encourages the inter-exchange of semantic information between decision groups. In addition, ResNext [41] performs in an ensemble manner through channel-wise grouping, but our method exploits all the channels. The former obtains an advantage in terms of reducing computational resources, but its blocks (ζ) have nothing to communicate to each other. ResNext [41] concentrates on improving performance at the decision level in an ensemble manner, rather than in an attention manner. On the other hand, our generator has a much smaller cardinality (ResNext [41] sets the cardinality at 32 by default, but our generator sets it at 2) while having plentiful inter-connections for attention. As shown in Table 3, the results of the experiment verify that our method (MEN) is more suitable for the ISOS task. To sum up, we regard the MEN as a sub-network supporting the main network.

Over-Parameterization
Allen-Zhu et al. [58] explained the relationship between network parameters and learning convergence time. In detail, they suggested that if the input data do not degenerate, i.e., x_i ≠ x_j for any pair x_i, x_j ∈ R^dim with i ≠ j, and the network is over-parameterized, it can indeed be trained by regular first-order methods (e.g., SGD) to the global minima. For convolutional neural networks, this theorem can be stated schematically as

T = O( poly(n, L, d) · δ^{-2} · log(1/ε) ),

where T is the number of iterations; O is the symbol for time complexity; n, L, and d are the number of samples in the training dataset, the number of layers, and the number of pixels in the feature maps, respectively; δ denotes the relative distance between two training data points; and ε is the objective error value. Furthermore, Allen-Zhu et al. [59] compared the relationship between the number of layers and the number of hidden neurons. Following these two conclusions, our AVILNet generator is designed to have many more trainable parameters and layers than other state-of-the-art methods. Thanks to this strategy, our method obtains wonderful convergence speed, as shown in Table 4: AVILNet educes maximum performance within two epochs.

Formulation
AVILNet's loss consists of three parts: an adversarial loss, an L2 loss, and a data loss. The data loss was adopted from a successful study [22]. In a cGAN [52], the two networks (i.e., generator G and discriminator D) apply adversarial rules to their counterparts. The generator makes a fake image to fool the discriminator, while the discriminator distinguishes it from the ground truth. For the discriminator, the adversarial loss can be expressed as

L_adv = E[log D(I, I_GT)] + E[log(1 − D(I, G(I)))],

where I and I_GT denote the input image and the ground truth, respectively. For the generator, the L2 loss L_2 and the data loss L_data can be expressed as

L_2 = ||P − I_GT||_2^2,
L_data = λ_MD ||(P − I_GT) ⊗ I_GT||_2^2 + λ_FA ||(P − I_GT) ⊗ (1 − I_GT)||_2^2,

where P means G(I), and λ_MD and λ_FA are hyper-parameters that control the balance ratio of miss detection to false alarm [22]. The ⊗ operator means element-wise multiplication. Finally, the complete objective of our network can be expressed as

L = α_1 L_adv + α_2 L_2 + α_3 L_data,

in which α_1, α_2, and α_3 are the algorithmic coefficients decided by heuristic experiments. The setting details are provided in the implementation details in Section 4.2. DataLossGAN [22], to make up for the fault of the L2 loss, separates it into two segments in terms of miss detection and false alarm; finally, L_2 is replaced with L_data, so there is no L_2. However, is that sufficient to boost performance? Our ablation study shows that, in a specific setting where the growth rate is 24, L_2 encourages training stability. Consequently, it leads to better performance.
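Under our reading of the data-driven loss, a NumPy sketch follows: the miss-detection term is masked by the ground-truth foreground and the false-alarm term by the background, with the default weights taken from the implementation details; this is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def data_loss(pred, gt, lam_md=100.0, lam_fa=10.0):
    """Data-driven loss in the spirit of DataLossGAN: the squared error
    is split into a miss-detection term (errors on foreground pixels,
    masked by gt) and a false-alarm term (errors on background pixels,
    masked by 1 - gt). lam_md / lam_fa sets the recall/precision
    balance. Sketch under our reading of the loss."""
    err = pred - gt
    md = np.sum((err * gt) ** 2)           # errors on foreground pixels
    fa = np.sum((err * (1.0 - gt)) ** 2)   # errors on background pixels
    return lam_md * md + lam_fa * fa
```

Raising lam_md punishes missed foreground pixels more heavily (favouring recall); raising lam_fa punishes spurious responses (favouring precision).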

Implementation Details
In this section, we describe our whole network architecture in detail. Our main proposal is the generator because, in real applications, the generator operates separately from the whole network. We provide the configurations and the input/output dimension tables of our networks. Our generator, based on GridNet [24], has a complex structure, so we do not describe every detail; this sub-section concentrates on what was practically changed compared to the baseline [45].

Generator
An overview of our proposed generator is shown in Figure 4a, and the segment modules are shown in Figures 5 and 7. The configurable hyper-parameters are the stream (st), growth rate (gw), number of dense layers (nd), width (w), height (h), learning rate (lr), λ_{1,2}, and α_{1,2,3}. We set them at 34, 24, 7, 4, 6, 10^{-4}, (100, 10), and (1, 10, 100), respectively. The st decides the depth of the feature map on the first floor. The gw and the nd decide the growing channel depth and the number of dense layers in CDB-L and CDDB-L, like DenseNet [27]. The w and h decide the width and height of our generator. To improve understanding of the roles of the width and height of our generator, we added Figure 9. In Figure 4a, every block on the same floor (height) releases the same shape of feature map by padding, represented by the same color of arrows. Before network processing begins in earnest, the input image goes through a shallow convolutional layer (kernel size = 3, stride = 1). Similarly, before the network predicts the confidence map, the last feature map goes through a shallow convolutional layer, too. The former works as the channel-expanding module and decides the capacity of information flow throughout the network. Contrarily, the latter fuses information from the last feature map and makes the final decision (i.e., a prediction).
Each time it passes through the DSB, the feature map's depth is doubled and the width and height are each halved. Conversely, passing through the USB, the depth of the feature map is cut in half, and the width and height are doubled. The ratio for the CDDB-L is calculated with Equation (11); note that the unit for the CDDB-L ratio is columns:

number of CDDB-L columns = (width − 1) // 3.     (11)

The selective strategies are resizing (RS) and shuffle (SF). Resizing is shown in Figure 10, and shuffle is shown in Figure 11. In resizing, we take the route for case 3. Each route in resizing does not show a dramatic performance gap, but the shuffle strategy is not the same. We cautiously adopted the direct-through shuffle strategy as the end process within every CDB-L and CDDB-L because of the validity of our sub-network, as explained in Section 3.2.2. The exact configurations of each block in the generator are shown in Table 5, and the input/output dimensions are shown in Table 6. Table 5. Block details for CDB-L, CDDB-L, USB, and DSB. The growth rate and the number of dense layers for the CDB-L and CDDB-L are gw = 24 and nd = 7, respectively.
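The shape rules for the DSB and USB, together with Equation (11), can be summarized in a few lines; the function names are ours, for illustration only.

```python
def dsb_shape(c, h, w):
    """Down-sampling block: channel depth doubles, spatial size halves."""
    return 2 * c, h // 2, w // 2

def usb_shape(c, h, w):
    """Up-sampling block: channel depth halves, spatial size doubles."""
    return c // 2, 2 * h, 2 * w

def num_cddb_l_columns(width):
    """Equation (11): how many grid columns use dilated (CDDB-L) blocks."""
    return (width - 1) // 3
```

With the configuration w = 4 used in this paper, exactly one column of the grid is built from CDDB-L blocks, and the remaining columns use CDB-L.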

Discriminator
Our discriminator was motivated by U-GAT-IT [60], which exploits spectral normalization [61] to stabilize the training of the discriminator. Likewise, we applied spectral normalization at the end of each fully connected layer within the discriminator. One average-pooling layer and one max-pooling layer are placed in front of each fully connected layer to escape the input-size dependency. This is a decisive difference compared with DataLossGAN [22], whose whole structure is illustrated in Figure 2. Because there is no apparatus for compressing the feature map, DataLossGAN is dependent on the input size. The whole structure of our discriminator is illustrated in Figure 4b. The specific configurations and the input/output dimensions of the discriminator are shown in Table 7.

Experimental Results
In this section, we compare AVILNet with other related state-of-the-art ISOS methods and one small-object detection method (YoloV4 [23]). In addition, plentiful results from the ablation study prove that our approach is reasonable.

Methods in Comparison
We compared AVILNet with two groups: CNN-based and handcraft-based methods. In the CNN group, two networks were not originally designed for the ISOS task: DeepLabV3 [48] is a generic large-object segmentation method, and YoloV4 [55] is a detector following the single-stage, one-shot affiliation method. They are based on CNNs, but are not designed for ISOS; nevertheless, they are SOTA in their own fields, so to obtain extensive discernment, they were included in the CNN group. The other methods, DataLossGAN [22] and ACM [31], were designed for ISOS from beginning to end, and they achieved incredible performance on their datasets. For a non-partisan comparison, all methods were measured on the ICCV 2019 ISOS Single dataset only. The other group (handcraft-based) consisted of IPI [6], MPCM [32], PSTNN [37], MoG [62], RIPT [36], FKRW [33], NRAM [38], AAGD [34], Top-hat [13], NIPPS [35], LSM [19], and Max-median [21].

Hardware/Software Configuration
The experiment was conducted on a single GPU (RTX 3090) with a 3.4 GHz four-core CPU. All the handcraft-based methods were implemented in Python or Matlab, and all the CNN-based methods were implemented in PyTorch. All weights within the networks underwent initialization [63], and all models were trained from scratch. For fair competition, we singled out the best results from among all epochs for every method; the maximum number of epochs was set to 130 for each method. The overall detailed settings of all methods are shown in Table 8.

Datasets
Wang et al. [22] published the ICCV 2019 ISOS datasets, consisting of two parts: AllSeqs and Single. The Single dataset contains 100 real, exclusive infrared images with different small objects, and AllSeqs contains 11 real infrared sequences with 2098 frames. The published version of the datasets is slightly different: the Single dataset remains unchanged, but AllSeqs is provided not as raw images, but as 10,000 synthetic augmented frames. The synthetic augmented frames consist of 128 × 128 image patches randomly sampled from the raw images. This corresponds to configuration I in [22]. Accordingly, we conducted experiments under configuration I, in which AllSeqs was used for training and Single was used for testing. Some Single examples are shown in Figure 1. All the training samples in AllSeqs have a size of 128 × 128. To increase the difficulty of the task and to ensure an accurate evaluation of generalization capability, the backgrounds in the test images are not seen in the training images [22]. A detailed description of the ISOS datasets is given in Table 9. Table 9. ISOS datasets: Nos. 1-11 are the eleven sequences in AllSeqs, and No. 12 is the single-frame image dataset Single.


Evaluation Metrics
In a CNN-based ISOS method, each pixel value indicates confidence in the object's existence. Following [22], we set the threshold value at 0.5 to make a binary confidence map in all methods; in short, pixels with a score of less than 0.5 were ignored. From that state, precision-recall, the receiver operating characteristic, and the F1-score were obtained by increasing the threshold value in steps of 0.1. Finally, these measured data were used to obtain the average precision and the area under the ROC curve, with the new F-area metric added to weigh the capability for real applications. Precision, recall, the true positive rate (TPR), and the false positive rate (FPR) are calculated using Equation (12):

Precision = TP / (TP + FP),  Recall = TPR = TP / (TP + FN),  FPR = FP / (FP + TN),     (12)

where TP, FP, TN, and FN stand for true positive, false positive, true negative, and false negative. The F1-score is calculated using Equation (13):

F1 = 2 · Precision · Recall / (Precision + Recall).     (13)
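For reference, Equations (12) and (13) in code form; `prf_metrics` is an illustrative helper.

```python
def prf_metrics(tp, fp, tn, fn):
    """Precision, recall (= TPR), FPR, and F1-score from confusion counts,
    following Equations (12) and (13)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also the true positive rate
    fpr = fp / (fp + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, fpr, f1
```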

New Metric: F-Area
The F1-score is used to measure harmonious performance. This is valuable for real applications, since the threshold value must be fixed when they operate. While operating with a fixed threshold value, however, the methods cannot educe their full potential performance (i.e., average precision). To consider the two cases simultaneously, we introduce a new metric, F-area, which takes into account both the harmonious and the potential performance aspects of any technique. F-area is simply obtained as the product of the two values:

F-area = AP × best F1-score.
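A minimal sketch of computing F-area from sampled operating points on the precision-recall curve, with AP approximated by the rectangle rule over recall; the helper name and the approximation choice are ours, not a prescribed procedure.

```python
def f_area(precisions, recalls):
    """F-area = AP x best F1-score, estimated from sampled (precision,
    recall) pairs along the PR curve. AP is approximated by summing
    precision x (increase in recall); best F1 is the maximum F1 over
    the sampled operating points. Illustrative sketch only."""
    pts = sorted(zip(recalls, precisions))   # sort by recall
    ap, prev_r, best_f1 = 0.0, 0.0, 0.0
    for r, p in pts:
        ap += p * (r - prev_r)               # rectangle rule over recall
        prev_r = r
        if p + r > 0:
            best_f1 = max(best_f1, 2 * p * r / (p + r))
    return ap * best_f1
```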

Comparison with State-of-the-Art Methods
Evaluations of the performance of all other state-of-the-art methods are reported in Table 4 and illustrated in Figure 12. In addition, some output results (denoted a to e) are shown in Figures 13 and 14. Our proposed AVILNet outperformed all the existing CNN-based and handcraft-based ISOS methods and detectors (i.e., Yolo V4) in terms of all metrics. Thanks to over-parameterization, AVILNet converges to a global minimum within just two epochs, overwhelmingly fewer iterations than the other methods (refer to Figure 15). On the other hand, ACM-FPN [31] recorded the second-best performance despite having the fewest parameters and the minimum computing cost, although it required the highest number of iterations (91 epochs); that is an inevitable consequence of the trade-off. In particular, DeepLab V3 [48] recorded the worst performance among the CNN-based methods. Since DeepLab V3 was designed for large-scale objects, it is hard for it to forward small-object information to the deep semantic layers. Yolo V4 was also designed for large-scale objects; nevertheless, how did it obtain such fine performance? The answer is that the backbone of Yolo V4 is based on the cross-stage partial (CSP) network, which tends to forward information well into the deep semantic layers. AVILNet also exploits CSP to surmount the vanishing-information problem. Meanwhile, all the handcraft-based methods obtained poor performance. This stems from their overwhelming lack of filters for distinguishing objects from the background, compared to the CNN-based methods.

Ablation Study
In this section, we explore the following questions to ensure that the contributions of our proposed model components are reasonable. Then we answer all the questions point by point.

Question 1. Does a thin pathway really block information flow?
Question 2. In ISOS, is resizing the input image instructive?
Question 3. Does L2 stabilize the regularization from over-parameterization?
Question 4. Height vs. width: which is more important?
Question 5. Do more dilation layers simply lead to better performance?
Question 6. How effective is the cross-stage partial strategy?
Question 7. What is the best shuffle strategy in last-fusion?
Question 8. Feature addition vs. attention-based feature addition: which is best?
Question 9. Is the ratio of λ_MD,FA suitable for our task?
Question 10. Is mish activation better than leaky ReLU activation in our task?
Question 11. Is the learning rate we used the best?
Question 12. Is the dual-learning system superior to the single one in our case?
All hyper-parameters and selective strategies, except for the discriminator, were varied for the ablation study. We performed extensive experiments and divided them into four branches by subject: the single network (denoted as D), the dual networks (denoted as T), the MEN, and a step-by-step approach (denoted as Q). These experiments gave us considerable insight into finding the best configuration, as shown in Tables 1 and 10 and, especially, Table 2, which clearly confirms the effectiveness of each contribution, step by step, from the baseline to the proposed method. Specifically, Table 1 contains the experiments for the single network, Table 10 those for the dual networks, Table 3 those for the MEN, and Table 2 those for the step-by-step approach. The notations D_index, T_index, and Q_index refer to entries in Tables 1, 10, and 2, respectively. In this section, we skip the expression indicating each table and instead directly designate each index (e.g., D1, T1, Q1). All the graphical results for these tables are shown in Figures 16-18.

Question 1. Thin Pathway Blocks Information Flow
D1 has a stream thickness of 8 and recorded its best performance after 123 epochs, the slowest convergence time; this stems from the fact that such a thin pathway blocks information flow. Likewise, T4 has a stream thickness of 16 and recorded its best performance after 21 epochs, the slowest convergence time in group T (Table 10). To sum up, a pathway that is too narrow compared to the amount of information within each block not only creates a gradient bottleneck, but also leads to poor performance.

Question 2. In ISOS, Resizing the Input Image Is Adverse to the Performance
To confirm the effectiveness of the resizing strategy, we compared D2 with D3, and D17 and D18 with D8. D2 and D20 follow the route for case 1, and D19 follows the route for case 2 (see Figure 10). First, in the training phase, resizing the input image is not beneficial. Furthermore, in the test phase, resizing the test image is genuinely deleterious. The route for case 1 reduces the F-area (Fa) performance by up to 14%, and the route for case 2 reduces it by 7.9%.

Question 3. L2 Slightly Stabilizes Regularization from Over-Parameterization

The effect of L2 is ambiguous in our case, and we observed mixed results. This loss term consistently reduces performance on the subsets (D3, D4) and (T5, T6); however, the subsets (D8, D21) and (T7, T8) show the opposite results. This is because of the difference in the number of parameters: the former have fewer trainable parameters than the latter. D3 and T5 have 62.08M and 76.92M parameters, respectively, while D8 and T7 have 80.98M and 161.96M parameters, respectively.

Question 4. The Height Has to Be Longer than the Width

An important decision in GridNet [24] is setting the ratio of the height to the width of the network. The height mostly affects the maximum receptive field; conversely, the width affects the maximum number of stacked layers on each floor. To determine whether height or width is more important, we conducted experiments D5 and D11. In D5, we swapped the height and width of D3; as a result, D5 performed more poorly than D3. Additionally, in D11, to observe the effect of the receptive field, we halved the height of D8, which reduced the performance by 22.7%. In conclusion, the height has to be longer than the width of the network.

Question 5. Only the Proper Ratio of CDB-L to CDDB-L Can Improve the Performance

DataLossGAN [22] obtained successful performance by adopting a dilation layer [48]. Motivated by this success, we applied dilation layers in AVILNet; CDDB-L has dilation layers, as shown in Figure 5. Then, how many CDDB-Ls are suitable? D6 and T1 exploit only CDDB-L. As a result, compared to D5, the performance of D6 decreased by 11%, but compared to T3, the performance of T1 increased by 3.5%. To concentrate on the single-network study, we took the strategy following Equation (11).

Question 6. The Cross-Stage Partial Strategy Is Imperative to Overcome the Problem of Gradient Vanishing

One of our main contributions is applying CSP [25] in AVILNet. In ISOS, information vanishing within a deep CNN is a significant issue [22,31].
To tackle this, we applied CSP to each dense block with last-fusion (denoted as CDB-L). This strategy alleviates the issue and dramatically improves performance. To confirm this, we conducted an experiment on D8 without CDBs (shown as D12). Compared to D8, the performance of D12 decreased by 24%. This result indicates that adopting CSP is essential.
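To illustrate, the CDB-L idea (split the channels, transform only one part through the dense layers, and fuse the bypassed part back in a single concatenation at the end of the block) can be sketched in NumPy. The 50/50 split and the toy layers below are illustrative assumptions, not the exact AVILNet configuration:

```python
import numpy as np

def csp_dense_block_last_fusion(x, dense_layers, split=0.5):
    """Sketch of a CSP dense block with last-fusion (CDB-L) on a
    channel-first tensor of shape (C, H, W), assuming a 50/50 split.
    Part 1 bypasses the dense layers; part 2 is transformed; the two
    are concatenated only once, at the end of the block (last-fusion)."""
    c = int(x.shape[0] * split)
    part1, part2 = x[:c], x[c:]           # cross-stage partial split
    out = part2
    for layer in dense_layers:            # dense connectivity: each layer
        out = np.concatenate([out, layer(out)], axis=0)  # sees all prior features
    return np.concatenate([part1, out], axis=0)          # last-fusion
```

Because `part1` reaches the output untouched, the gradient has a short path around the dense layers, which is what alleviates the vanishing problem described above.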

Question 7. The Direct-Through Shuffle Strategy Is Superior to the Other Strategies

In adopting CSP, we took the last-fusion strategy instead of the original one for embedding the MEN described in Section 3.2.2. D9 and D10 selected shuffle and grain-shuffle, respectively, the two strategies shown in Figure 11b,c. Compared to D8, denoted as AVILNet (Single), their performance decreased by 12% and 14%, respectively. These results underpin the superiority of our MEN.

Question 8. Attention-Based Feature Addition Is Better than Simple Addition
Liu, Xiaohong et al. [45] proposed GridDehazeNet, which replaced the feature addition in GridNet [24] with attention-based feature addition. The difference between them is that the latter uses a weighted sum with trainable weights, as shown in Figure 19. D17 shows the significance of the weighted sum: when D17 took the simple-addition strategy, performance decreased by 13% compared to D8. In our analysis, attention-based feature addition assumes the role of discerning the information flowing along the grid, which allows AVILNet to focus more on the important information.

Question 9. The Default Ratio of λ_MD,FA Is Suitable for the Single Network

The ratio λ_MD,FA is needed for the data loss [22]. In D15 and D16, we set λ_FA at 3, which is 7 lower than in D8, so these configurations concentrate more on reducing miss detections than false alarms. However, this attempt decreased performance by 16% and 18%, respectively. As a result, we followed the default λ_MD,FA setting from [22]. The dual network T2 (which followed the setting in [22]) showed poor performance; therefore, we set λ_FA at 1 for the dual network.
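As a sketch, the difference between the plain and the attention-based feature addition of Question 8 is just a set of trainable scalar weights applied before the incoming streams are merged; the weight values below are illustrative, not learned ones:

```python
import numpy as np

def attention_feature_addition(features, weights):
    """Attention-based feature addition (GridDehazeNet-style): each incoming
    feature map is scaled by a trainable weight before merging. Plain GridNet
    addition is the special case where every weight equals 1."""
    assert len(features) == len(weights)
    return sum(w * f for w, f in zip(weights, features))
```

In training, the weights would be learnable parameters updated by backpropagation; here they are plain floats for illustration.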

Question 10. Mish Activation Is a Better Choice Than Leaky ReLU

Misra et al. [64] proposed the mish activation, which successfully surpasses leaky ReLU (denoted as lrelu) [65]. We replaced lrelu with mish in AVILNet and observed performance increasing by 8.3%. This experiment is shown as D18.

Question 11. The Learning Rate of 0.0001 Is the Best

To adjust the learning rate, we conducted experiments D13, D14, and D18. Finally, we set the learning rate at 10^-4, which is ten times smaller than the baseline's learning rate.

Question 12. The Dual-Learning System Enhances the AP and AUC Well, But Increases the Fa Only Slightly

T8 consists of two D8 networks; the only difference between T8 and D8 is the ratio of λ_MD,FA. As D16 concentrates on λ_MD more than λ_FA and T8 did not record better performance, we can conclude that the dual-learning system has advantages over the single one.
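For reference, the mish activation of Question 10 is defined as x · tanh(softplus(x)); a minimal sketch alongside leaky ReLU follows (the slope 0.01 is a common default for leaky ReLU, not necessarily the one used in our experiments):

```python
import numpy as np

def mish(x):
    """Mish activation: x * tanh(softplus(x)), with softplus(x) = ln(1 + e^x).
    Smooth and non-monotonic, unlike leaky ReLU."""
    return x * np.tanh(np.log1p(np.exp(x)))

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: identity for x >= 0, small slope alpha for x < 0."""
    return np.where(x >= 0, x, alpha * x)
```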

Applying the Experiment Step by Step
To see the effectiveness of each idea at a glance, we conducted the experiments denoted with Q. Additionally, we replaced all the dense blocks [27] of our generator with ResNet [40] or ResNeXt [41] blocks to verify that the dense block [27] is the best choice. All results are shown in Table 2. Note that this subsection is intended to observe the effectiveness of each block (do not confuse it with the sub-network experiment in Section 3.2.2 and Table 3). To better understand the gradient vanishing problem, notice the relationship between Q3 and Q4. Q4 has gw and nd at 34 and 7, respectively, while Q3 has them at 16 and 4. For Q4, the number of epochs needed for convergence decreased from 8 to 4, yet the overall performance deteriorated because Q4 encountered the gradient vanishing problem. Q5 shows that the cross-stage partial strategy is sufficient to overcome this problem. Furthermore, we modified the original CSP strategy into last-fusion, with which our AVILNet (Single) achieved the best score. Experiments Q6 and Q7 were performed to confirm whether other blocks (e.g., the residual block and the ResNeXt block) are valid; their results show that the dense block is superior to the residual and ResNeXt blocks in our task. Additionally, the results in Table 3 show that our sub-network (MEN) performs remarkably better than ResNeXt.

Summary and Conclusions
In this paper, we proposed a novel network, AVILNet, to solve the infrared small-object segmentation problem. Motivated by the necessity for a flexible structure, we constructed the network based on GridNet [24]. Thanks to its amorphous characteristics, we could conduct various experiments to find the optimal structure, which is a great benefit in terms of the time needed to design a network. In addition, our multi-scale attention-based ensemble assistant network performed remarkably well when forwarding low-level information about small objects into the deep high-level semantic layers; it operates as an independent feature extractor within the network and is derived from the last-fusion strategy. Extensive ablation studies and analyses underpin the logical validity of the structure of AVILNet. In addition, to measure performance while taking into account both the harmonious and the potential capability, we introduced a new metric: F-area. The advantage of F-area is striking in infrared small-object segmentation: true negatives overwhelmingly outnumber true positives in this task, so the receiver operating characteristic yields very high scores, and comparisons based on it are less meaningful than those based on other evaluation metrics (e.g., average precision, F1-score). In this situation, our F-area can be a sensible evaluation metric. Finally, the performance results demonstrate the superiority of AVILNet in terms of all metrics on the infrared small-object segmentation task.

Abbreviations
The following abbreviations are used in this paper: