LR-TSDet: Towards Tiny Ship Detection in Low-Resolution Remote Sensing Images

Abstract: Recently, deep learning-based methods have made great improvements in object detection in remote sensing images (RSIs). However, detecting tiny objects in low-resolution images is still challenging. The features of these objects are not distinguishable enough due to their tiny size and confusing backgrounds and can be easily lost as the network deepens or downsamples. To address these issues, we propose an effective Tiny Ship Detector for Low-Resolution RSIs, abbreviated as LR-TSDet, consisting of three key components: a filtered feature aggregation (FFA) module, a hierarchical-atrous spatial pyramid (HASP) module, and an IoU-Joint loss. The FFA module captures long-range dependencies by calculating the similarity matrix so as to strengthen the responses of instances. The HASP module obtains deep semantic information while maintaining the resolution of feature maps by aggregating four parallel hierarchical-atrous convolution blocks of different dilation rates. The IoU-Joint loss is proposed to alleviate the inconsistency between classification and regression tasks, and guides the network to focus on samples that have both high localization accuracy and high confidence. Furthermore, we introduce a new dataset called GF1-LRSD collected from the Gaofen-1 satellite for tiny ship detection in low-resolution RSIs. The resolution of the images is 16 m and the mean size of objects is about 10.9 pixels, which is much smaller than in public RSI datasets. Extensive experiments on GF1-LRSD and DOTA-Ship show that our method outperforms several competitors, proving its effectiveness and generality.


Introduction
Object detection [1][2][3] in remote sensing images (RSIs) aims to locate objects of interest (e.g., ships [4,5], airplanes [6,7] and storage tanks [8,9]) and identify corresponding categories, playing an important role in urban planning, automatic monitoring, geographic information system (GIS) updating, etc. With the rapid development and large-scale application of earth observation technologies, the RSIs obtained from satellites have become increasingly diversified, and the amount of RSIs has greatly increased. Among them, the very high-resolution (VHR) RSIs provide abundant spatial and textural information regarding their targets, and are widely used in target extraction and recognition [10], landcover classification [11], etc. The low-resolution RSIs tend to have a large field of view and contain more targets than VHR images of the same size, therefore attracting much attention in object detection [4,12,13] and tracking [14] tasks.
However, due to the limitations of low-resolution images, objects in low-resolution RSIs only occupy a few pixels (e.g., ships of 8 pixels) which are much smaller than normal, making it difficult to extract sufficient information. Moreover, in real-world scenarios, the quality of RSIs is always affected by the imaging conditions (e.g., illumination and clouds) and the characteristics of the sensors. These distractors make the image background more complicated, further increasing the difficulty of detection. Thus, detecting tiny objects in low-resolution RSIs is still a uniquely challenging task.
In general, we define the type of objects as follows: tiny objects are <16 pixels, small objects are 16~32 pixels, medium objects are 32~96 pixels and large objects are >96 pixels. Much research has been conducted to improve the performance of object detection, which can be roughly divided into traditional methods and deep learning-based methods. Specifically, traditional approaches mostly rely on prior information and handcrafted features to extract and classify regions of interest. Taking ship detection as an example, Gang et al. [15] presented a harbor-based method built on the assumption that the harbor layout is relatively stable. This method uses geographic information template matching technology to complete sea-land segmentation. Xu et al. [16] utilized a special threshold for segmentation, because the gray values and distributions of sea and land are usually different. Geometric features (e.g., aspect ratios and edge contours) and statistical features (e.g., HoG [17], LBP [18] and SIFT [19]) are adopted to represent candidate regions, and classifiers (e.g., SVM [20] and AdaBoost [21]) are exploited to distinguish them.
Nevertheless, most existing methods focus on general object detection (e.g., scale variations and oriented bounding box regression) while ignoring the poor performance and special demands of tiny object detection. Meanwhile, the available remote sensing datasets are not perfectly suitable for tiny object detection in many aspects. As shown in Figure 1b,c, the resolution of most images is much higher (e.g., 0.20 m), and most labeled objects are larger than 32 × 32 pixels, which are defined as medium targets in MS COCO [24].
In this article, we seek to solve the remaining problems of detecting tiny ships in low-resolution RSIs. To this end, we propose a novel tiny ship detection framework called LR-TSDet. The objects are always surrounded by various backgrounds due to their small-scale characteristics, which will affect the feature expression. However, the background information can also provide certain indicative information for target identification. Therefore, we utilize global contextual information to capture the correlation between backgrounds and objects, thereby enhancing responses of objects in the feature maps. We propose the filtered feature aggregation (FFA) module to make use of complex backgrounds, which can be plugged into the feature pyramid network (FPN) [33]. As a self-attention mechanism [34], FFA calculates a similarity matrix to suppress background noise and strengthen features of objects. Secondly, we can only obtain limited information from low-resolution images, because the tiny objects can easily disappear due to the downsampling of the network. Thus, we have designed a hierarchical-atrous spatial pyramid (HASP) module to obtain deep semantic information while avoiding network downsampling. We reconstruct an enhanced atrous convolution layer [35] called hierarchical-atrous convolution block (HACB) using group convolution and hierarchical residual connection [36]. The HASP module aggregates four parallel HACBs, wherein each HACB represents different receptive fields, thereby enriching the semantic information of feature maps. Thirdly, in order to tackle the inconsistency between classification and regression subnets, we propose the IoU-Joint loss to guide the training of the classification network, in which the labels used to mark samples are replaced with an IoU score, inspired by [37]. The IoU is defined as the coincidence quality between the predicted bounding box and the ground-truth box.
In this way, the network would prefer to predict high classification scores for positive samples with high IoU scores, thereby further improving the sample quality and localization accuracy. Furthermore, we have developed a new dataset named GF1-LRSD for the evaluation of tiny ship detection in low-resolution RSIs. It contains 4406 images with a resolution of 16 m and 7172 labeled instances, of which the mean size is 10.9 pixels; Figure 1 displays some samples. Our main contributions are summarized as follows.

• An effective detector, LR-TSDet, is proposed to achieve tiny ship detection in low-resolution RSIs; this detector is equipped with a filtered feature aggregation (FFA) module, a hierarchical-atrous spatial pyramid (HASP) module and the IoU-Joint loss.
• The FFA module is plugged into the FPN, which aims to suppress the interference of redundant background noise and highlight the response of regions of interest by learning global context information.
• The HASP module is designed to extract multi-scale local semantic information through aggregating features with different receptive fields.
• The IoU-Joint loss utilizes the IoU score to jointly optimize the classification and regression subnets, further refining the multi-task training process.
• Extensive experiments on our built datasets, GF1-LRSD and DOTA-Ship, validate the performance of our proposed method, which outperforms other comparison methods by a large margin.
The rest of this article is organized as follows. Section 2 briefly introduces the related works. Section 3 illustrates the proposed tiny ship detector, LR-TSDet, in detail, including the structure of each module, the design of the loss function, etc. In Section 4, we first describe the construction process and statistics of the collected dataset, and then present experimental results and discussions. Finally, conclusions are drawn in Section 5.

Object Detection in Remote Sensing Images
With the application of deep learning-based methods, we have witnessed the rapid development of object detection in remote sensing images in the past few years. Xia et al. [2] introduced a large-scale dataset named DOTA, which has gradually developed into a benchmark to evaluate the performance of various algorithms. Li et al. [3] built a more comprehensive dataset in terms of both object categories and number of images, further promoting the research of remote sensing. Generally, the CNN-based object detection methods can be approximately divided into two categories: anchor-based methods and anchor-free methods. The anchor-based methods [25,26,38-41] utilize preset anchor boxes with different scales and aspect ratios to match and locate objects. YOLT [12] inherited and fine-tuned the YOLO network [40] and partitioned large-scale images into slices for rapid detection. Zhang et al. [42] presented the CAD-Net, which revealed a special relationship between the background and object by capturing global and local contextual information. Furthermore, many studies have been performed to encode rotated features better, such as those on RoI transformer [29], DRBox-v2 [43], GWD Loss [44], Gliding vertex [45], S²A-Net [46], etc. The anchor-free methods [47-50] have been given more attention recently. These methods remove all anchor-related hyperparameters and provide a more concise pipeline for detection. For example, Wei et al. [51] proposed O²-DNet, which encodes oriented objects as pairs of middle lines. Other models [52-54] have also adopted anchor-free strategies.

Tiny Object Detection
Researchers have attempted to alleviate the problem of tiny object detection from all aspects, including data augmentation [55], image pyramids [56] and super-resolution [57]. Yu et al. [58] proposed the Scale Match (SM) strategy, which aligns the scale distribution of a used dataset to be consistent with the pre-training dataset. SCRDet [59] obtained features of small objects by a tailored feature fusion structure. Hu et al. [60] found tiny faces by utilizing the contextual information around objects. In our work, we first exploit the FFA module to highlight useful features of tiny objects by capturing the mutual information between each pixel in the feature map. It can be observed that the background could indicate categories or locations of candidate objects, e.g., ships usually sail in the ocean. Furthermore, we apply improved atrous convolutions with different receptive fields to gather the features and capture deeper semantic information.

Methods
In this section, we first give an overview of the proposed network, LR-TSDet, for tiny ship detection, and show how it works. Next, we detail the design of the filtered feature aggregation (FFA) module for noise suppression and feature enhancement. Then, the hierarchical-atrous spatial pyramid (HASP) module is introduced to acquire larger receptive fields. Finally, we elaborate on the IoU-Joint loss function for high-precision detection. Figure 2 illustrates the details of the proposed LR-TSDet. We adopted the one-stage detector RetinaNet [26] as the baseline, which is a widely used anchor-based detector. Given an input image, we fed it into a backbone network to extract multi-scale features, which can usually take different forms of CNNs from existing detectors, such as ResNet [61], EfficientNet [62], Swin-Transformer [63], etc. Taking ResNet [61] as an example, different residual stages represent hierarchical semantic information. Therefore, we applied the feature pyramid network (FPN) [33] to construct a multi-scale convolutional feature pyramid with a top-down pathway and lateral connections. Finally, each FPN level was followed by a detection head, which included two different branches, named the classification subnet and box regression subnet. These two subnets are small fully convolutional networks (FCN) [64] with four stacked 3 × 3 convolution layers for predicting the probability and location of the object, respectively.

Overview
Different from RetinaNet, we constructed the pyramid from $P_3$ to $P_5$ using $\{C_3, C_4, C_5\}$ in ResNet, where $P_l$ and $C_l$ indicate the pyramid level and residual stage, respectively ($l$ means that the feature map resolution is $2^l$ times lower than the input). ResNet50 pre-trained on ImageNet [22] was used as the backbone network. In our network design, we adopted a filtered feature aggregation (FFA) module in lateral connections to improve the quality of feature maps produced by FPN. In order to capture deeper semantic information better, we presented the hierarchical-atrous spatial pyramid (HASP) module before the detection heads, which uses dilated convolution [35] to obtain multiple receptive fields with different dilation rates while maintaining the spatial resolution of the features.

Figure 2. The network architecture of LR-TSDet. It consists of a backbone network, a feature pyramid network (FPN) [33] and multiple detection heads. The filtered feature aggregation (FFA) module is inserted between the backbone and FPN to enhance the capability of the top-down pathway. The detection head is appended to each FPN level, having a hierarchical-atrous spatial pyramid (HASP) and two subnets for object classification and box regression. (Legend: element-wise addition, matrix multiplication, softmax operator, concatenation.)
During training, the classification and regression losses were calculated by the defined loss function, and we applied the back-propagation algorithm to update network weights. We presented an IoU-Joint loss to evaluate the network classification ability better, which merges the detection confidence and intersection-over-union (IoU) between the predicted result and the ground truth as the class label. For model inference, our LR-TSDet is straightforward. An image is fed and passed through the network to obtain the final results. We employed the non-maximum suppression (NMS) strategy with a threshold of 0.6 for removing redundant detections.
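The NMS post-processing step mentioned above can be illustrated with a minimal greedy sketch in plain Python. This is not the mmdetection implementation; the box layout (xmin, ymin, xmax, ymax) and the function names `iou` and `nms` are assumptions made for illustration, and only the 0.6 threshold comes from the text.

```python
def iou(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thr=0.6):
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already kept box by more than iou_thr."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_thr for k in keep):
            keep.append(i)
    return keep


# Two heavily overlapping detections of one ship plus a separate detection.
boxes = [(10, 10, 20, 20), (11, 10, 21, 20), (40, 40, 50, 50)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # indices of the surviving detections
```

In this toy input the first two boxes have an IoU of about 0.82, so the lower-scoring one is removed.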

Filtered Feature Aggregation (FFA) Module
Convolutional neural networks extract features through locally connected layers and weight sharing while ignoring long-range dependencies. Meanwhile, the feature maps obtained by the backbone often have shortcomings, such as false responses to object-like background regions and ambiguous responses to the objects to be detected. Concretely, the tiny objects in low-resolution RSIs do not always have sufficient discriminative features due to limitations in size, which makes them easy to confuse with backgrounds and other distractors. From the perspective of human vision, we distinguish objects with the help of their surrounding environment, which indicates that global information is helpful for detecting tiny objects.
To address these issues, we introduced the filtered feature aggregation (FFA) module, which helps to suppress background noise and capture global contextual information. As illustrated in Figure 2b, the FFA exploits the non-local block [34] as the main component. Given an input feature map $X \in \mathbb{R}^{H \times W \times C}$, where $C$, $H$ and $W$ denote the channel number, height and width of the feature map, respectively, we first employed a 1 × 1 convolution layer to reduce the channel dimension to 256 (we set the channel C = 256 in all pyramid levels following [26]). Then, we transformed $\hat{X} \in \mathbb{R}^{H \times W \times 256}$ into three different embeddings, marked as query ($Q \in \mathbb{R}^{H \times W \times \hat{C}}$), key ($K \in \mathbb{R}^{H \times W \times \hat{C}}$) and value ($V \in \mathbb{R}^{H \times W \times \hat{C}}$), calculated as below:
$$Q = W_Q \hat{X}, \quad K = W_K \hat{X}, \quad V = W_V \hat{X}$$
where $\hat{C}$ is the channel number of the three embeddings, and $W_Q$, $W_K$ and $W_V$ are weight matrices to be learned and implemented by different 1 × 1 convolution layers. Then, $Q$, $K$ and $V$ were reshaped to size $\hat{C} \times N$, where $N = H \times W$ represents the number of spatial pixels. Next, we computed the similarity matrix $S \in \mathbb{R}^{N \times N}$ of $Q$ and $K$, which represents the relation between each pair of pixels in the feature maps, formulated as follows:
$$S = Q^{\top} \otimes K$$
where $\otimes$ denotes the matrix multiplication. Afterward, we obtained the spatial attention map by applying the softmax function, expressed as:
$$s_{ij} = \frac{\exp(S_{ij})}{\sum_{j=1}^{N} \exp(S_{ij})}$$
where $s_{ij}$ represents the normalized pairwise relationship between positions $i$ and $j$. Thus, we computed the output matrix as follows:
$$O = [s_{ij}]_{N \times N} \otimes V^{\top}$$
where $O \in \mathbb{R}^{N \times \hat{C}}$. Then, we reshaped $O$ to the size $H \times W \times \hat{C}$, and a 1 × 1 convolution layer was employed to recover the initial dimension. Finally, we obtained the filtered feature map $Y$ via a residual connection [61], calculated as follows:
$$Y = \hat{X} + F(\hat{X})$$
where $F(\cdot)$ denotes the aforementioned self-attention mechanism [65]. The FFA module leverages information from all locations to gain a more discriminative feature representation, and we applied it to the top-down pathway, as shown in Figure 3. We replaced a 1 × 1 convolution layer with the FFA module to build a more robust FPN.
The feature map was upsampled by a factor of 2 with bilinear interpolation. In particular, we visualized the feature maps of FPN after adopting the FFA module, as shown in Figure 4c. Figure 4b shows the original feature maps produced by RetinaNet. It can be observed that the false responses of backgrounds are suppressed and the network focuses on the targets more selectively.
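The attention computation described in this subsection can be sketched in NumPy as follows. It is only an illustrative sketch, not the authors' implementation: each 1 × 1 convolution is modelled as a per-pixel matrix product, pixels are stored as rows (N × Ĉ) rather than columns (Ĉ × N), so the similarity matrix becomes `q @ k.T`, and the names `ffa_attention`, `w_q`, `w_k`, `w_v` and `w_out` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def ffa_attention(x, w_q, w_k, w_v, w_out):
    """Non-local self-attention with a residual connection.

    x: (H, W, C) feature map after channel reduction.
    w_q, w_k, w_v: (C, C_hat) weights standing in for the 1x1 convolutions.
    w_out: (C_hat, C) weight that restores the channel dimension.
    """
    h, w, c = x.shape
    flat = x.reshape(h * w, c)                    # N x C, one row per pixel
    q, k, v = flat @ w_q, flat @ w_k, flat @ w_v  # each N x C_hat
    attn = softmax(q @ k.T, axis=1)               # N x N, rows sum to 1
    out = attn @ v                                # N x C_hat
    y = flat + out @ w_out                        # residual connection
    return y.reshape(h, w, c), attn


rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 5, 8))
weights = [rng.standard_normal((8, 4)) * 0.1 for _ in range(3)]
w_back = rng.standard_normal((4, 8)) * 0.1
filtered, attention = ffa_attention(feat, *weights, w_back)
```

Because the attention map is N × N, its memory cost grows quadratically with the number of pixels, which is one reason to keep the embedding width Ĉ small.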

Hierarchical-Atrous Spatial Pyramid (HASP)
Deeper networks usually require larger rates of downsampling to obtain richer semantic information. However, there is a trade-off between the scale of the object and the downsample rate. The tiny object may be lost in the feature map due to the decrease of spatial resolution. To this end, we propose the Hierarchical-Atrous Spatial Pyramid (HASP) module to mitigate this problem. Figure 2c describes the structure details.
An enhanced dilated convolution layer called hierarchical-atrous convolution block (HACB) was introduced for stronger feature extraction capabilities. As shown in Figure 5, we replaced the standard convolution with the group convolution while connecting adjacent groups with residual connections. The HACB can capture deep semantic information in images from different depths, and the outputs of the current group were fed into the next group. Therefore, the equivalent receptive field increased consistently, and the module could integrate richer semantic information. For a given feature map $X \in \mathbb{R}^{C \times H \times W}$, the HACB first splits $X$ into $g$ groups, denoted by $x_i \in \mathbb{R}^{\frac{C}{g} \times H \times W}$, where $i \in \{1, 2, \cdots, g\}$. Except for $x_1$, each $x_i$ is used to produce $y_i$ through a 3 × 3 dilated convolution layer $D_i(\cdot)$ with the same dilation rate $r$ (shown in Figure 6). Specifically, if $i > 2$, the sub-feature $x_i$ is first added to the output $y_{i-1}$ and then fed into $D_i(\cdot)$. Each $D_i(\cdot)$ is followed by a group normalization (GN) layer [66] and a ReLU layer [67]. The implementation can be expressed as follows:
$$y_i = \begin{cases} x_i, & i = 1 \\ D_i(x_i), & i = 2 \\ D_i(x_i + y_{i-1}), & 2 < i \le g \end{cases}$$
Subsequently, all groups were aggregated by the concatenation operation, and the channel shuffle [68] operator was adopted for further information fusion, which can also be replaced with a simple 1 × 1 convolution layer for simplification. Notice that the feature information contained in each $y_i$ is gradually enriched by the hierarchical residual connections. Meanwhile, the use of dilated convolution can retain more details without reducing the spatial resolution of the feature maps.
In the HASP design, we first used a 1 × 1 convolution layer to reduce the channel dimension for less computation. Then, four parallel HACBs with different dilation rates were applied to obtain multiple receptive fields. Next, the four branches were concatenated and passed through a 1 × 1 convolution layer, followed by a GN layer for adjusting the channel dimension. Finally, a skip-connection with an element-wise sum operator was utilized to gather the input and output for better information transmission.
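The hierarchical connection inside one HACB can be sketched in NumPy as below. This is a simplified illustration under stated assumptions, not the authors' code: the GN layer and the channel shuffle are omitted, zero padding is used to keep the spatial size, and `dilated_conv3x3` and `hacb` are hypothetical helper names.

```python
import numpy as np


def dilated_conv3x3(x, w, rate):
    """3x3 dilated convolution with zero padding (output keeps H and W).

    x: (C_in, H, W); w: (C_out, C_in, 3, 3); rate: dilation rate r.
    """
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (rate, rate), (rate, rate)))
    out = np.zeros((w.shape[0], h, wd))
    for ky in range(3):
        for kx in range(3):
            # Shifted view of the input seen by kernel tap (ky, kx).
            patch = xp[:, ky * rate:ky * rate + h, kx * rate:kx * rate + wd]
            out += np.einsum("oc,chw->ohw", w[:, :, ky, kx], patch)
    return out


def hacb(x, weights, rate):
    """Hierarchical-atrous block: g = len(weights) + 1 channel groups."""
    groups = np.split(x, len(weights) + 1, axis=0)
    outputs = [groups[0]]                            # y_1 = x_1 (identity)
    for i, (x_i, w_i) in enumerate(zip(groups[1:], weights)):
        inp = x_i if i == 0 else x_i + outputs[-1]   # add y_{i-1} for i > 2
        # ReLU only; the GN layer of the paper is left out of this sketch.
        outputs.append(np.maximum(dilated_conv3x3(inp, w_i, rate), 0.0))
    return np.concatenate(outputs, axis=0)


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))                 # C = 8, g = 4 groups
ws = [rng.standard_normal((2, 2, 3, 3)) * 0.1 for _ in range(3)]
y = hacb(x, ws, rate=2)
```

Since group i has passed through i − 1 stacked 3 × 3 convolutions of rate r, its receptive field is 1 + 2r(i − 1) pixels wide, which is the consistent growth mentioned in the text.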

Loss Function Design
In line with [25,26], our multi-task training loss function consists of two parts: the classification loss and the regression loss, formulated as follows:
$$L_{total} = \frac{\lambda_1}{N_{pos}} \sum_{i} L_{cls}(p_i, q_i^*) + \frac{\lambda_2}{N_{pos}} \sum_{i,j} \mathbb{1}^*_{i,j} L_{reg}(pb_i, gt_j^*)$$
where $L_{total}$, $L_{cls}$ and $L_{reg}$ denote the total training loss, classification loss and regression loss, respectively. $N_{pos}$ is the number of positive samples, $p_i$ represents the predicted probability value of the $i$-th anchor and $q_i^*$ is the corresponding class "soft-label", which will be explained in the following subsection. $pb_i$ is the $i$-th predicted bounding box and $gt_j^*$ is the $j$-th ground-truth box corresponding to $pb_i$. $\mathbb{1}^*_{i,j}$ is the indicator function, being 1 for foreground and 0 for background, which means only positive samples contribute to the regression loss. The hyper-parameters $\{\lambda_1, \lambda_2\}$ are two balancing weights and are set to $\{1, 1\}$ by default.

IoU-Joint Classification Loss
Most of the existing detectors adopt two independent subnets for classification and regression tasks. These two branches optimize their own loss functions and are almost independent of each other. Before calculating the loss, we defined the positive and negative samples in the same way as most detectors, such as Faster-RCNN and RetinaNet, which are representative two-stage and one-stage detectors, respectively (an IoU > 0.5 marks positive samples and an IoU < 0.4 marks negative samples). This division is coarse, ignoring the impact of IoU changes, where different IoUs indicate different overlaps with ground truths and the different features in use. To alleviate these inconsistencies, we propose the IoU-joint classification loss, which utilizes the IoU calculated by the regression subnet as an auxiliary object index.
To be specific, we replaced the standard one-hot category label with the localization quality (i.e., the IoU score). The label was softened to a continuous variable $q \in [0, 1]$, where $0 < q \le 1$ indicates positive samples by IoU score and $q = 0$ is utilized for negative samples. In this way, each sample was weighted correspondingly, and the weight coefficient was directly correlated with the regression performance. Therefore, the network was guided more properly to suppress suboptimal results and predict detections having both high probability and high localization accuracy. Moreover, the discrete cross-entropy function $-\log(p)$ needed to be extended into a continuous form, written as $-(q \log(p) + (1 - q) \log(1 - p))$. We defined the IoU-joint classification loss as:
$$L_{cls}(p, q) = \begin{cases} -(1 - \log(1 + pq))^{\beta} \left( q \log(p) + (1 - q) \log(1 - p) \right), & q > 0 \\ -p^{\beta} \log(1 - p), & q = 0 \end{cases}$$
where $p$ denotes the predicted probability of the object, and $q$ is the localization quality score (IoU between the predicted box and ground truth). The term $(1 - \log(1 + pq))^{\beta}$ is inherited from Focal Loss [26] as a modulating factor. We used the product of $p$ and $q$ to balance the contributions of samples, and the function $\log$ was used to smooth the decay of $pq$. When $q = 0$, the factor $p^{\beta}$ is adopted to scale the loss. The hyper-parameter $\beta$ was set as 2 by default.
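A scalar sketch of this loss, written from the prose description above, may help to see how the two cases behave. It is a hedged illustration, not the authors' implementation: the piecewise form (modulating factor for q > 0, factor p^β for q = 0) is inferred from the text, the logarithm is assumed to be the natural logarithm, and `iou_joint_loss` and `eps` are hypothetical names.

```python
import math


def iou_joint_loss(p, q, beta=2.0, eps=1e-12):
    """IoU-joint classification loss for a single sample.

    p: predicted probability in (0, 1).
    q: IoU soft label in [0, 1]; q == 0 marks a negative sample.
    Assumption: log is the natural logarithm.
    """
    p = min(max(p, eps), 1.0 - eps)          # keep log() finite
    if q > 0:                                # positive sample
        ce = -(q * math.log(p) + (1.0 - q) * math.log(1.0 - p))
        return (1.0 - math.log(1.0 + p * q)) ** beta * ce
    return -(p ** beta) * math.log(1.0 - p)  # negative sample


# A confident false positive costs far more than a low-scoring one.
fp_confident = iou_joint_loss(0.9, 0.0)
fp_weak = iou_joint_loss(0.1, 0.0)
# A well-localized box (q = 0.9) is penalized when its score is low.
pos_good = iou_joint_loss(0.9, 0.9)
pos_bad = iou_joint_loss(0.1, 0.9)
```

With these inputs the loss is much larger for the confident false positive and for the well-localized but low-scoring positive, which matches the intended coupling of score and localization quality.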

Bounding-Box Regression Loss
Similar to the anchor-based detectors [25,26,38], we needed to parameterize the coordinates of the bounding box, formulated as follows:
$$t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a}$$
$$t_x^* = \frac{x^* - x_a}{w_a}, \quad t_y^* = \frac{y^* - y_a}{h_a}, \quad t_w^* = \log\frac{w^*}{w_a}, \quad t_h^* = \log\frac{h^*}{h_a}$$
where $(x_a, y_a, w_a, h_a)$ represents the two coordinates of the box center, the width and the height of the anchor box. $(x, y, w, h)$ and $(x^*, y^*, w^*, h^*)$ represent the predicted box and ground-truth box, respectively. The width and height of tiny objects in our dataset are generally about 10 pixels, and the smooth $L_1$ loss [38] is sensitive to scale variance, leading to difficulty in convergence. IoU evaluates the quality of the predicted box as a whole unit rather than four independent parameters, showing robustness to scale changes. Thus, we adopted the GIoU loss [69] for the bounding box regression. It is calculated as follows:
$$L_{GIoU} = 1 - IoU + \frac{A_c - U}{A_c}$$
where $A_c$ denotes the area of the smallest convex shape enclosing both the predicted box $pb$ and the ground-truth box $gt$, and $U$ is the area of the union of $pb$ and $gt$.
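The anchor-relative parameterization and the GIoU loss can be sketched in plain Python as follows. This is an illustrative sketch: the encoding follows the standard form used by the cited anchor-based detectors, the enclosing region is taken as the smallest axis-aligned box, and the function names `encode` and `giou_loss` are assumptions.

```python
import math


def encode(box, anchor):
    """Offsets of a (cx, cy, w, h) box relative to a (cx, cy, w, h) anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


def giou_loss(pb, gt):
    """GIoU loss for two corner boxes (xmin, ymin, xmax, ymax)."""
    area_p = (pb[2] - pb[0]) * (pb[3] - pb[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iw = max(0.0, min(pb[2], gt[2]) - max(pb[0], gt[0]))
    ih = max(0.0, min(pb[3], gt[3]) - max(pb[1], gt[1]))
    inter = iw * ih
    union = area_p + area_g - inter
    # Smallest axis-aligned box enclosing both inputs (area A_c).
    enclose = ((max(pb[2], gt[2]) - min(pb[0], gt[0]))
               * (max(pb[3], gt[3]) - min(pb[1], gt[1])))
    giou = inter / union - (enclose - union) / enclose
    return 1.0 - giou


perfect = giou_loss((0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0))
disjoint = giou_loss((0.0, 0.0, 10.0, 10.0), (20.0, 0.0, 30.0, 10.0))
offsets = encode((12.0, 9.0, 10.0, 10.0), (10.0, 10.0, 10.0, 10.0))
```

Unlike a plain IoU loss, the GIoU loss still varies for disjoint boxes (here it exceeds 1), so a 10-pixel box that misses its target entirely still receives a useful gradient.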

Experiments
In this section, we conduct different experiments to investigate the effectiveness of the proposed LR-TSDet. First, we introduce the datasets used in experiments. For the special demand of detecting tiny objects in low-resolution RSIs, we build a novel dataset called GF1-LRSD. The construction process and statistics of GF1-LRSD are also described. Furthermore, we build the DOTA-Ship dataset from DOTA-v1.5 [2] for further evaluation. Then, the implementation details and evaluation metrics are presented. Next, we conduct sufficient ablation experiments to evaluate the proposed modules. Finally, we compare the proposed method with other state-of-the-art (SOTA) methods and achieve the best performance. The implementation of this study will be publicly available after the article is accepted and the check procedure is completed.

GF1-LRSD
Object detection in remote sensing images has made great progress with the help of open-source aerial images datasets, such as HRSC2016 [28], DOTA [2], DIOR [3], etc. Nonetheless, the image resolution in these datasets tends to be very high (e.g., 0.20 m, 1.07 m), and objects are always multi-scale. These characteristics are more suitable for evaluating general detection tasks rather than tiny object detection. There is still a lack of a reliable dataset that can meet the practical migration application. To this end, we built the GF1-LRSD dataset to promote the research of the problem.

Raw Data Acquisition and Preprocessing
Gaofen-1 (GF-1) is an optical remote sensing satellite equipped with four 16 m resolution multispectral cameras which can obtain rich remote sensing images. Meanwhile, its complex imaging environment increases the difficulty of detection compared to other data. In order to build a sufficiently effective dataset, we collected a total of 145 wide-field-of-view (WFV) scenes of 1A level with a resolution of 16 m to filter the needed targets. The images with 12,000 × 12,000 pixels are 16-bit and have four bands (the extra one is near-infrared), which makes them difficult to feed directly to the network. Figure 7 shows the detailed data processing flow. We converted the 16-bit data into 8-bit and cropped the large-scale images into a set of slices with the size of 512 × 512. Different from the regular sliding window mechanism, we directly cut the image without overlap for efficiency. As a result, nearly 83,520 sub-images were obtained. To enhance the contrast of images, we used the 2% truncated linear stretch method for quantification, calculated as follows:
$$R_{x,y,c} = 255 \times \frac{I_{x,y,c} - T_{down}}{T_{up} - T_{down}}$$
where $I_{x,y,c}$ and $R_{x,y,c}$ denote the pixel value at $(x, y)$ in the $c$-th band of the input and output image, respectively. $R_{x,y,c}$ is finally limited to 0∼255 to meet the standard format. $T_{up}$ and $T_{down}$ are the truncated upper and lower thresholds.
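The preprocessing described above can be sketched in NumPy as follows. This is a hedged illustration rather than the authors' pipeline: the thresholds are assumed to be the 2nd and 98th percentiles of each band, border tiles are assumed to be padded to 512 × 512, and `truncated_linear_stretch` and `tile_count` are hypothetical names.

```python
import math

import numpy as np


def truncated_linear_stretch(band, percent=2.0):
    """Convert one band to 8-bit after truncating both tails.

    Assumption: T_down and T_up are the `percent` and `100 - percent`
    percentiles of the band.
    """
    t_down, t_up = np.percentile(band, [percent, 100.0 - percent])
    scale = max(float(t_up - t_down), 1e-12)   # guard against flat bands
    stretched = (band.astype(np.float64) - t_down) / scale * 255.0
    return np.clip(stretched, 0, 255).astype(np.uint8)


def tile_count(size=12000, tile=512, scenes=145):
    """Number of non-overlapping tiles when border tiles are padded."""
    per_side = math.ceil(size / tile)
    return per_side * per_side * scenes


band16 = np.arange(1000, dtype=np.uint16).reshape(25, 40)
band8 = truncated_linear_stretch(band16)
n_tiles = tile_count()
```

Under the padding assumption, 24 × 24 tiles per scene times 145 scenes gives 83,520, which is consistent with the number of sub-images reported in the text.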

Image Annotation
We kept the data organization the same as PASCAL VOC [23] for convenience, wherein (xmin, ymin, xmax, ymax) is used to describe the labeled bounding box. Let (xmin, ymin) and (xmax, ymax) denote the coordinates of the top-left and bottom-right corners of the bounding box, respectively. The toolbox LabelImg [70] was used to finish the annotation, and we used the horizontal rectangular box to locate the objects.
After identification and correction by experts, we collected, in total, 4406 images and 7172 instances labeled as ship. For dataset splits, 3/5, 1/5 and 1/5 of the images were used to form the training set, validation set and test set, respectively. Some samples are shown in Figure 1a.

Dataset Statistics
In this subsection, the statistical characteristics of the proposed GF1-LRSD are analyzed and compared with other representative datasets. Specifically, we define the absolute size $S_a(\cdot)$ and relative size $S_r(\cdot)$ to describe the scales of instances, which can be formulated as follows [58]:
$$S_a(B_{ij}) = \sqrt{w_{ij} \times h_{ij}}, \quad S_r(B_{ij}) = \sqrt{\frac{w_{ij} \times h_{ij}}{W_i \times H_i}}$$
where $B_{ij}$ represents the $j$-th instance's bounding box of the $i$-th image $I_i$ in the dataset, and $w_{ij}$, $h_{ij}$ are the width and height of $B_{ij}$. $W_i$, $H_i$ denote the width and height of $I_i$, respectively. The mean and standard deviation of the instance size for different datasets are shown in Table 1. The absolute size of 10.9 ± 3.0 pixels in GF1-LRSD is much smaller than in the other datasets. As shown in Figure 8c, most objects in GF1-LRSD are smaller than 16 pixels, accounting for about 94% of the objects, while more than 50% of the objects in other datasets have scales greater than 16 pixels, such as 79% of the objects in DIOR. Figure 8a,b further describe the main characteristics of GF1-LRSD. The width and height of the objects are mostly smaller than 25 and 30 pixels, respectively. The top 3 sizes are 9, 10 and 11 pixels.
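The two size measures can be computed with a few lines of plain Python. The sketch below assumes the definitions of [58] (square root of the box area, optionally normalized by the image area); the toy box list, the use of the population standard deviation and the function names are assumptions for illustration only.

```python
import math
from statistics import mean, pstdev


def absolute_size(w, h):
    """Absolute size of a box: square root of its area, in pixels."""
    return math.sqrt(w * h)


def relative_size(w, h, img_w, img_h):
    """Relative size: square root of the box-to-image area ratio."""
    return math.sqrt((w * h) / (img_w * img_h))


# Hypothetical (width, height) pairs of ship boxes in 512 x 512 slices.
toy_boxes = [(9, 12), (10, 10), (16, 9)]
abs_sizes = [absolute_size(w, h) for w, h in toy_boxes]
rel_sizes = [relative_size(w, h, 512, 512) for w, h in toy_boxes]
summary = (mean(abs_sizes), pstdev(abs_sizes))  # (mean, std) as in Table 1
```

A 10 × 10 pixel ship therefore has an absolute size of 10 pixels and a relative size of about 0.02 in a 512 × 512 slice.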

DOTA-Ship
DOTA [2] is a large-scale dataset for object detection in remote sensing images. DOTA-v1.5 contains 2806 images and 403,318 instances of 16 object categories. It is an updated version of DOTA-v1.0, where the tiny instances (less than 10 pixels) are additionally annotated. To evaluate our LR-TSDet more accurately, we selected the objects labeled as ship and built a new dataset, named DOTA-Ship, which includes 573 images and 43,738 instances in total. DOTA-Ship was divided into training and test sets, consisting of 435 and 138 images, respectively. During training, we cropped the original images into 800 × 800 patches with an overlap of 200 pixels and subsequently ignored the sub-images that do not contain targets.
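The cropping procedure for DOTA-Ship can be sketched in plain Python as follows. This is an illustrative sketch under assumptions that the text does not specify: the last window is shifted back so that it ends flush with the image border, a patch is kept only if at least one box lies fully inside it, and the names `window_starts` and `patches_with_targets` are hypothetical.

```python
def window_starts(length, patch=800, overlap=200):
    """Start offsets of sliding windows that cover [0, length)."""
    stride = patch - overlap
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch, stride))
    starts.append(length - patch)   # final window flush with the border
    return starts


def patches_with_targets(img_w, img_h, boxes, patch=800, overlap=200):
    """Return (x0, y0, boxes_inside) for patches containing a full box.

    boxes: list of (xmin, ymin, xmax, ymax) in image coordinates.
    """
    kept = []
    for y0 in window_starts(img_h, patch, overlap):
        for x0 in window_starts(img_w, patch, overlap):
            inside = [b for b in boxes
                      if b[0] >= x0 and b[1] >= y0
                      and b[2] <= x0 + patch and b[3] <= y0 + patch]
            if inside:              # sub-images without targets are dropped
                kept.append((x0, y0, inside))
    return kept


starts = window_starts(2000)
kept = patches_with_targets(2000, 800, [(650, 100, 660, 110)])
```

With a patch size of 800 and an overlap of 200 the stride is 600 pixels, so a ship near a patch border usually appears in two neighbouring patches, as in the toy example.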

Implementation Details
We implemented LR-TSDet based on mmdetection [71], and the pre-trained ResNet-50 was adopted as the backbone network for all experiments. The models were trained for 100 epochs using the Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of 0.005. The learning rate was divided by a factor of 10 at the 70th and 90th epochs. The momentum and weight decay were set as 0.9 and $1 \times 10^{-4}$, respectively. We applied the linear warm-up strategy for the first 500 iterations with a ratio of 0.001 to stabilize the training process. The batch size was set as 8 on 2 RTX 2080Ti GPUs (4 images per GPU) and random image flipping was adopted for data augmentation. The other hyper-parameters were the same as the mmdetection defaults unless specified.
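The learning rate schedule described above can be written as a small standalone function. This is a sketch of the common step schedule with linear warm-up, not code taken from mmdetection; the 0-based epoch indexing and the name `learning_rate` are assumptions.

```python
def learning_rate(epoch, iteration, base_lr=0.005, milestones=(70, 90),
                  gamma=0.1, warmup_iters=500, warmup_ratio=0.001):
    """Step decay with linear warm-up.

    epoch: 0-based epoch index (assumption); the rate drops by `gamma`
    once `epoch` reaches each milestone.
    iteration: global iteration count since the start of training.
    """
    lr = base_lr * gamma ** sum(epoch >= m for m in milestones)
    if iteration < warmup_iters:
        # Ramp linearly from warmup_ratio * lr up to lr.
        k = (1 - iteration / warmup_iters) * (1 - warmup_ratio)
        lr *= 1 - k
    return lr


lr_start = learning_rate(0, 0)          # first iteration of warm-up
lr_main = learning_rate(0, 500)         # warm-up finished
lr_late = learning_rate(70, 10 ** 6)    # after the first decay
lr_final = learning_rate(95, 10 ** 6)   # after the second decay
```

The rate therefore starts at 0.001 × 0.005, reaches 0.005 after 500 iterations, and falls to 0.0005 and then 0.00005 after the two milestones.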

Evaluation Metrics
In all experiments, the average precision (AP) was adopted to evaluate the model performance. We followed the setup proposed in the VOC2010 challenge [23] using an IoU threshold of 0.5. The AP is formulated as follows:
$$AP = \int_0^1 p(r)\, dr$$
where $r$ denotes the recall, and $p(r)$ is the precision at recall $r$ on the precision-recall (PR) curve. The AP was calculated as the area under the PR curve. In general, we defined the recall and precision as follows:
$$Recall = \frac{TP}{TP + FN}, \quad Precision = \frac{TP}{TP + FP}$$
where TP, FP and FN represent the numbers of true positive, false positive and false negative samples. For a detection result, if the IoU between it and the ground truth was greater than the set threshold, then it was defined as TP; otherwise, it was defined as FP. If a ground truth did not have a matched predicted result, it was defined as FN.
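The AP computation can be illustrated with a short plain-Python sketch. It assumes that the detections have already been matched to ground truths at IoU 0.5 and sorted by descending score, and it uses the all-point area under the monotone precision envelope, which is the usual reading of the VOC2010 protocol; `average_precision` is a hypothetical helper name.

```python
def average_precision(tp_flags, num_gt):
    """Area under the precision-recall curve.

    tp_flags: 1 for a true positive, 0 for a false positive, one entry per
              detection, sorted by descending confidence.
    num_gt:   number of ground-truth objects (TP + FN).
    """
    tp = fp = 0
    recalls, precisions = [], []
    for flag in tp_flags:
        tp += flag
        fp += 1 - flag
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Add sentinels, then make precision non-increasing in recall.
    mrec = [0.0] + recalls + [1.0]
    mpre = [0.0] + precisions + [0.0]
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Sum the rectangles where recall changes.
    return sum((mrec[i] - mrec[i - 1]) * mpre[i] for i in range(1, len(mrec)))


ap = average_precision([1, 0, 1], num_gt=2)
```

In this toy case the first and third detections are correct, so recall reaches 0.5 at precision 1 and 1.0 at precision 2/3, giving an AP of 0.5 × 1 + 0.5 × 2/3.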

Ablation Studies
In this subsection, we conduct several ablation experiments to explore the effectiveness of the proposed method. We adopt ResNet50 as the backbone network, and the FLOPs computation tool in mmdetection is utilized to analyze model complexity.

RetinaNet as Baseline
Before applying RetinaNet [26] as the baseline, we first modified the default model (called RetinaNet-D) to remove unnecessary operations. In the RetinaNet-D design, {P6, P7} in FPN are obtained by strided convolutions. These levels are intended to improve large object detection, but they may lose tiny object information and generate many unmatched negative samples that adversely affect network training; our experiments confirmed this. We removed the last two stages, {P6, P7}, and added a GN [66] layer in each detection head; we name the resulting model RetinaNet-B. The GN layer normalizes the data distribution by dividing the channels into groups and computing the mean and variance within each group. GN is a useful trick [47], and we applied it to stabilize the training process. The results are presented in Table 2. Compared with RetinaNet-D, our RetinaNet-B produced a gain of +0.75 AP with lower computational complexity (51.57 G vs. 52.28 G), and the model size was reduced from 36.1 M to 30.8 M, indicating that RetinaNet-B is more suitable as a baseline.

In order to investigate the contribution of each design element, we conducted a series of experiments with different combinations of modules. The quantitative results are reported in Table 3. We adopted RetinaNet-B (described above) as the baseline, and it achieved an AP of 79.75, as shown in Experiment #1 in Table 3.

Table 3. Ablation studies on LR-TSDet. We adopted RetinaNet-B as the baseline and applied each module gradually to evaluate the effectiveness.

FFA utilizes non-local blocks [34] to capture long-range dependencies, which helps to obtain global contextual information. Meanwhile, as a kind of attention mechanism, FFA can enhance the feature expression of targets and make it more discriminative.

• Influence of HASP. The purpose of designing HASP was to obtain more abundant semantic information while maintaining the resolution of feature maps. To verify its impact, we added HASP on the basis of Experiment #2, and the result is shown in Experiment #3 in Table 3.

• Effect of the loss function. As analyzed above, our network is optimized by a multitask loss, including the classification loss and the regression loss. The default setting was focal loss [26] and smooth L1 loss [38] in our experiments. As can be seen in Experiment {#3, #4, #5, #6} in
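The group normalization step added to the RetinaNet-B detection heads can be illustrated with a minimal sketch. It is not the mmdetection or PyTorch implementation: it handles a single sample, represents each channel as a flat list of values, and omits the learnable scale and shift that GN layers normally carry. The function name `group_norm`, the `eps` value and the toy input are assumptions made for illustration only.

```python
from statistics import fmean, pvariance


def group_norm(channels, num_groups, eps=1e-5):
    """Normalize one sample given as C channels, each a flat list of H*W values.

    Channels are split into `num_groups` contiguous groups, and every group is
    normalized with its own mean and variance (no learnable affine parameters).
    """
    num_channels = len(channels)
    if num_channels % num_groups != 0:
        raise ValueError("number of channels must be divisible by num_groups")
    group_size = num_channels // num_groups

    normalized = []
    for g in range(num_groups):
        group = channels[g * group_size:(g + 1) * group_size]
        values = [v for channel in group for v in channel]
        mean = fmean(values)
        var = pvariance(values, mu=mean)
        scale = (var + eps) ** -0.5
        normalized.extend([[(v - mean) * scale for v in channel] for channel in group])
    return normalized


# Hypothetical toy input: 4 channels with 2 values each, split into 2 groups.
sample = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
normalized = group_norm(sample, num_groups=2)
```

Because the statistics are computed per sample and per group, the result does not depend on the batch size, which is one commonly cited reason for using GN in detection heads.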

Evaluation of HASP
In this subsection, we study the choice of the dilation rates and the utility of the HACB architecture in HASP. It can be observed from Table 4 that as the dilation rate increases, the performance first increases and then decreases. We conjecture that excessive rates would incur the "gridding problem" [72], where useful local information may be lost; thus we choose {2, 4, 6, 8} as the final parameters. Moreover, two sets of controlled experiments, {#2, #3} and {#4, #5} in Table 4, demonstrate the superiority of the HACB, which adopts a hierarchical residual connection structure, over the standard atrous convolution: the HACB brings gains of +0.60 and +0.56 AP under the two settings of dilation rates, respectively.

To verify the performance of our proposed LR-TSDet, we compared it with other methods on the GF1-LRSD dataset, including two-stage detectors (e.g., Faster-RCNN [25] and SCRDet [59]), one-stage detectors (e.g., YOLOv3 [40], SSD [39] and R3Det [30]) and anchor-free detectors (e.g., FCOS [47] and ATSS [73]). SCRDet and R3Det are two typical methods for detecting tiny objects in remote sensing images. It should be noted that we kept all training settings the same, except the network backbone. As observed in Table 5, we achieved the best performance of 83.87 AP with a competitive model size and computation complexity. For example, our LR-TSDet outperformed Faster-RCNN by a large margin (+23.38 AP) with fewer FLOPs (54.67 G vs. 63.25 G) and parameters (32.53 M vs. 41.12 M), and it surpassed FCOS by 9.71 AP with a slight increase in FLOPs and parameters.

Qualitative detection results of LR-TSDet on GF1-LRSD are presented in Figure 9. The data were collected from real satellite imaging scenes, including the occlusion and interference of clouds, and the presence of vast land backgrounds. According to the detection results, our method works well under these different conditions, which indicates its robustness.
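To make the dilation-rate trade-off from the HASP evaluation more concrete, the following sketch computes how far a single atrous kernel reaches and how sparsely it samples that area. It assumes a 3 × 3 kernel, which this section does not state, and the helper names are hypothetical. The sparse sampling it reports is only a rough proxy for the gridding effect, not a reproduction of the analysis in [72].

```python
def effective_kernel_size(kernel_size, dilation):
    """Side length of the area covered by a dilated (atrous) kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def sampling_density(kernel_size, dilation):
    """Fraction of positions inside the covered area that the kernel actually reads."""
    span = effective_kernel_size(kernel_size, dilation)
    return kernel_size ** 2 / span ** 2


# Assumed 3 x 3 kernel with the dilation rates chosen for HASP.
for rate in (2, 4, 6, 8):
    span = effective_kernel_size(3, rate)
    density = sampling_density(3, rate)
    print(f"rate={rate}: covers {span}x{span}, samples {density:.1%} of positions")
```

Under this assumption the covered area grows from 5 × 5 at rate 2 to 17 × 17 at rate 8 while the kernel still reads only nine positions, which is consistent with the conjecture that overly large rates discard local detail that matters for objects of roughly ten pixels.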
Figure 10 displays the P-R curves of the different approaches. LR-TSDet is shown to locate objects more accurately and with higher confidence. Furthermore, we evaluated the performance of our LR-TSDet under different scenarios, comparing it with RetinaNet. Experiments were conducted for offshore and inshore scenes. The results are shown in Table 6. It can be seen that our method improved on RetinaNet in both scenes. Specifically, LR-TSDet improved the precision rate by 1.11 and the recall rate by 4.23 in inshore backgrounds, which indicates fewer false alarms and more correct predictions. In addition, it achieved 86.43

To further validate our proposed method, we also conducted experiments on the DOTA-Ship dataset. The models were trained for 48 epochs in total. The results are shown in Table 7. It can be observed that our LR-TSDet achieved an AP of 82.56 and performed better than the other competitors. For example, our method, which is designed specifically for tiny ship detection (e.g., the FFA module for global contextual information and the HASP module for deeper semantic information), improved on the baseline RetinaNet [26] by 6.98 AP. Some detection results are visualized in Figure 11.
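The precision, recall and AP values quoted in this section can be related to each other with a short sketch. It assumes that detections have already been matched to ground-truth boxes, so each detection is simply flagged as a true or false positive, and it uses all-point interpolation of the P-R curve; the exact evaluation protocol is not given in this section, so both choices, as well as the function names and the toy detections, are assumptions.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(detections, num_gt):
    """Area under the P-R curve using all-point interpolation.

    detections: list of (confidence, is_true_positive) pairs.
    num_gt: number of ground-truth objects.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions, recalls = [], []
    tp = fp = 0
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)

    # Replace each precision by the best precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    ap, prev_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Hypothetical toy case: four detections, four ground-truth ships, one ship missed.
toy_detections = [(0.9, True), (0.8, False), (0.7, True), (0.6, True)]
toy_ap = average_precision(toy_detections, num_gt=4)
```

A detector whose true positives carry higher confidence than its false alarms keeps precision high as recall grows, which is how a P-R curve reflects both localization quality and confidence ranking.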

Conclusions
In this article, we proposed an effective network architecture called LR-TSDet for improving the performance of tiny ship detection in low-resolution images. LR-TSDet includes three main components: the FFA module, the HASP module, and the IoU-Joint loss. Specifically, the FFA module was adopted to filter the background noise by capturing long-range dependencies in feature maps, in order to build a more robust FPN for detecting tiny objects. The HASP module was presented to obtain richer semantic information while maintaining the resolution of feature maps by aggregating four parallel HACBs, which is conducive to distinguishing tiny objects from the background. The IoU-Joint loss utilized the IoU score to alleviate the inconsistency between the classification and regression branches, and consequently improved the localization accuracy. To assess the feasibility of the proposed method, we constructed a dataset for low-resolution tiny ship detection in remote sensing images, called GF1-LRSD, in which the resolution (16 m) of images and the average size (10.9 ± 3.0 pixels) of instances are much smaller than those of available datasets. Comprehensive experiments on the GF1-LRSD and DOTA-Ship datasets demonstrated the efficacy of our LR-TSDet, which outperformed the other approaches compared.
Author Contributions: J.W. and Z.P. generated original ideas. J.W. designed and implemented the algorithm. Z.P., B.L. and Y.H. provided the experimental data and supervised the research. J.W. and Z.P. processed and analyzed the experimental results. The original draft was written by J.W. and reviewed by all authors. All authors have read and agreed to the published version of the manuscript.