4.1. Datasets and Evaluation Metric
To comprehensively validate the effectiveness of our proposed method, we conducted experiments on three widely used tiny object detection datasets for remote sensing: VisDrone2019 [23], AI-TOD [59], and AI-TODv2 [60]. In addition, we validated our method on a recent tiny object detection benchmark dataset (SODA-D [13]) and a general object detection dataset (MS COCO 2017 [12]). The fundamental details of the datasets employed in the experiments are provided in Table 1.
VisDrone2019 Dataset: The VisDrone2019 dataset [23] is a widely used benchmark for tiny object detection. It comprises 10,209 static images and 261,908 video frames extracted from 288 video clips captured by Unmanned Aerial Vehicles (UAVs) under varying scales, angles, and lighting conditions.
AI-TOD and AI-TODv2 Datasets: The AI-TOD dataset [59] is a large-scale benchmark for tiny object detection using aerial images from remote sensing satellites. It includes 700,621 annotated objects across eight categories sourced from five public datasets, with an average object size of only 12.8 pixels. The updated AI-TODv2 dataset [60] retains the same number of images but contains additional object instances with smaller sizes and more precise annotations.
SODA-D Dataset: The SODA-D dataset [13] is a recently published benchmark for tiny object detection containing 278,433 high-quality instances collected from the MVD dataset [61] and various online sources. This dataset focuses on detecting tiny objects in diverse scenes, including city roads, highways, and rural areas.
MS COCO 2017 Dataset: The MS COCO 2017 dataset [12] is a widely used dataset for general object detection comprising over 118,000 images and 860,000 object instances across 80 categories. It serves as the official benchmark dataset for the MS COCO Detection Challenge.
Evaluation Metric: Following previous research, we chose the most commonly used COCO metric [12] to evaluate the detection performance of our method. This metric is built on two basic values, namely, Precision (P) and Recall (R), respectively defined in Equations (7) and (8):

$$ P = \frac{TP}{TP + FP}, \tag{7} $$

$$ R = \frac{TP}{TP + FN}, \tag{8} $$

where TP (True Positive) is the number of positive output samples with correct predictions, FP (False Positive) is the number of positive output samples with incorrect predictions, and FN (False Negative) is the number of negative output samples with incorrect predictions, i.e., missed positives. Based on Equations (7) and (8), we can obtain the definitions of the Average Precision (AP) and Mean Average Precision (mAP), as shown in Equations (9) and (10):

$$ AP = \int_{0}^{1} P(R)\, \mathrm{d}R, \tag{9} $$

$$ mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i, \tag{10} $$

where AP represents the area under the Precision–Recall (P-R) curve, which ranges from zero to one. The mAP is the average value of the AP over all categories, with N denoting the number of categories in the whole dataset.
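For concreteness, the following minimal NumPy sketch illustrates these definitions; it uses an all-point integration of the P-R curve rather than the 101-point interpolation used by the official COCO toolkit, and the variable names are illustrative only.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision and Recall from raw counts, as in Equations (7) and (8)."""
    p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return p, r

def average_precision(scores, matched, num_gt):
    """Area under the P-R curve for one category (Equation (9)).

    scores: detection confidences; matched: 1 if the detection matches a
    ground-truth box at the chosen IoU threshold, else 0; num_gt: number of
    ground-truth boxes of this category.
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    matched = np.asarray(matched, dtype=float)[order]
    tp = np.cumsum(matched)
    fp = np.cumsum(1.0 - matched)
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-12)
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_recall)  # accumulate area under the P-R curve
        prev_recall = r
    return ap

# mAP (Equation (10)): mean of the per-category AP values
per_class_ap = [average_precision([0.9, 0.8, 0.6], [1, 0, 1], num_gt=2)]
print(sum(per_class_ap) / len(per_class_ap))
```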
For each detector, we report AP, AP50, AP75, APvt, APt, APs, APm, and APl. Among these, AP denotes the mean average precision averaged over IoU thresholds from 0.5 to 0.95 at intervals of 0.05, AP50 and AP75 respectively denote the mean average precision at IoU thresholds of 0.5 and 0.75, and APvt, APt, APs, APm, and APl represent the average precision values for objects in the very tiny, tiny, small, medium, and large size categories, respectively.
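In practice, these values are usually produced by the COCO evaluation toolkit (pycocotools; the AI-TOD benchmark ships a variant that adds the very tiny and tiny size splits). A minimal usage sketch, assuming the ground truth and the detections have been exported to COCO-format JSON files at the placeholder paths below:

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths; replace with the actual annotation and result files.
coco_gt = COCO('annotations/instances_test.json')     # ground-truth boxes
coco_dt = coco_gt.loadRes('results/detections.json')  # detections to score

evaluator = COCOeval(coco_gt, coco_dt, iouType='bbox')
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP, AP50, AP75 and the per-size AP values
```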
4.2. Implementation Details
We implemented all of our experiments using the common MMDetection framework [62] and PyTorch [63]. For all models, we utilized ResNet-50 [64] pretrained on ImageNet [65] as the backbone. For training, we used Stochastic Gradient Descent (SGD) as the optimizer, with a momentum of 0.9 and a weight decay of 0.0001. The batch size and number of workers were set to 2 and 4, respectively. The initial learning rate was set to 0.005 for Faster R-CNN, Cascade R-CNN, and DetectoRS and to 0.001 for RetinaNet and FCOS.
For the VisDrone2019, SODA-D, and MS COCO 2017 datasets, in line with [13], the number of training epochs was set to 12, with the learning rate decaying by a factor of 0.1 at epochs 8 and 11. For the AI-TOD and AI-TODv2 datasets, we followed the same configuration as in [15,40] to facilitate comparison with state-of-the-art methods: we trained the models for 24 epochs, applying a learning rate decay of 0.1 at epochs 16 and 22. The RPN proposal counts for Faster R-CNN, Cascade R-CNN, and DetectoRS were all set to 3000 for both the training and testing phases. All other configurations, including the data preprocessor and anchor generator, were kept consistent with the default settings in MMDetection [62]. We conducted all of the following experiments on a computer with an Intel Xeon Gold 6326 CPU and an NVIDIA RTX A6000 GPU, using MMDetection 3.1.0, PyTorch 2.0.1, Python 3.9.19, and CUDA 12.2.
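The settings above can be summarized as the following MMDetection 3.x configuration fragment. This is a sketch rather than our exact config file: the base config path is a placeholder, and only the fields mentioned in this section are shown.

```python
# Sketch of an MMDetection 3.x config fragment reflecting the settings above.
_base_ = ['./faster-rcnn_r50_fpn.py']  # placeholder base config for the chosen detector

train_dataloader = dict(batch_size=2, num_workers=4)

optim_wrapper = dict(
    optimizer=dict(type='SGD', lr=0.005, momentum=0.9, weight_decay=0.0001))

# 12-epoch schedule (VisDrone2019 / SODA-D / MS COCO) with lr x0.1 at epochs 8 and 11;
# for AI-TOD / AI-TODv2, use max_epochs=24 and milestones=[16, 22] instead.
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=12, val_interval=1)
param_scheduler = [
    dict(type='MultiStepLR', begin=0, end=12, by_epoch=True,
         milestones=[8, 11], gamma=0.1),
]

# 3000 RPN proposals in both training and testing (two-stage detectors only).
model = dict(
    train_cfg=dict(rpn_proposal=dict(nms_pre=3000, max_per_img=3000)),
    test_cfg=dict(rpn=dict(nms_pre=3000, max_per_img=3000)))
```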
4.3. Results
We conducted a series of experiments on four tiny object detection datasets (VisDrone2019, AI-TOD, AI-TODv2, and SODA-D) and a general object detection dataset (MS COCO 2017). The results of the compared methods reported in the tables are taken from the corresponding papers; thus, some methods only have results on certain datasets.
Results on VisDrone2019 Dataset: The VisDrone2019 dataset is a challenging benchmark with significant scale variation, containing both tiny and general-size objects. We compared the detection performance of several commonly used object detectors before and after incorporating our proposed Joint Optimization Loss (JOL), as shown in Table 2. The first six rows display the baseline results, while the last five rows illustrate the results achieved with our method. For the one-stage RetinaNet and FCOS detectors, our method yields AP improvements of 6.3 and 5.8 points, respectively. For the two-stage Faster R-CNN, Cascade R-CNN, and DetectoRS detectors, our method yields AP improvements of 4.6, 5.3, and 3.6 points, respectively. Compared with the best-performing label assignment strategy, SimD, our method achieves a 0.6-point improvement in AP, as shown in the sixth row of Table 2. The performance improvements are particularly notable for tiny and very tiny objects. For example, the baseline Faster R-CNN achieves detection accuracy of only 0.1 (APvt) and 6.2 (APt) points, and applying JOL increases these values by 8.1 and 16.6 points, respectively, underscoring the effectiveness of our method for detecting tiny objects. Typical visual comparisons of detection performance on the VisDrone2019 dataset are shown in Figure 4, where the improvements brought about by our method are obvious.
Results on AI-TOD Dataset: We conducted a comprehensive comparison of the detection performance of our method against various mainstream object detectors on the AI-TOD dataset, as shown in Table 3. The first four rows present the performance of the two-stage anchor-based detectors, rows 5–7 display the results from the one-stage anchor-based detectors, and rows 8–10 report the results for the anchor-free detectors. Notably, we also include state-of-the-art detection results from recently published research in rows 11–15. The final six rows illustrate the detection performance of our method combined with baseline and state-of-the-art detectors. Compared with the base detectors, including RetinaNet, FCOS, Faster R-CNN, Cascade R-CNN, and DetectoRS, our method achieves AP improvements of 6.4, 5.2, 13.6, 11.7, and 12.4 points, respectively, demonstrating its adaptability across various detectors. Additionally, when combined with an existing state-of-the-art method on the AI-TOD dataset, our method yields a further improvement of 1.7 points in terms of AP. Visual comparisons of the detection results and Precision–Recall (P-R) curves before and after applying our method are depicted in Figure 5 and Figure 6, respectively, highlighting the significant enhancement in performance.
Results on Other Related Datasets: The detection performance on the AI-TODv2 dataset is presented in Table 4, where we also report results from several widely used detectors of the two-stage anchor-based, one-stage anchor-based, and anchor-free varieties. Compared with the baseline RetinaNet, FCOS, Faster R-CNN, Cascade R-CNN, and DetectoRS detectors, our method achieves AP improvements of 7.3, 6.6, 12.2, 10.3, and 10.7 points, respectively. To further demonstrate the effectiveness of our method, we combined it with a state-of-the-art label assignment strategy for tiny object detection, as shown in the eleventh and final rows of Table 4, resulting in an increase in AP from 26.5 to 28.0 points.
Table 5 shows the detection results on the latest SODA-D tiny object detection benchmark. The first eight rows display results from existing high-performance detectors, while the last six rows present results obtained using our method. The improvements achieved by our method are substantial, especially for DetectoRS, where our method outperforms the best alternative by 1.5 AP points. From Figure 7, it is evident that our method successfully detects tiny objects across various categories, demonstrating its strong ability on tiny object detection tasks. To assess the adaptability of our method, we conducted an additional comparison on the MS COCO 2017 dataset, with the results shown in Table 6. Our JOL achieves an AP improvement of 1.1 points, indicating that it is also effective for general object detection tasks.
4.4. Ablation Study
Effectiveness of Regression and Classification Scores. Our proposed Joint Optimization Loss (JOL) incorporates both bounding box regression scores and classification scores to calculate the weight of each training sample. To further validate the effectiveness of our approach, we conducted an ablation study comparing the performance obtained when using only the classification score or only the bounding box regression score, with the results shown in Table 7. The first row displays the baseline results on the AI-TOD dataset with Faster R-CNN as the detector. The second and third rows respectively show the results when using only the bounding box regression score and only the classification score to calculate each sample's weight. The fourth row presents the results with the full implementation of our method. From the results in Table 7, it can be observed that there are respective AP improvements of 12.3 and 12.1 points when considering only the bounding box regression score and only the classification score. By incorporating both types of score simultaneously, we achieve an AP improvement of 13.6 points over the baseline, highlighting the effectiveness of considering both scores. This result underscores the interdependence of bounding box regression and classification, which are inseparable subtasks in object detection.
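To make the idea of jointly weighting samples concrete, the snippet below sketches one possible way to turn a classification score and an IoU-based regression score into per-sample loss weights. The geometric-mean combination, the power-style scaling, and the normalization are illustrative assumptions only; the actual formulation of JOL is the one given in the method section.

```python
import torch

def joint_sample_weights(cls_scores, ious, scale=2.0, eps=1e-6):
    """Illustrative per-sample weights combining classification confidence
    (cls_scores, predicted probability of the assigned class) and a bounding
    box regression score (ious, IoU between predicted and ground-truth boxes).

    The geometric mean and the `scale` exponent are assumptions for
    illustration, not the exact JOL formulation.
    """
    joint = torch.sqrt(cls_scores.clamp(min=eps) * ious.clamp(min=eps))
    weights = joint ** scale
    # Normalize so the re-weighted loss keeps a comparable overall magnitude.
    return weights * (weights.numel() / weights.sum().clamp(min=eps))

# Usage sketch: re-weight the per-sample losses of the positive samples.
cls_scores = torch.tensor([0.9, 0.4, 0.7])
ious = torch.tensor([0.8, 0.3, 0.6])
per_sample_loss = torch.tensor([0.5, 1.2, 0.8])
weighted_loss = (joint_sample_weights(cls_scores, ious) * per_sample_loss).mean()
print(weighted_loss)
```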
Performance with Different Values of the Scaling Parameter. This parameter is used to control the scale of the weights. To determine the optimal value, we conducted an ablation study evaluating the performance of models with different values of the parameter. As shown in Table 8, Faster R-CNN was used as the base detector, with all models trained on the AI-TOD trainval set and tested on the AI-TOD test set. From the results presented in Table 8, it can be observed that the model's accuracy tends to stabilize once the parameter exceeds two. Therefore, we set it equal to two.
Analysis of Inference and Training Cost. In order to comprehensively evaluate our method, we also measured the time and space costs of our model during inference and training and compared them with those of the baseline methods, specifically the GFLOPs and parameter counts during inference and the training speed and model size during training. As shown in Table 9, we used the one-stage RetinaNet and FCOS detectors as well as the two-stage Faster R-CNN detector as the base detectors. The models were trained on the AI-TOD trainval set and tested on the AI-TOD test set. The GFLOPs during inference were calculated for an input image size of 800 × 800. The conditions for measuring the training speed were the same as in the experimental settings described above, using an NVIDIA RTX A6000 GPU and a batch size of 2.
From the comparison between the baseline methods and our proposed method shown in Table 9, it is clear that the GFLOPs and parameter counts during inference, as well as the model size, of our proposed JOL remain identical to those of the baseline methods, with only a slight decrease in training speed. Considering that our method significantly improves the detection accuracy, our proposed JOL shows good overall performance on tiny object detection tasks.
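For reference, GFLOPs and parameter counts of this kind can be obtained with a FLOP-counting utility such as fvcore (MMDetection also provides an analysis tool for this purpose). The sketch below uses a ResNet-50 backbone as a stand-in model, since building a full detector requires its MMDetection config:

```python
import torch
from fvcore.nn import FlopCountAnalysis, parameter_count
from torchvision.models import resnet50

# Stand-in model: measuring a full MMDetection detector would require
# constructing it from its config; a ResNet-50 backbone illustrates the pattern.
model = resnet50().eval()
dummy_input = torch.randn(1, 3, 800, 800)  # matches the 800 x 800 input size

with torch.no_grad():
    flops = FlopCountAnalysis(model, dummy_input).total()
params = parameter_count(model)[""]

print(f"GFLOPs: {flops / 1e9:.1f}, Params (M): {params / 1e6:.1f}")
```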