Article

TranSDet: Toward Effective Transfer Learning for Small-Object Detection

1 School of Artificial Intelligence, China University of Mining and Technology-Beijing, Beijing 100083, China
2 Beijing Key Laboratory of Information Service Engineering, Beijing Union University, Beijing 100101, China
3 College of Robotics, Beijing Union University, Beijing 100027, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(14), 3525; https://doi.org/10.3390/rs15143525
Submission received: 2 June 2023 / Revised: 27 June 2023 / Accepted: 8 July 2023 / Published: 12 July 2023
(This article belongs to the Special Issue Deep Learning and Computer Vision in Remote Sensing-II)

Abstract

Small-object detection is a challenging task in computer vision due to the limited training samples and low-quality images. Transfer learning, which transfers the knowledge learned from a large dataset to a small dataset, is a popular method for improving performance on limited data. However, we empirically find that due to the dataset discrepancy, directly transferring the model trained on a general object dataset to small-object datasets obtains inferior performance. In this paper, we propose TranSDet, a novel approach for effective transfer learning for small-object detection. Our method adapts a model trained on a general dataset to a small-object-friendly model by augmenting the training images with diverse smaller resolutions. A dynamic resolution adaptation scheme is employed to ensure consistent performance on various sizes of objects using meta-learning. Additionally, the proposed method introduces two network components, an FPN with shifted feature aggregation and an anchor relation module, which are compatible with transfer learning and effectively improve small-object detection performance. Extensive experiments on the TT100K, BUUISE-MO-Lite, and COCO datasets demonstrate that TranSDet achieves significant improvements compared to existing methods. For example, on the TT100K dataset, TranSDet outperforms the state-of-the-art method by 8.0% in terms of the mean average precision (mAP) for small-object detection. On the BUUISE-MO-Lite dataset, TranSDet improves the detection accuracy of RetinaNet and YOLOv3 by 32.2% and 12.8%, respectively.

1. Introduction

With the application of automatic feature engineering in deep learning methods, significant progress has been made in object detection tasks in recent years [1,2,3,4,5,6,7], achieving accurate object recognition and localization in multiple scenarios. Existing cutting-edge object detection methods mainly focus on large objects, whereas small-object detection remains a challenging task due to the limited training samples and low-quality images. However, small-object detection is essential in many real-world applications of object detection. For example, in traffic scenes, the detection of traffic signs, small vehicles, and pedestrians is crucial for road safety [8,9,10], whereas in mining scenes, the detection of small objects such as miners and mining carts plays a positive role in enhancing safety and production efficiency [11,12,13]. Therefore, research on methods and technologies for small-object detection has significant academic and practical value.
In deep learning, the performance of methods is deeply affected by the size of the training dataset, and models trained on small datasets usually exhibit inferior performance. A solution to this problem is transfer learning [14,15,16,17,18], which aims to improve performance on small datasets by initializing the model with weights learned on a large dataset. Therefore, an intuitive idea for improving small-object detection performance is to adapt transfer learning to it. However, we empirically found that transferring the model from a large general object dataset (e.g., COCO [19]) to the target small-object dataset performs poorly. As shown in Figure 1, we transferred the Faster R-CNN model trained on COCO to the TT100K-Lite dataset, and the resulting detection performance AP 50 was even worse than the AP 50 of a traditional training strategy without transfer learning. A possible reason for this counter-intuitive failure is that the proportion of small objects in general object datasets is smaller than that of medium and large objects. Therefore, the learned weights are less effective for small objects. We present the distributions of small, medium, and large objects in popular object detection datasets in Table 1. Small objects [19] are defined as objects with an area smaller than 32 × 32 pixels, whereas large objects are defined as objects with an area greater than 96 × 96 pixels. In general object datasets, the proportion of small objects is low, whereas most objects in small-object datasets are small- and medium-sized. This discrepancy in data distribution between general and small-object datasets restricts the effectiveness of transfer learning.
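To make the size convention in Table 1 concrete, the following minimal Python sketch counts the proportions of small, medium, and large boxes in a COCO-format annotation file using the 32 × 32 and 96 × 96 pixel-area thresholds; the file path and function names are illustrative only.

```python
import json
from collections import Counter

def size_bucket(area):
    # COCO convention: small < 32*32 px, medium in [32*32, 96*96], large > 96*96 px.
    if area < 32 ** 2:
        return "small"
    if area <= 96 ** 2:
        return "medium"
    return "large"

def object_size_distribution(annotation_file):
    """Return the fraction of small/medium/large boxes in a COCO-format file."""
    with open(annotation_file) as f:
        coco = json.load(f)
    counts = Counter(size_bucket(ann["area"]) for ann in coco["annotations"])
    total = sum(counts.values())
    return {k: counts[k] / total for k in ("small", "medium", "large")}

# Example (hypothetical path):
# print(object_size_distribution("annotations/instances_train2017.json"))
```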
This paper presents TranSDet, a novel method for effective transfer learning for small-object detection, which aims to reduce the transfer discrepancy from a general object dataset to a small-object dataset by incorporating an additional dynamic resolution adaptation scheme. Specifically, we propose to adapt a model trained on a general dataset to a small-object-friendly model by augmenting the training images in the general dataset with diverse smaller resolutions. This gradually shifts the weights toward small objects without losing the discriminative information in the original model. The diverse resolutions are used to ensure consistent performance on various sizes of objects, and we introduce a meta-learning scheme to balance the learning of resolutions.
Another mainstream solution for improving small-object detection performance is to enhance the model architecture. For instance, FPN [25] connects features at different scales to enhance the semantic information of shallower features with deeper ones. Carafe [26] proposes learning image upsampling kernels for better fusion of features. FPG [27] further improves the feature pyramid with fused multi-directional lateral connections, and MFR-CNN [28] combines global information with locally extracted multi-scale features to augment small-object features. However, most of the existing works require learning additional network modules, which inevitably disturb the pretrained features in transfer learning and make them incompatible with transfer learning. In this paper, we propose two components for small-object detection networks that are effective, efficient, and transfer learning-friendly: (1) SFA-FPN. We observe that the traditional upsampling operations in FPN perturb the features on shallower layers by interpolating pixels to their neighboring pixels, resulting in the imprecise recognition and localization of small objects. To address this problem, we propose an SFA-FPN module that uses shifted feature aggregation to shift the upsampled pixels to their correct positions. (2) Anchor relation module. We introduce an anchor relation module that captures the relationships between object anchors with transformer blocks to enhance the anchor features. By combining both network components with the proposed dynamic resolution adaptation transfer learning, our TranSDet can achieve further improvements.
The contributions of this work can be summarized as follows:
(1) We propose a meta-learning-based dynamic resolution adaptation scheme for transfer learning that effectively improves the performance of transfer learning in small-object detection.
(2) We propose two network components, an SFA-FPN and an anchor relation module, which are compatible with transfer learning and effectively improve small-object detection performance.
(3) We conduct extensive experiments on the TT100K [22], BUUISE-MO-Lite [11], and COCO [19] datasets. The results demonstrate that our method, TranSDet, achieves significant improvements compared to existing methods. For example, on the TT100K-Lite dataset, TranSDet improves the detection accuracy of Faster R-CNN and RetinaNet by 8.0% and 22.7%, respectively. On the BUUISE-MO-Lite dataset, TranSDet improves the detection accuracy of RetinaNet and YOLOv3 by 32.2% and 12.8%, respectively, compared to the baseline models. These results suggest that TranSDet is an effective method for improving small-object detection accuracy using transfer learning.

2. Related Works

2.1. Small-Object Detection

In recent years, deep learning has gained increasing attention and has been successfully applied in many practical applications, leading to significant progress in object detection. However, the majority of prior efforts have been tuned for large-object detection, leaving limited experience and knowledge for small-object detection [29]. Detecting small objects in computer vision remains a challenging task [30]. Firstly, the features generated by basic CNNs lack the information needed for small-object detection. Secondly, small objects lack appearance information and have more location possibilities, requiring higher precision for accurate localization. Thirdly, context information is lacking, making it difficult to differentiate small objects from their surroundings. Fourthly, there is an imbalance of foreground and background training examples, and an insufficient number of positive training examples for small objects, making classification difficult. Even state-of-the-art networks exhibit significant performance gaps between the detection of small- and normal-sized objects. For example, DyHead [31] achieved only a 28.3% mean average precision (mAP) for small objects on the COCO test set, significantly lagging behind medium (50.3%) and large (57.5%) objects.
For small-object detection, low-level features of convolutional neural networks are often more effective than high-level features [32]. To fully utilize the semantic information from high-level features and the fine-grained features from low-level features, researchers have employed various methods to improve the detection accuracy of small objects. For example, SSD [3] increases the depth of the feature extraction network and uses a dense connection structure to improve small-object detection accuracy. FPN [25] connects feature maps of different scales from top to bottom to enhance the features at each scale. YOLOv3 [33] detects small, medium, and large objects on three independent feature maps of different scales. Carafe [26] proposes learning image upsample kernels instead of traditional interpolation operations, achieving better fusion quality. FPG [27] improves the feature pyramid with fused multi-directional lateral connections. PANet [34] enhances the entire feature hierarchy by using accurate localization signals in the lower layers via a bottom-up pathway. MFR-CNN [28] combines global information with locally extracted multi-scale features, improving detection accuracy for small objects and severely occluded objects in traffic scenes. SODNet [35] enhances small-object detection accuracy with a spatial parallel convolution module, split-fusion sub-module, and fast multi-scale fusion module, achieving high accuracy and real-time performance on multiple benchmark datasets. FE-CenterNet [36] enhances small-object detection accuracy with an attention mechanism, feature enhancements, and an anchor-free architecture, achieving a 7.2% higher AP metric with a 1.3 FPS decrease. SRODNet [8] improves vehicular detection accuracy by modifying the residual block in the super-resolution module and optimizing it jointly with YOLOv5. DetectFormer [9] incorporates a ClassDecoder and global information, with data augmentation and an attention mechanism in the backbone network, to enhance category sensitivity and real-time detection performance for traffic scenes. AMMFN [37] enhances small-object detection accuracy on remote sensing images with multi-scale feature fusion, attention mechanisms, and a normalized Wasserstein distance and generalized intersection-over-union location regression loss function. FusionPillars [38] utilizes the Set-Abstraction-Self (SAS) fusion module and the Pseudo-View-Cross (PVC) fusion module to fuse multisensor data for 3D object detection, resulting in enhanced detection precision and performance in detecting smaller objects.
Another direct solution is to increase the size of the input images, which raises the relative resolution of small targets and enables the acquisition of high-resolution feature maps. MS-CNN [39] significantly improves the detection performance of small objects by adding an upsampling layer to the feature maps obtained by the deconvolutional layer. STDnet [40] employs a visual attention mechanism to select the most promising regions and discards the rest of the input image, thereby preserving high-resolution feature maps in deeper layers. Cascade R-CNN++ [41] employs an ensemble strategy, a modified loss function, and enhanced bounding box regression to effectively detect small objects in multi-resolution remote sensing images, outperforming previous methods. Unfortunately, these deep learning-based object detection algorithms strongly rely on massive training data, and they often perform poorly in situations with small samples.

2.2. Transfer Learning in Object Detection

Deep learning algorithms attempt to learn high-level features from massive amounts of data, which allows deep learning to go beyond traditional machine learning. However, collecting data is complex and expensive, and in specific domains it is very difficult to construct large-scale, high-quality datasets with annotations. Transfer learning relaxes the assumption that the training data must be independent and identically distributed with the test data, making it a natural choice for solving the problem of few-shot object detection.
Deep transfer learning is the study of how to utilize knowledge from other domains through deep neural networks. With the wide application of deep neural networks in various fields, a large number of deep transfer learning methods have been proposed. For example, Redmon et al. [42] jointly trained large classification and smaller detection datasets, allowing feature transfer between tasks to boost small-object detection accuracy. Wang et al. [43] revealed that fine-tuning only the final layer of the object detector on a balanced subset while keeping the rest of the model fixed, significantly improves detection accuracy. Liang et al. [44] introduced a transfer learning method utilizing residual thought and dilated convolutions. The method is initialized with large-scale datasets and aims to address issues such as low image resolution and partial occlusions. Wang et al. [45] presented a deep transfer learning model using a modified ResNet-50 model and scale feature learners for bearing-fault diagnosis, resulting in a reliable and generalizable model. Loey et al. [46] proposed a hybrid deep transfer learning model using Resnet50 for feature extraction and ensemble algorithms, achieving up to 100% accuracy in face-mask detection on three datasets. Tang et al. [47] actively queried labels for the bounding boxes of source images using informativeness and transferability criteria to improve the target model with cost-effective supervision from source data. Sun et al. [48] utilized contrastive learning to train a proposal encoder, transferring knowledge from large datasets to few-shot target detection tasks, thereby enhancing model performance. Zhu et al. [49] enhanced few-shot target detection by introducing a semantic relation reasoning module. Kaul et al. [50] achieved comparable accuracy to traditional methods in few-shot scenarios by rapidly adapting to target categories and backgrounds through labeling, verification, and correction. Yan et al. [51] proposed an improved Faster R-CNN for tailings pond detection using a step-by-step transfer learning approach and increased inputs to four multispectral bands, resulting in more precise detection and a higher recall rate.
However, most existing methods focus on transfer learning between datasets of similar scales, with limited work considering the transfer of a model trained on large, general datasets to small-object, few-shot datasets. We observed that the aforementioned approaches underperformed in this context, leading us to propose a novel transfer learning method tailored for small-object detection.

3. Methodology

In this section, we formulate our proposed method TranSDet, which consists of two main components: (1) We propose a dynamic resolution adaptation scheme to better transfer the model trained on a normal dataset to a small-object detection dataset. (2) We propose a new small-object detection module to enhance small-object detection performance without affecting transfer learning efficacy.

3.1. Dynamic Resolution Adaptation Transfer Learning

We first review traditional transfer learning methods on object detection [43]. As illustrated in Figure 2, to improve object detection performance on a few-shot dataset, current transfer learning methods aim to first train a model on a large and general dataset (stage I), and then transfer the learned weights to the target few-shot dataset (stage III). By adopting this simple two-stage learning strategy, a model initialized with a pretrained model on a large dataset can obtain discriminative features on both foreground-background classification and object classification, thereby improving its generalization on few-shot datasets.
However, most methods are designed to transfer knowledge from a general dataset to another general few-shot dataset, whereas for a small-object few-shot dataset, we have empirically found it difficult to achieve significant transfer performance compared to a normal few-shot dataset, as shown in Figure 1. In this paper, we regard the task of solving the dataset gap in object sizes as an adaptation task, which encourages the model trained on general datasets to adapt to small-object detection datasets. As a result, we propose a meta-learning-based dynamic resolution adaptation transfer (DRAT) learning scheme to efficiently and effectively adapt the pretrained general model to a small-object-friendly model. As shown in Figure 2, we introduce an additional stage (stage II) to adjust the pretrained model using DRAT, and then transfer the adjusted model to the target dataset.
The pretrained model on a general dataset is trained with only a small proportion of small objects. To increase its generalization for small objects, we aim to resize the input images to a small resolution, so that the medium and large objects in the original resolution become small objects in the small resolution. Meanwhile, considering that the object sizes in our target dataset are not identical, we propose to fine-tune the pretrained model with multiple resolutions to allow it to adapt well to all object sizes.
Formally, given a set of resolutions $\mathcal{R}$ and a model $M$ with weights $\theta_{pre}$ pretrained on a general dataset, our objective is to learn a model that generalizes well to all resolutions in $\mathcal{R}$ and adapts effectively to the new small-object dataset, i.e.,

$$\theta^{*} = \arg\min_{\theta} \; \mathbb{E}_{R_i \in \mathcal{R}} \, \mathcal{L}(D, R_i, M),$$
where $D$ denotes the dataset and $\mathcal{L}$ is the loss function. To solve the above meta-learning problem, we adopt the widely used MAML (Model-Agnostic Meta-Learning) framework [52], which is applicable to any model trained with gradient descent. The iterative learning strategy of MAML is summarized in Algorithm 1. At each training iteration, our DRAT performs (1) An inner update: for each resolution $R_i$ in the resolution set $\mathcal{R}$, the parameters $\theta_i$ are updated from the generic parameters $\theta$ by sampling and training on a mini-batch of images $X_i$ at that resolution, i.e.,

$$\theta_i = \theta - \alpha \nabla_{\theta} \mathcal{L}_{R_i}(M(\theta; X_i)),$$
where $\alpha$ denotes the step size for the inner update. (2) An outer update: the generic parameter $\theta$ is updated through gradient descent, where the meta-loss is the summation of losses across all meta-tasks (resolutions), i.e.,

$$\theta_{\mathrm{new}} = \theta - \beta \nabla_{\theta} \sum_{R_i \in \mathcal{R}} \mathcal{L}_{R_i}(M(\theta_i; X_i)).$$
The updated generic parameters $\theta_{\mathrm{new}}$ are then used for the next iteration. When the training is complete, the generic parameters become the final learned parameters of the model, and the model that adapts well to all resolutions is utilized for transferring to the small-object detection dataset. Note that, for training efficiency, we use the first-order approximation of MAML, which discards the Hessian (second-order) terms and has a training speed similar to traditional detection training.
Algorithm 1 Dynamic resolution adaptation meta-learning
Require: Adapting resolution set $\mathcal{R}$, model $M$, dataset $D$, training loss $\mathcal{L}$.
Require: Meta-learning step sizes of the inner and outer updates, $\alpha$ and $\beta$.
  Initialize model $M$ with pretrained weights $\theta \leftarrow \theta_{pre}$;
  while training not complete do
    for all $R_i \in \mathcal{R}$ do
      Sample a batch of images $X_i$ and resize to $R_i$;
      Evaluate $\nabla_{\theta} \mathcal{L}_{R_i}(M(\theta; X_i))$;
      Compute adapted weights using gradient descent: $\theta_i = \theta - \alpha \nabla_{\theta} \mathcal{L}_{R_i}(M(\theta; X_i))$;
    end for
    Update $\theta \leftarrow \theta - \beta \nabla_{\theta} \sum_{R_i \in \mathcal{R}} \mathcal{L}_{R_i}(M(\theta_i; X_i))$;
  end while
  return Adapted model $M$ with weights $\theta$.
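As a concrete illustration of Algorithm 1, the following PyTorch-style sketch implements the first-order inner/outer updates over a set of resolutions. It assumes a detector that returns a scalar loss when called as model(images, targets) and a hypothetical dataset.sample_batch(resolution) helper; it is a minimal sketch of the adaptation loop, not the authors' implementation.

```python
import copy
import torch

def drat_adapt(model, dataset, resolutions, inner_lr=0.02, outer_lr=0.02, iterations=1000):
    """First-order MAML over input resolutions (sketch of Algorithm 1)."""
    outer_opt = torch.optim.SGD(model.parameters(), lr=outer_lr)
    for _ in range(iterations):
        meta_grads = [torch.zeros_like(p) for p in model.parameters()]
        for res in resolutions:
            # Inner update: adapt a copy of the generic weights theta to resolution R_i.
            fast_model = copy.deepcopy(model)
            images, targets = dataset.sample_batch(res)          # X_i resized to R_i
            loss = fast_model(images, targets)
            grads = torch.autograd.grad(loss, list(fast_model.parameters()))
            with torch.no_grad():
                for p, g in zip(fast_model.parameters(), grads):
                    p -= inner_lr * g                             # theta_i = theta - alpha * grad
            # Meta-loss on the adapted weights; with the first-order approximation,
            # its gradient is applied directly to the generic weights.
            meta_loss = fast_model(images, targets)
            grads = torch.autograd.grad(meta_loss, list(fast_model.parameters()))
            for mg, g in zip(meta_grads, grads):
                mg += g
        # Outer update: theta <- theta - beta * sum_i grad of L_{R_i}(theta_i).
        outer_opt.zero_grad()
        for p, mg in zip(model.parameters(), meta_grads):
            p.grad = mg
        outer_opt.step()
    return model
```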

3.2. Enhanced Small-Object Detection for Transfer Learning

In order to enhance the performance of small-object detection modules, previous works usually add larger feature maps to the FPN [53], enhance the local information of shallower feature maps [54], and introduce new feature fusion strategies to the FPN [55]. Nevertheless, these methods usually require new network modules, which introduce new learnable parameters and significantly change the output features of the FPN. Note that for few-shot small-object detection, we transfer a pretrained model to a few-shot dataset, where the pretrained model is trained with a normal architecture designed for normal datasets. Directly injecting existing learnable and randomly initialized modules into the pretrained model would significantly change the semantic information present in the original features, leading to catastrophic forgetting of the pretrained knowledge.
In this paper, we propose a new FPN with a shifted feature aggregation (SFA-FPN) module and an anchor relation module to enhance small-object detection performance without disturbing the semantic information in the original model.

3.2.1. FPN with Shifted Feature Aggregation (SFA-FPN)

In a conventional FPN, the features at different scales are connected sequentially, and the feature map with a lower resolution is upsampled using linear interpolation to one with a higher resolution to perform a summation with its previous feature map. However, this naïve upsample interpolation can perturb the high-resolution feature map by assigning non-associated features to multiple pixels, as shown in Figure 3. As a result, the high-resolution feature maps, which are important for detecting small objects, are significantly polluted by low-resolution ones, leading to inferior performance.
In this paper, we aim to solve the above problem by learning to interpolate low-resolution features into high-resolution ones. Unlike previous methods that change the output features using learnable interpolation functions or additional convolutional layers, we propose a simple yet effective approach that involves shifting and aggregating pixels without changing their feature distributions. Specifically, as shown in Figure 4, for each pixel in an interpolated feature map $F \in \mathbb{R}^{C \times 2H \times 2W}$ with a $2\times$ upsampling ratio, where $C$, $W$, and $H$ denote the channels, width, and height of the feature map, respectively, we first shift it along eight directions to obtain the shifted features. All nine features $F^{(s)} \in \mathbb{R}^{9 \times C \times 2H \times 2W}$ (including the original one) represent the possible true positions of the pixel. Then, we introduce a simple convolution module to predict the weights $W \in \mathbb{R}^{9 \times 2H \times 2W}$ for the nine feature maps by concatenating the feature map $F$ with the feature map from its previous stage; these weights denote the probabilities of the pixels being in the correct position. In other words, if a pixel in the interpolated feature map is not in the correct position, it should be discarded and has a weight of 0, whereas a pixel in the correct position should have a weight of 1. Multiplying the weights with the shifted feature maps and summing the weighted feature maps yields our refined feature map $F^{(r)} \in \mathbb{R}^{C \times 2H \times 2W}$, i.e.,

$$F^{(r)} = \sum_{i=1}^{9} W_i \odot F_i^{(s)},$$

where $\odot$ denotes the Hadamard product, the weights $W_i$ are broadcast along the channel dimension, and the summation runs over the nine shifted feature maps.
After the shifted feature aggregation, we also adopt a channel attention module based on an SE module [56] to select and enhance the aggregated feature map F ( r ) . Specifically, as illustrated in Figure 5, the module first computes the global image features on channels by averaging across the spatial axes. Then, a squeeze-and-excitation structure with a reduction convolution, followed by an activation function and an expansion convolution, is introduced to extract the features and reduce the computational cost. Finally, we use a convolution layer to predict the attention weights of each channel and multiply these weights onto the refined feature map. This channel attention module adaptively assigns larger weights to valuable channels and smaller weights to noisy channels, helping us further reduce the disturbances caused by fusing two feature maps.
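The following PyTorch sketch illustrates one possible implementation of the shifted feature aggregation and channel attention described above. The module structure, the softmax normalization of the nine per-pixel weights, and the final fusion with the lateral feature map are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShiftedFeatureAggregation(nn.Module):
    """Sketch of SFA: re-weight nine shifted copies of the upsampled top-down map."""

    def __init__(self, channels):
        super().__init__()
        # Predicts nine per-pixel weights from the upsampled and lateral maps.
        self.weight_pred = nn.Conv2d(2 * channels, 9, kernel_size=3, padding=1)
        # SE-style channel attention applied to the refined map.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1),
            nn.Sigmoid(),
        )
        # The identity position plus eight shift directions (dy, dx).
        self.shifts = [(0, 0), (-1, -1), (-1, 0), (-1, 1),
                       (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

    def forward(self, lateral, top_down):
        # Upsample the coarser top-down map to the lateral resolution.
        up = F.interpolate(top_down, size=lateral.shape[-2:], mode="bilinear",
                           align_corners=False)
        h, w = up.shape[-2:]
        # Build the nine shifted candidates by zero-padding and cropping.
        shifted = []
        for dy, dx in self.shifts:
            pad = (max(dx, 0), max(-dx, 0), max(dy, 0), max(-dy, 0))  # left, right, top, bottom
            s = F.pad(up, pad)
            s = s[..., max(-dy, 0):max(-dy, 0) + h, max(-dx, 0):max(-dx, 0) + w]
            shifted.append(s)
        shifted = torch.stack(shifted, dim=1)                    # (N, 9, C, H, W)
        # Per-pixel probabilities that each candidate is the correct position.
        weights = torch.softmax(self.weight_pred(torch.cat([lateral, up], dim=1)), dim=1)
        refined = (weights.unsqueeze(2) * shifted).sum(dim=1)    # (N, C, H, W)
        refined = refined * self.se(refined)                     # channel attention re-weighting
        return lateral + refined                                 # standard FPN-style fusion
```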

3.2.2. Anchor Relation Module

Small objects in images are usually blurred and of low resolution, and thus the anchors of small objects are not as discriminative as the anchors of larger objects. In this paper, to enhance the region-of-interest (RoI) features of small objects, we propose an anchor relation module to model the relations between different anchors and improve the anchor features with the related anchors. The motivation behind this module is that, for recognizing the object in one anchor, the content in other anchors can benefit the recognition through their relations, e.g., (1) similar objects with better image quality; (2) different parts of an object; (3) dependencies between objects (for instance, in mining scenarios, safety helmets co-occur with workers).
Our relation module leverages transformer blocks [57] to effectively model the relations between anchors. Specifically, given the input anchor features $A \in \mathbb{R}^{N \times C_A}$ and their coordinates $P \in \mathbb{R}^{N \times 4}$, where $N$ and $C_A$ denote the number of anchors in an image and the number of channels of an anchor, respectively, and each coordinate contains the positions $(x, y)$ of the top-left and bottom-right points, we first bind the positional information to each anchor by adding the positional encoding to the anchor features. Following DETR [58], we generalize the positional encoding of the Transformer to the 2D image scenario. For an anchor with normalized coordinates $(x_0, y_0, x_1, y_1) \in [0, 1]^4$, its positional encoding is defined as $P = [PE(x_0) : PE(y_0) : PE(x_1) : PE(y_1)]$, where $[\,:\,]$ denotes concatenation and the function $PE$ is formulated as

$$PE(a)_{2i} = \sin\!\left(a / 10000^{2i/D}\right), \qquad PE(a)_{2i+1} = \cos\!\left(a / 10000^{2i/D}\right),$$

where $D = C_A / 4$ is the dimension of $PE$. Then, the anchor features with the positional encoding added are fed into a Transformer encoder to extract the relation features of the anchors. We use the output features of the encoder as the input features of the prediction heads (classification head and regression head) of the model.
The overall architecture of the anchor relation module is illustrated in Figure 6. Our encoder is stacked with $L$ ($L = 2$ in our experiments) transformer blocks. With an input sequence of features $I = A + P$, each transformer block processes it with a multi-head self-attention module and a multi-layer perceptron module. (1) Multi-head self-attention (MHSA): MHSA measures the similarities between each pair of features and then combines the features to obtain the output features based on the attention scores. Formally, with $I \in \mathbb{R}^{N \times C_A}$, the output $O_A$ of MHSA is computed as

$$O_A := \mathrm{Attention}(I_q, I_k, I_v) = \mathrm{softmax}(I_q I_k^{T}) \, I_v,$$

where $I_q$, $I_k$, and $I_v$ are produced by the $Q$, $K$, and $V$ projections of $I$, respectively. (2) Feed-forward network: $O_A$ is then fed into a multi-layer perceptron (also referred to as a feed-forward network, FFN), which consists of two fully-connected (FC) layers with an activation function in between, i.e.,

$$O_E = \mathrm{FC}(\mathrm{Act}(\mathrm{FC}(O_A))).$$
The output of the last transformer block is then passed to the regression head and classification head of the detection model to generate the predictions.
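A minimal PyTorch sketch of the anchor relation module is given below. It assumes RoI features of shape (N, C_A) and normalized box coordinates, and it reuses the built-in transformer encoder; the exact channel ordering of the sinusoidal encoding and the hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

def box_positional_encoding(boxes, dim):
    """Sinusoidal encoding of normalized (x0, y0, x1, y1) box coordinates.

    Each coordinate receives dim/4 channels (half sine, half cosine); dim must
    therefore be divisible by 8. Channel ordering is simplified for clarity.
    """
    d = dim // 4
    freqs = torch.pow(10000.0, torch.arange(0, d, 2, dtype=torch.float32,
                                            device=boxes.device) / d)
    parts = []
    for i in range(4):                       # x0, y0, x1, y1
        a = boxes[:, i:i + 1] / freqs        # (N, d/2)
        parts.extend([torch.sin(a), torch.cos(a)])
    return torch.cat(parts, dim=1)           # (N, dim)

class AnchorRelationModule(nn.Module):
    """Transformer encoder over RoI features to model anchor relations (sketch)."""

    def __init__(self, channels=1024, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=num_heads,
                                           dim_feedforward=4 * channels,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, anchor_feats, boxes):
        # anchor_feats: (N, C_A) RoI features; boxes: (N, 4) normalized coordinates.
        pos = box_positional_encoding(boxes, anchor_feats.shape[-1])
        x = (anchor_feats + pos).unsqueeze(0)   # treat the N anchors as one sequence
        return self.encoder(x).squeeze(0)       # relation-enhanced anchor features
```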
The overall network structure of TranSDet is illustrated in Figure 7. It contains shifted feature aggregation (SFA) and channel attention (CA) modules in the FPN and an anchor relation (AR) module before the prediction heads.

4. Experiments

4.1. Datasets and Evaluation Metrics

We chose the COCO dataset as the base dataset for this study, which is a widely used benchmark dataset in the field of object detection. Released by Microsoft in 2014, it contains over 330,000 well-labeled images of common objects from 80 different categories, with multiple objects in each image. Compared with VOC and ILSVRC (ImageNet), COCO contains smaller objects with a denser distribution, making it more similar to real-world scenarios. COCO has become the standard dataset for object detection.
For the novel datasets, we chose the TT100K [22] and BUUISE-MO datasets [11]. TT100K is a dataset for traffic sign detection and classification, whereas BUUISE-MO is a mining object detection dataset.
The evaluation metric used to assess the model detection precision in this paper was the average precision (AP). We used the AP 50 , AP S , AP M , and AP L to evaluate the detection capabilities of objects of different sizes on the TT100K and BUUISE-MO datasets. Among them, AP 50 represents the average precision when the intersection over union (IoU) is greater than or equal to 0.5, and AP S , AP M , and AP L represent the average precision of small, medium, and large objects, respectively. For the COCO dataset, we report the standard AP, AP 50 , AP 75 , AP S , AP M , and AP L . We ran all the methods five times with different seeds and present the mean and the standard deviation as the final scores.
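For reference, these metrics can be computed with the standard COCO evaluation tools. The snippet below is a minimal sketch assuming pycocotools and detections exported in the COCO result format; the file paths are placeholders.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths for the ground-truth annotations and the detection results.
coco_gt = COCO("annotations/instances_val.json")
coco_dt = coco_gt.loadRes("results/detections.json")  # list of {image_id, category_id, bbox, score}

coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()  # prints AP, AP50, AP75 and the size-specific APs (small/medium/large)
```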

4.2. Models and Training Strategies

Our experiments were conducted on a computer equipped with an Intel Core i9-7900X CPU (3.3 GHz) and an NVIDIA TITAN V GPU (12 GB). We used the MMDetection deep learning framework [59] with the default epoch settings for the training process. We compared three object detection models: Faster R-CNN (two-stage), RetinaNet (one-stage), and YOLOv3 (efficient). Note that we only adopted SFA-FPN in RetinaNet and YOLOv3, since these networks do not contain the RoI features needed for leveraging the anchor relation module. For the training strategies, we trained the models using an SGD optimizer with a momentum of 0.9 and a weight decay of $10^{-4}$. A step learning rate schedule, which decayed the learning rate by a factor of 0.1 at the 8th and 11th epochs, was adopted with an initial value of 0.02. The training used standard data augmentations, including resizing, random flipping, and padding. The number of training epochs was set to 12 for the COCO dataset, whereas for the small TT100K-Lite and BUUISE-MO datasets, we trained the models for 36 epochs. The height resolutions of the input images on the COCO, TT100K-Lite, and BUUISE-MO datasets were resized to 800, 1440, and 1080, respectively, and the image widths were adjusted to maintain the original aspect ratios of the images.
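The schedule above corresponds to a standard MMDetection (2.x)-style configuration; a fragment is sketched below for illustration, with the image scale and other dataset-specific settings as placeholders rather than the exact configuration files used in the paper.

```python
# Illustrative MMDetection 2.x config fragment (not the authors' exact config).
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=1e-4)
optimizer_config = dict(grad_clip=None)
lr_config = dict(policy='step', step=[8, 11])           # decay lr by 0.1 at epochs 8 and 11
runner = dict(type='EpochBasedRunner', max_epochs=12)   # 36 epochs for the few-shot datasets

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),  # height 1440/1080 for TT100K-Lite/BUUISE-MO
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', mean=[123.675, 116.28, 103.53],
         std=[58.395, 57.12, 57.375], to_rgb=True),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]
```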

4.3. Results on TT100K

The TT100K dataset contains 100,000 high-resolution (2048 × 2048) images and has been widely utilized for traffic sign detection and classification. With 30,000 instances of traffic signs, this dataset is an excellent benchmark for small-object detection, as it contains a considerable number of small objects. To address the imbalance in the number of instances across the different traffic sign classes, we excluded classes with fewer than 100 instances [22], resulting in 45 classes for our experiments.
We first conducted experiments to investigate the impact of the number of training images on the detection precision in small-object detection tasks. Faster R-CNN [1] is a highly accurate and scalable object detection algorithm that has gained attention for its performance in such tasks. We trained the Faster R-CNN model from scratch on the TT100K-Lite dataset, which is a subset of the original TT100K dataset containing only 45 categories. To sufficiently evaluate our performance on different numbers of training samples, we created four training sets with different proportions (100%, 10%, 5%, and 1%), where the proportions denote the randomly sampled proportions of the original training set. Our experimental results demonstrated that the number of training images had a significant impact on the detection performance of Faster R-CNN, as shown in Table 2. A limited number of training images led to a sharp decline in detection precision, indicating that the model’s ability to detect objects was significantly reduced when the number of training images was limited. This decline in precision was particularly pronounced for small objects, as evidenced by the decreasing AP values as the proportion of training images decreased. These findings highlight the importance of transfer learning and dynamic resolution adaptation, as proposed in this paper, in enhancing object detection performance in few-shot scenarios. By leveraging transfer learning and dynamically adjusting the resolution of the input images during training, our proposed approach can effectively address the challenges of limited training samples and low-quality images, leading to significant improvements in small-object detection performance.
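One way to construct such proportional subsets from a COCO-format annotation file is sketched below; the function and file names are illustrative and do not reflect the authors' exact split protocol.

```python
import json
import random

def sample_subset(annotation_file, proportion, output_file, seed=0):
    """Randomly keep a proportion of images (and their boxes) from a COCO-format file."""
    with open(annotation_file) as f:
        coco = json.load(f)
    random.seed(seed)
    num_keep = int(len(coco["images"]) * proportion)
    keep = {img["id"] for img in random.sample(coco["images"], num_keep)}
    subset = dict(coco)
    subset["images"] = [img for img in coco["images"] if img["id"] in keep]
    subset["annotations"] = [a for a in coco["annotations"] if a["image_id"] in keep]
    with open(output_file, "w") as f:
        json.dump(subset, f)

# e.g., sample_subset("tt100k_lite_train.json", 0.10, "tt100k_lite_train_10p.json")
```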
The aim of this paper was to address the above-mentioned problem by proposing an effective small-object detection method for transfer learning. Here, we conducted experiments to show the efficacy of our TranSDet model compared to various object detection methods, including Faster R-CNN [1], RetinaNet [2], and YOLOv3 [33]. RetinaNet [2] is a one-stage algorithm that employs a focal loss to address the class imbalance problem in object detection, whereas YOLOv3 [33] is an efficient detection algorithm that utilizes a fully convolutional neural network to detect objects. In comparative experiments, Faster R-CNN (FRCNN), RetinaNet, and YOLOv3 were used as typical algorithms to evaluate the performance of TranSDet. The experimental results in Table 3 show that TranSDet consistently outperformed the baseline algorithms on all three subsets of the TT100K-Lite dataset, with a significant improvement in detection accuracy for small-object detection tasks. Specifically, for the 10% subset, TranSDet achieved an 8% absolute improvement in detection accuracy for Faster R-CNN, a 22.7% improvement for RetinaNet, and a 6.6% improvement for YOLOv3, compared to their respective baselines. Similar improvements were observed for the 5% and 1% subsets, where TranSDet consistently outperformed the baseline algorithms across all three algorithms and achieved notable improvements in detection accuracy.
Furthermore, we conducted experiments on recently proposed and more advanced object detection models: RetinaNet with Swin Transformer [60], anchor-free RepPoints [61], and end-to-end Deformable DETR [62]. As summarized in Table 4, the advanced models also suffered due to the limited training samples in the TT100K-Lite small-object detection dataset, whereas our TranSDet achieved significant improvements over them, demonstrating the generality and effectiveness of our method. Specifically, on the 10% TT100K-Lite dataset, Deformable DETR only achieved an AP 50 of 27.8 since its Transformer-based detection head required a large number of training samples to converge and avoid overfitting, whereas when using our proposed TranSDet model, performance was improved by 27.0.
It is worth noting that the performance of all the algorithms generally decreased as the dataset size decreased. However, we observed that TranSDet consistently exhibited performance improvements over the baselines, even when the proportion of training data was as low as 1%. These results suggest that TranSDet is effective in improving small-object detection and can robustly adapt to different dataset sizes.

4.4. Results on BUUISE-MO

The dataset used in this study was derived from the BUUISE-MO dataset, which is an open-pit-mine object detection dataset established by the team at the Beijing Information Service Engineering Key Laboratory of Beijing Union University [11]. The BUUISE-MO dataset comprises 9720 images with a resolution of 1920 × 1080, including 7220 training images and 2500 test images, as shown in Figure 8. The dataset has 15 categories, including truck, forklift, car, excavator, people, sign, etc. A total of 6041 large objects, 9230 medium objects, and 12,043 small objects are labeled, making the dataset suitable for small-object detection tasks.
To create a few-shot dataset, we randomly selected 610 images from the training set and 500 images from the test set. We refer to this dataset as BUUISE-MO-Lite in this paper. Similar to the previous experiment, we trained Faster R-CNN, RetinaNet, and YOLOv3, and the results are shown in Table 5. The primary objective of this experiment was to demonstrate the ability of the TranSDet method to generalize well across different datasets and algorithms, given that the previous experiment (Table 3) had already demonstrated the effectiveness of TranSDet in improving small-object detection accuracy. The results showed that TranSDet has good generalization abilities, as it significantly improved the detection accuracy for all three algorithms on the BUUISE-MO-Lite dataset. The consistently better performance of TranSDet in small-object detection reaffirms its effectiveness, which can be beneficial in scenarios where large labeled datasets are not available. Notably, compared to the Faster R-CNN method, our method achieved more significant improvements on efficient one-stage detectors RetinaNet and YOLOv3. This demonstrates that our method also adapts well to detectors with limited FLOPs and parameters, and can help those detectors achieve competitive performance compared to the resource-heavy two-stage detector Faster R-CNN.

4.5. Small-Object Results on COCO Dataset

We also conducted experiments on the large-scale general COCO dataset to validate the efficacy of our proposed small-object modules, namely the FPN with shifted feature aggregation (SFA-FPN) module and the anchor relation (AR) module. We trained Faster R-CNN with a ResNet-50 backbone using the standard 1 × training schedule in MMDetection [59], and present the standard benchmark metrics of COCO in Table 6. Specifically, AP S , AP M , and AP L are the AP 50 on small, medium, and large objects, respectively. The results show that both the SFA-FPN and AR modules can improve detection performance on COCO, especially for small objects. Our model achieved a 22.1% AP S , surpassing the baseline by 0.9%. This indicates that our proposed network modules not only benefit transfer learning in detection but also improve performance on the large-scale general COCO dataset without using transfer learning.

4.6. Comparison with Transfer Learning Methods

We conducted experiments to compare our dynamic resolution adaptation (DRA) with existing transfer learning methods [43,63] designed for Faster R-CNN object detection. Specifically, a pretrained model fine-tuning method (PTD) [63], which uses a model pretrained on COCO to initialize training on the target dataset, was considered as our baseline. Frozen-layer fine-tuning (FTD) [43] freezes the preceding layers to better preserve the semantic information from the source dataset.
The results presented in Table 7 demonstrate that TranSDet outperformed the other methods across various evaluation metrics. In particular, TranSDet achieved the highest AP scores at different IoU thresholds, including AP 50 , AP S , AP M , and AP L , indicating its superior performance in accurately detecting objects of different sizes. When comparing TranSDet with PTD, we found that TranSDet showed significant improvements in overall object detection performance, as measured by the AP 50 . Additionally, TranSDet outperformed both PTD and FTD in terms of AP S , AP M , and AP L , indicating its superiority in detecting small, medium, and large objects.
However, although the performance of PTD and TranSDet was reasonably stable across different proportions (percentage of samples in the dataset), FTD’s performance decreased significantly, indicating its sensitivity to changes in object size distribution in the target dataset. This finding highlights the importance of carefully selecting an appropriate transfer learning method for different datasets based on the target distribution of objects. Note that our DRA only had to change the weights of the pretrained model in transfer learning, making it compatible with almost all the existing transfer learning methods. Therefore, one can easily implement DRA on more sophisticated transfer learning methods to improve small-object detection performance.

4.7. Ablation Study

4.7.1. Ablation on the Proposed Modules

We first investigated the effects of our proposed dynamic resolution adaptation (DRA), SFA-FPN, and anchor relation (AR) modules. As shown in Table 8, our baseline Faster R-CNN with a ResNet-50 backbone achieved an AP 50 of 50.2%, whereas adding DRA (second row) significantly improved the AP 50 and AP S by 5.7% and 3.9%, respectively. Meanwhile, adopting SFA-FPN (third row) achieved a 6.2% improvement in AP 50 compared to the baseline, and leveraging AR achieved a further improvement of 2.8%. By combining all the proposed modules, our final method achieved the best performance, with a 61.6% AP 50 and a 44.8% AP S , outperforming our baseline by 11.4% and 4.4%, respectively.

4.7.2. Comparison with Previous Small-Object Detection Methods

This experiment aimed to demonstrate the effectiveness of the TranSDet method for small-object detection. To evaluate its performance, comparison experiments were conducted using Faster R-CNN (FRCNN), Carafe [26], and FPG [27]. Both Carafe and FPG are designed for improving the small-object detection performance of the Faster R-CNN baseline. Carafe employs a learnable upsampling module to capture fine-grained information, whereas FPG enhances the feature pyramid structure to better capture object details at all scales. These modifications have been shown to significantly improve performance in small-object detection tasks on conventional datasets, indicating their efficacy in this context.
In Table 9, we present the comparison results obtained by initializing the model on the TT100K-Lite dataset with the same Faster R-CNN model trained on the COCO dataset, while replacing the neck with Carafe, FPG, or our approach. The results show that TranSDet outperformed both Carafe and FPG in all evaluation metrics, with a significantly higher AP at all levels (AP 50 , AP S , AP M , and AP L ). This demonstrates the superiority of the TranSDet structure in small-object detection tasks. It is also worth noting that using the existing Carafe and FPG modules to replace the original FPN module resulted in poorer object detection performance in transfer learning. This can be attributed to the random initialization of additional weights in these modules during transfer learning, which may have disrupted the learned semantic information and degraded accuracy. In conclusion, the experimental results provide strong evidence to support the effectiveness of transfer learning for small-object detection using TranSDet. The use of Carafe and FPG in the comparison experiments further demonstrates the superiority of TranSDet in small-object detection tasks.

4.7.3. Effect of Transferring from a Small-Object Dataset

Compared with a small-object detection dataset, a large-scale general dataset usually contains more diverse samples; therefore, a model trained on it is more suitable for transferring to other detection datasets. In order to clarify why we chose to adapt the model trained on a general dataset instead of directly transferring the model from a small-object dataset, we conducted experiments to transfer the Faster R-CNN model trained on the full TT100K small-object dataset to the BUUISE-MO-Lite dataset. As shown in Table 10, the model initialized with the TT100K pretrained weights obtained inferior results compared to the baseline, which only used the ILSVRC weights to initialize the backbone. This indicates that transferring the model from a small-object dataset with a specific scene would result in a loss of generality and discriminability of the model, leading to poorer performance. In contrast, our method adapted the model trained on the more diverse and general COCO dataset for initialization, resulting in significant improvements compared to the baselines.

4.7.4. Effects of Different Adaptation Resolutions in DRA

In this paper, we proposed DRA to adapt the pretrained normal model to the small-object detection task by fine-tuning it with smaller resolutions on the normal dataset. Here, we conducted experiments to show the effects of the different resolution choices in our DRA approach. In Figure 9, we report the performance of our method on Faster R-CNN and the TT100K-Lite dataset with different maximum resolutions. Here, the maximum resolution denotes the maximum value in our dynamic resolution set, i.e., 640 denotes a resolution set containing 640 and all the resolution choices smaller than it, {640, 560, 480, 400, 320}. The results show that directly leveraging the original weights trained on COCO resulted in performance degradation on both the 10% and 5% subsets. This indicates that the normally trained weights are not suitable for small-object detection tasks. In contrast, with our DRA, the performance was significantly improved at all resolutions, and the maximum resolution of 480 achieved the best performance.

4.8. Complexity Analysis

Furthermore, we analyzed the time complexity of our method. Compared to the original transfer learning method, TranSDet uses an additional dynamic resolution adaptation stage to fine-tune the model learned on the source dataset, which increases the training cost while achieving significant improvements. Additionally, in the adaptation stage, we proposed using meta-learning to better learn the dynamic resolutions; however, this has the same training complexity (same training time and number of iterations) as standard fine-tuning, since we use the first-order approximation of MAML.
To demonstrate the complexity of our proposed detection networks, we present the FLOPs, number of parameters, and inference speed in Table 11. Overall, TranSDet struck a balance between performance and computational complexity. It introduced a slight increase in the FLOPs and parameters compared to the baseline models, indicating reasonable optimization. Regarding the inference speed, the impact was marginal. Consequently, our method is efficient, leading to a significant improvement in performance with only a small computation overhead.

5. Conclusions

Our proposed method, TranSDet, offers a promising solution for small-object detection in deep learning. By leveraging transfer learning and augmenting training images with diverse smaller resolutions, TranSDet addresses the challenges of limited training samples and low-quality images. The dynamic resolution adaptation scheme ensures consistent performance on various object sizes, and the two network components, the FPN with shifted feature aggregation and the anchor relation module, can effectively improve small-object detection accuracy. Extensive experiments and ablation studies demonstrate the effectiveness and efficiency of TranSDet in improving small-object detection accuracy, where it outperformed existing state-of-the-art methods on various datasets. Our study provides valuable insights into designing transfer learning-based models for small-object detection and offers a promising solution for real-world applications, especially in scenarios where large labeled datasets are not available.
Furthermore, TranSDet is not limited to small-object recognition in computer vision. It can also be utilized for other tasks such as semantic segmentation, multi-label classification, and image classification. The transfer learning and adaptive resolution mechanisms employed in TranSDet can be extended to these tasks, enabling improved performance and generalization. This versatility makes TranSDet a valuable tool for a wide range of computer-vision applications beyond small-object detection.

Author Contributions

Conceptualization, X.X. and K.L.; methodology, X.X. and Y.M.; software, X.X.; validation, H.Z., Y.M. and K.L.; formal analysis, H.Z. and X.X.; investigation, H.Z.; resources, K.L.; data curation, H.Z.; writing—original draft preparation, X.X. and H.Z.; writing—review and editing, X.X. and H.B.; visualization, Y.M.; supervision, X.Q. and H.B.; project administration, X.Q. and K.L.; funding acquisition, H.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by a key project of the National Nature Science Foundation of China (Grant No. 61932012); the Science and Technology Major Project of Shanxi Province, China (Grant No. 202101090301013); the Beijing Municipal Education Commission Science and Technology Program (Grant No. KM202111417007); and the Academic Research Projects of Beijing Union University (Grant No. ZK80202003).

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

The authors would like to thank the editors and anonymous reviewers for their valuable comments that greatly improved our manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In IEEE Transactions on Pattern Analysis and Machine Intelligence; IEEE: Piscataway, NJ, USA, 2017; Volume 39, pp. 1137–1149. [Google Scholar] [CrossRef] [Green Version]
  2. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  3. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision—ECCV 2016; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar] [CrossRef] [Green Version]
  4. Shivappriya, S.N.; Priyadarsini, M.J.P.; Stateczny, A.; Puttamadappa, C.; Parameshachari, B.D. Cascade Object Detection and Remote Sensing Object Detection Method Based on Trainable Activation Function. Remote Sens. 2021, 13, 200. [Google Scholar] [CrossRef]
  5. Fan, D.P.; Ji, G.P.; Cheng, M.M.; Shao, L. Concealed Object Detection. In IEEE Transactions on Pattern Analysis and Machine Intelligence; IEEE: Piscataway, NJ, USA, 2022; Volume 44, pp. 6024–6042. [Google Scholar] [CrossRef]
  6. Nnadozie, E.C.; Iloanusi, O.N.; Ani, O.A.; Yu, K. Detecting Cassava Plants under Different Field Conditions Using UAV-Based RGB Images and Deep Learning Models. Remote Sens. 2023, 15, 2322. [Google Scholar] [CrossRef]
  7. Wu, J.; Xu, W.; He, J.; Lan, M. YOLO for Penguin Detection and Counting Based on Remote Sensing Images. Remote Sens. 2023, 15, 2598. [Google Scholar] [CrossRef]
  8. Musunuri, Y.R.; Kwon, O.S.; Kung, S.Y. SRODNet: Object Detection Network Based on Super Resolution for Autonomous Vehicles. Remote Sens. 2022, 14, 6270. [Google Scholar] [CrossRef]
  9. Liang, T.; Bao, H.; Pan, W.; Fan, X.; Li, H. DetectFormer: Category-Assisted Transformer for Traffic Scene Object Detection. Sensors 2022, 22, 4833. [Google Scholar] [CrossRef]
  10. Rasol, J.; Xu, Y.; Zhang, Z.; Zhang, F.; Feng, W.; Dong, L.; Hui, T.; Tao, C. An Adaptive Adversarial Patch-Generating Algorithm for Defending against the Intelligent Low, Slow, and Small Target. Remote Sens. 2023, 15, 1439. [Google Scholar] [CrossRef]
  11. Xu, X.; Zhao, S.; Xu, C.; Wang, Z.; Zheng, Y.; Qian, X.; Bao, H. Intelligent Mining Road Object Detection Based on Multiscale Feature Fusion in Multi-UAV Networks. Drones 2023, 7, 250.
  12. Song, R.; Ai, Y.; Tian, B.; Chen, L.; Zhu, F.; Yao, F. MSFANet: A Light Weight Object Detector Based on Context Aggregation and Attention Mechanism for Autonomous Mining Truck. IEEE Trans. Intell. Veh. 2023, 8, 2285–2295.
  13. Huang, L.; Zhang, X.; Yu, M.; Yang, S.; Cao, X.; Meng, J. FEGNet: A feature enhancement and guided network for infrared object detection in underground mines. Proc. Inst. Mech. Eng. Part D J. Automob. Eng. 2023, 09544070231165627.
  14. Naz, S.; Ashraf, A.; Zaib, A. Transfer learning using freeze features for Alzheimer neurological disorder detection using ADNI dataset. Multimed. Syst. 2022, 28, 85–94.
  15. Chen, J.; Sun, J.; Li, Y.; Hou, C. Object detection in remote sensing images based on deep transfer learning. Multimed. Tools Appl. 2022, 81, 12093–12109.
  16. Neupane, B.; Horanont, T.; Aryal, J. Real-Time Vehicle Classification and Tracking Using a Transfer Learning-Improved Deep Learning Network. Sensors 2022, 22, 3813.
  17. Ghasemi Darehnaei, Z.; Shokouhifar, M.; Yazdanjouei, H.; Rastegar Fatemi, S.M.J. SI-EDTL: Swarm intelligence ensemble deep transfer learning for multiple vehicle detection in UAV images. Concurr. Comput. Pract. Exp. 2022, 34, e6726.
  18. Narmadha, C.; Kavitha, T.; Poonguzhali, R.; Hamsadhwani, V.; Jegajothi, B. Robust Deep Transfer Learning Based Object Detection and Tracking Approach. Intell. Autom. Soft Comput. 2023, 35, 3613–3626.
  19. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Computer Vision—ECCV 2014; Springer: Cham, Switzerland, 2014; pp. 740–755.
  20. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
  21. Everingham, M.; Eslami, S.M.A.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes Challenge: A Retrospective. Int. J. Comput. Vis. 2015, 111, 98–136.
  22. Zhu, Z.; Liang, D.; Zhang, S.; Huang, X.; Li, B.; Hu, S. Traffic-Sign Detection and Classification in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
  23. Cheng, G.; Yuan, X.; Yao, X.; Yan, K.; Zeng, Q.; Han, J. Towards large-scale small object detection: Survey and benchmarks. arXiv 2022, arXiv:2207.14096.
  24. Yu, X.; Gong, Y.; Jiang, N.; Ye, Q.; Han, Z. Scale Match for Tiny Person Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020.
  25. Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944.
  26. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. CARAFE: Content-aware reassembly of features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3007–3016.
  27. Chen, K.; Cao, Y.; Loy, C.C.; Lin, D.; Feichtenhofer, C. Feature pyramid grids. arXiv 2020, arXiv:2004.03580.
  28. Zhang, H.; Wang, K.; Tian, Y.; Gou, C.; Wang, F.Y. MFR-CNN: Incorporating Multi-Scale Features and Global Information for Traffic Object Detection. IEEE Trans. Veh. Technol. 2018, 67, 8019–8030.
  29. Tong, K.; Wu, Y.; Zhou, F. Recent advances in small object detection based on deep learning: A review. Image Vis. Comput. 2020, 97, 103910.
  30. Liu, Y.; Sun, P.; Wergeles, N.; Shang, Y. A survey and performance evaluation of deep learning methods for small object detection. Expert Syst. Appl. 2021, 172, 114602.
  31. Dai, X.; Chen, Y.; Xiao, B.; Chen, D.; Liu, M.; Yuan, L.; Zhang, L. Dynamic Head: Unifying Object Detection Heads with Attentions. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 7369–7378.
  32. Huang, J.; Shi, Y.; Gao, Y. Multi-Scale Faster-RCNN Algorithm for Small Object Detection. J. Comput. Res. Dev. 2019, 56, 319–327.
  33. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
  34. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768.
  35. Qi, G.; Zhang, Y.; Wang, K.; Mazur, N.; Liu, Y.; Malaviya, D. Small Object Detection Method Based on Adaptive Spatial Parallel Convolution and Fast Multi-Scale Fusion. Remote Sens. 2022, 14, 420.
  36. Shi, T.; Gong, J.; Hu, J.; Zhi, X.; Zhang, W.; Zhang, Y.; Zhang, P.; Bao, G. Feature-Enhanced CenterNet for Small Object Detection in Remote Sensing Images. Remote Sens. 2022, 14, 5488.
  37. Qu, J.; Tang, Z.; Zhang, L.; Zhang, Y.; Zhang, Z. Remote Sensing Small Object Detection Network Based on Attention Mechanism and Multi-Scale Feature Fusion. Remote Sens. 2023, 15, 2728.
  38. Zhang, J.; Xu, D.; Li, Y.; Zhao, L.; Su, R. FusionPillars: A 3D Object Detection Network with Cross-Fusion and Self-Fusion. Remote Sens. 2023, 15, 2692.
  39. Cai, Z.; Fan, Q.; Feris, R.S.; Vasconcelos, N. A Unified Multi-scale Deep Convolutional Neural Network for Fast Object Detection. In Computer Vision—ECCV 2016; Springer: Cham, Switzerland, 2016; pp. 354–370.
  40. Bosquet, B.; Mucientes, M.; Brea, V.M. STDnet: Exploiting high resolution feature maps for small object detection. Eng. Appl. Artif. Intell. 2020, 91, 103615.
  41. Wu, B.; Shen, Y.; Guo, S.; Chen, J.; Sun, L.; Li, H.; Ao, Y. High Quality Object Detection for Multiresolution Remote Sensing Imagery Using Cascaded Multi-Stage Detectors. Remote Sens. 2022, 14, 2091.
  42. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525.
  43. Wang, X.; Huang, T.; Gonzalez, J.; Darrell, T.; Yu, F. Frustratingly Simple Few-Shot Object Detection. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 9919–9928.
  44. Liang, G.; Zheng, L. A transfer learning method with deep residual network for pediatric pneumonia diagnosis. Comput. Methods Programs Biomed. 2020, 187, 104964.
  45. Wang, X.; Shen, C.; Xia, M.; Wang, D.; Zhu, J.; Zhu, Z. Multi-scale deep intra-class transfer learning for bearing fault diagnosis. Reliab. Eng. Syst. Saf. 2020, 202, 107050.
  46. Loey, M.; Manogaran, G.; Taha, M.H.N.; Khalifa, N.E.M. A hybrid deep transfer learning model with machine learning methods for face mask detection in the era of the COVID-19 pandemic. Measurement 2021, 167, 108288.
  47. Tang, Y.P.; Wei, X.S.; Zhao, B.; Huang, S.J. QBox: Partial Transfer Learning with Active Querying for Object Detection. IEEE Trans. Neural Netw. Learn. Syst. 2021, 1–13.
  48. Sun, B.; Li, B.; Cai, S.; Yuan, Y.; Zhang, C. FSCE: Few-Shot Object Detection via Contrastive Proposal Encoding. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 7348–7358.
  49. Zhu, C.; Chen, F.; Ahmed, U.; Shen, Z.; Savvides, M. Semantic Relation Reasoning for Shot-Stable Few-Shot Object Detection. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 8778–8787.
  50. Kaul, P.; Xie, W.; Zisserman, A. Label, Verify, Correct: A Simple Few Shot Object Detection Method. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 14217–14227.
  51. Yan, D.; Zhang, H.; Li, G.; Li, X.; Lei, H.; Lu, K.; Zhang, L.; Zhu, F. Improved Method to Detect the Tailings Ponds from Multispectral Remote Sensing Images Based on Faster R-CNN and Transfer Learning. Remote Sens. 2022, 14, 103.
  52. Finn, C.; Abbeel, P.; Levine, S. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1126–1135.
  53. Deng, C.; Wang, M.; Liu, L.; Liu, Y.; Jiang, Y. Extended Feature Pyramid Network for Small Object Detection. IEEE Trans. Multimed. 2022, 24, 1968–1979.
  54. Xu, F.; Wang, H.; Peng, J.; Fu, X. Scale-aware feature pyramid architecture for marine object detection. Neural Comput. Appl. 2021, 33, 3637–3653.
  55. Peng, F.; Miao, Z.; Li, F.; Li, Z. S-FPN: A shortcut feature pyramid network for sea cucumber detection in underwater images. Expert Syst. Appl. 2021, 182, 115306.
  56. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
  57. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30.
  58. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 213–229.
  59. Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open MMLab Detection Toolbox and Benchmark. arXiv 2019, arXiv:1906.07155.
  60. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
  61. Yang, Z.; Liu, S.; Hu, H.; Wang, L.; Lin, S. RepPoints: Point set representation for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9657–9666.
  62. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021.
  63. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
Figure 1. Comparison of object detection performance of Faster R-CNN using different pretrained models on the TT100K-Lite dataset (10% proportion). w/o transfer: the standard training strategy that uses an ILSVRC-pretrained backbone for initialization. COCO: transferring from the model trained on the COCO dataset. Ours: COCO model adapted with our proposed dynamic resolution adaptation scheme.
Figure 2. Framework of our proposed dynamic resolution adaptation transfer (DRAT) learning. Conventional transfer learning methods directly use the model pretrained on a base dataset to fine-tune the target few-shot dataset (stage I and stage III). We propose a dynamic resolution adaptation (stage II) to adapt the pretrained model to a small-object detection task and improve transfer learning performance.
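A minimal PyTorch-style sketch of stage II is given below. The candidate resolutions, the torchvision-style detector interface, and the plain gradient update are illustrative assumptions rather than the exact adaptation procedure, which additionally relies on meta-learning to keep performance consistent across the sampled resolutions.

```python
import random
import torch
import torch.nn.functional as F

# Illustrative set of reduced training resolutions (height, width); the actual
# candidates used for dynamic resolution adaptation are a design choice.
RESOLUTIONS = [(800, 1333), (600, 1000), (480, 800), (320, 533)]

def resolution_adaptation_step(detector, images, targets, optimizer):
    """One stage-II step: resize a base-dataset batch to a randomly sampled smaller
    resolution so the pretrained detector also learns from small-object-like inputs."""
    _, _, orig_h, orig_w = images.shape
    new_h, new_w = random.choice(RESOLUTIONS)
    images = F.interpolate(images, size=(new_h, new_w), mode="bilinear", align_corners=False)
    sx, sy = new_w / orig_w, new_h / orig_h
    for t in targets:  # rescale ground-truth boxes (x1, y1, x2, y2) to the sampled resolution
        t["boxes"] = t["boxes"] * torch.tensor([sx, sy, sx, sy], device=t["boxes"].device)
    loss_dict = detector(list(images), targets)  # torchvision-style detectors return a loss dict in train mode
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The weights obtained after this adaptation stage then initialize the stage-III fine-tuning on the few-shot target dataset.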
Figure 3. Comparison of upsampling effects on images at different scales. We use a real detection image (top) and a simple curved line (bottom) for illustration. Pixels are heavily contaminated by neighboring salient pixels when the downsampling stride is large.
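This effect can be reproduced with a small interpolation experiment such as the sketch below; the synthetic curved line and the stride values are illustrative, and the growing reconstruction error reflects how neighboring salient pixels bleed into each other as the downsampling stride increases.

```python
import torch
import torch.nn.functional as F

def down_up(x: torch.Tensor, stride: int) -> torch.Tensor:
    """Downsample by `stride`, then upsample back to the original size."""
    h, w = x.shape[-2:]
    small = F.interpolate(x, size=(h // stride, w // stride), mode="bilinear", align_corners=False)
    return F.interpolate(small, size=(h, w), mode="bilinear", align_corners=False)

# A thin curved line on a dark background, similar in spirit to the bottom row of Figure 3.
img = torch.zeros(1, 1, 64, 64)
cols = torch.arange(64)
rows = (32 + 12 * torch.sin(cols / 8.0)).long().clamp(0, 63)
img[0, 0, rows, cols] = 1.0

for stride in (2, 4, 8, 16):
    err = (down_up(img, stride) - img).abs().mean().item()
    print(f"stride {stride:2d}: mean reconstruction error {err:.4f}")
```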
Figure 4. FPN with shifted feature aggregation.
Figure 5. Channel attention module.
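The module follows the squeeze-and-excitation design of [56]; the code below is a generic SE-style channel attention block, with the reduction ratio chosen as an illustrative default rather than the exact value used in our network.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention: global average pooling, bottleneck MLP, sigmoid gating."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: (B, C, H, W) -> (B, C, 1, 1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = self.fc(self.pool(x))  # excitation: per-channel weights in (0, 1)
        return x * weights               # reweight the input feature channels

# Example: reweight a 256-channel FPN feature map.
feat = torch.randn(2, 256, 64, 64)
print(ChannelAttention(256)(feat).shape)  # torch.Size([2, 256, 64, 64])
```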
Figure 6. The architecture of the anchor relation module.
Figure 7. TranSDet network structure.
Figure 8. Image samples in the BUUISE-MO-Lite dataset.
Figure 9. Comparison of different maximum resolutions in our dynamic resolution adaptation on TT100K-Lite. none: training on TT100K-Lite without transfer learning; ori.: transfer learning with the original COCO-pretrained weights.
Table 1. Distributions of small, medium, and large objects in popular object detection training datasets.

Category | Dataset | Small Objects | Medium Objects | Large Objects
General-object datasets | ILSVRC 2012 [20] | 1.64% | 11.54% | 86.81%
General-object datasets | VOC 2007 [21] | 11.20% | 34.52% | 54.28%
General-object datasets | VOC 2012 [21] | 9.54% | 27.59% | 62.89%
General-object datasets | COCO 2017 [19] | 31.13% | 34.90% | 33.97%
Small-object datasets | TT100K 2016 [22] | 41.28% | 51.66% | 7.06%
Small-object datasets | BUUISE-MO [11] | 44.09% | 33.79% | 22.12%
Small-object datasets | SODA-D [23] | 48.20% | 28.18% | 23.62%
Small-object datasets | Tiny Person [24] | 85.80% | 11.54% | 2.66%
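Assuming the standard COCO size thresholds (small: area < 32²; medium: 32² to 96²; large: > 96², in pixels), statistics of this kind can be reproduced from a COCO-format annotation file with a short script such as the sketch below; the annotation path is a placeholder, and the bounding-box area is used here as a simple proxy for the annotated object area.

```python
import json
from collections import Counter

def size_distribution(annotation_file: str) -> dict:
    """Percentage of small/medium/large boxes in a COCO-format annotation file."""
    with open(annotation_file) as f:
        coco = json.load(f)
    counts = Counter()
    for ann in coco["annotations"]:
        w, h = ann["bbox"][2], ann["bbox"][3]  # COCO bbox format: [x, y, width, height]
        area = w * h
        if area < 32 ** 2:
            counts["small"] += 1
        elif area < 96 ** 2:
            counts["medium"] += 1
        else:
            counts["large"] += 1
    total = sum(counts.values())
    return {k: 100.0 * v / total for k, v in counts.items()}

# Placeholder path: point this at a COCO-style annotation file, e.g., for TT100K.
print(size_distribution("annotations/instances_train.json"))
```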
Table 2. Object detection performance of the Faster R-CNN baseline on the TT100K-Lite dataset.

Proportion (%) | AP50 | APS | APM | APL
100 | 89.4 ± 0.90 | 75.4 ± 3.11 | 97.8 ± 0.78 | 91.7 ± 1.79
10 | 63.3 ± 0.53 | 48.2 ± 1.38 | 78.0 ± 1.22 | 79.4 ± 4.58
5 | 50.2 ± 0.98 | 40.4 ± 1.02 | 62.2 ± 1.35 | 59.1 ± 4.46
1 | 23.4 ± 1.44 | 18.5 ± 2.83 | 30.8 ± 0.94 | 43.3 ± 5.45
Table 3. Performance comparison of our object detection model and classical models on the TT100K-Lite dataset with varying proportions of training data.

Proportion (%) | Method | AP50 | APS | APM | APL
10 | FRCNN [1] | 63.3 ± 0.53 | 48.2 ± 1.38 | 78.0 ± 1.22 | 79.4 ± 4.58
10 | FRCNN-TranSDet | 71.3 ± 1.05 | 53.4 ± 1.59 | 85.0 ± 1.87 | 83.3 ± 3.97
10 | RetinaNet [2] | 33.7 ± 2.34 | 33.3 ± 4.99 | 42.5 ± 2.66 | 55.7 ± 8.48
10 | RetinaNet-TranSDet | 56.4 ± 0.98 | 45.9 ± 3.90 | 68.4 ± 1.89 | 68.2 ± 7.86
10 | YOLOv3 [33] | 24.9 ± 0.98 | 15.6 ± 2.83 | 32.6 ± 2.34 | 37.0 ± 4.11
10 | YOLOv3-TranSDet | 31.5 ± 0.32 | 19.6 ± 1.45 | 38.2 ± 2.05 | 41.7 ± 5.98
5 | FRCNN [1] | 50.2 ± 0.98 | 40.4 ± 1.02 | 62.2 ± 1.35 | 59.1 ± 4.46
5 | FRCNN-TranSDet | 61.6 ± 0.72 | 44.8 ± 1.50 | 74.1 ± 1.19 | 70.1 ± 2.44
5 | RetinaNet [2] | 17.1 ± 2.15 | 18.1 ± 1.72 | 21.5 ± 2.25 | 37.7 ± 8.56
5 | RetinaNet-TranSDet | 44.1 ± 2.10 | 37.8 ± 0.59 | 56.7 ± 1.69 | 52.6 ± 4.09
5 | YOLOv3 [33] | 14.6 ± 0.99 | 8.1 ± 1.90 | 19.9 ± 0.55 | 30.9 ± 4.36
5 | YOLOv3-TranSDet | 22.0 ± 2.14 | 10.9 ± 5.71 | 30.5 ± 1.16 | 34.4 ± 3.20
1 | FRCNN [1] | 23.4 ± 1.44 | 18.5 ± 2.83 | 30.8 ± 0.94 | 43.3 ± 5.45
1 | FRCNN-TranSDet | 30.9 ± 1.16 | 26.4 ± 1.35 | 38.7 ± 1.42 | 46.4 ± 4.28
1 | RetinaNet [2] | 2.2 ± 0.35 | 1.8 ± 0.45 | 3.6 ± 0.55 | 16.2 ± 4.08
1 | RetinaNet-TranSDet | 17.8 ± 0.89 | 17.6 ± 1.23 | 24.6 ± 2.32 | 37.3 ± 2.90
1 | YOLOv3 [33] | 3.8 ± 0.14 | 1.9 ± 0.35 | 5.1 ± 1.06 | 16.5 ± 2.40
1 | YOLOv3-TranSDet | 9.2 ± 0.92 | 3.9 ± 2.76 | 14.7 ± 0.97 | 18.6 ± 3.88
Table 4. Performance comparison of our object detection model and advanced models on the TT100K-Lite dataset with varying proportions of training data.

Proportion (%) | Method | AP50 | APS | APM | APL
10 | RetinaNet (Swin) [60] | 36.4 ± 1.73 | 32.3 ± 2.81 | 46.9 ± 1.24 | 55.1 ± 3.32
10 | RetinaNet-TranSDet (Swin) | 61.4 ± 1.46 | 50.2 ± 2.59 | 73.9 ± 1.92 | 67.2 ± 4.81
10 | RepPoints [61] | 42.3 ± 2.13 | 36.9 ± 3.17 | 50.5 ± 1.62 | 56.0 ± 2.91
10 | RepPoints-TranSDet | 68.8 ± 1.53 | 51.3 ± 1.16 | 78.6 ± 1.63 | 69.2 ± 3.54
10 | Def. DETR [62] | 27.8 ± 2.81 | 13.7 ± 3.19 | 38.3 ± 1.96 | 50.4 ± 5.32
10 | Def. DETR-TranSDet | 54.8 ± 1.35 | 32.6 ± 1.63 | 61.6 ± 1.26 | 63.2 ± 4.63
5 | RetinaNet (Swin) [60] | 22.4 ± 1.65 | 19.2 ± 1.92 | 29.5 ± 1.57 | 47.8 ± 5.21
5 | RetinaNet-TranSDet (Swin) | 46.9 ± 1.42 | 40.5 ± 2.51 | 58.1 ± 1.52 | 55.7 ± 4.21
5 | RepPoints [61] | 31.0 ± 1.43 | 27.4 ± 2.71 | 38.7 ± 1.12 | 45.7 ± 4.83
5 | RepPoints-TranSDet | 51.7 ± 2.16 | 40.1 ± 1.74 | 63.7 ± 1.76 | 58.2 ± 2.95
5 | Def. DETR [62] | 15.2 ± 1.84 | 14.9 ± 1.05 | 25.7 ± 1.46 | 35.8 ± 3.69
5 | Def. DETR-TranSDet | 43.7 ± 1.61 | 35.3 ± 1.31 | 56.9 ± 1.66 | 51.5 ± 5.03
1 | RetinaNet (Swin) [60] | 2.30 ± 0.52 | 1.70 ± 0.38 | 3.30 ± 1.15 | 18.1 ± 3.64
1 | RetinaNet-TranSDet (Swin) | 19.5 ± 1.73 | 18.9 ± 1.56 | 28.2 ± 1.27 | 42.3 ± 4.29
1 | RepPoints [61] | 3.30 ± 1.13 | 3.20 ± 0.94 | 4.51 ± 1.75 | 14.1 ± 2.89
1 | RepPoints-TranSDet | 19.5 ± 2.04 | 18.2 ± 1.53 | 26.7 ± 1.39 | 40.2 ± 3.15
1 | Def. DETR [62] | 1.7 ± 0.51 | 1.8 ± 0.37 | 3.28 ± 1.04 | 11.2 ± 2.46
1 | Def. DETR-TranSDet | 16.2 ± 1.76 | 15.8 ± 1.53 | 24.1 ± 1.64 | 36.6 ± 2.78
Table 5. Object detection performance on the proposed BUUISE-MO-Lite few-shot small-object detection dataset.

Method | AP50 | APS | APM | APL
FRCNN | 57.6 ± 1.07 | 51.8 ± 0.55 | 66.6 ± 2.32 | 64.2 ± 1.88
FRCNN-TranSDet | 61.6 ± 4.07 | 51.7 ± 1.96 | 68.2 ± 0.61 | 72.0 ± 7.76
RetinaNet | 32.8 ± 1.53 | 28.4 ± 3.38 | 44.2 ± 1.26 | 46.7 ± 0.35
RetinaNet-TranSDet | 65.0 ± 3.16 | 52.7 ± 2.25 | 65.9 ± 2.90 | 77.1 ± 2.87
YOLOv3 | 40.2 ± 4.19 | 16.1 ± 2.70 | 49.6 ± 7.28 | 55.4 ± 7.03
YOLOv3-TranSDet | 53.0 ± 2.72 | 28.8 ± 6.84 | 59.1 ± 3.99 | 72.0 ± 4.45
Table 6. Object detection performance on the COCO dataset.

Method | AP | AP50 | AP75 | APS | APM | APL
FRCNN | 37.4 | 58.1 | 40.4 | 21.2 | 41.0 | 48.1
+ SFA-FPN | 37.8 (+0.4) | 58.3 (+0.2) | 40.8 (+0.4) | 21.8 (+0.6) | 41.3 (+0.3) | 48.2 (+0.1)
+ SFA-FPN + AR | 38.0 (+0.6) | 58.4 (+0.3) | 41.0 (+0.6) | 22.1 (+0.9) | 41.4 (+0.4) | 48.4 (+0.3)
Table 7. Comparison of transfer learning methods on the Faster R-CNN model and the TT100K-Lite dataset.

Proportion (%) | Method | AP50 | APS | APM | APL
10 | PTD [63] | 62.1 ± 1.79 | 46.6 ± 1.38 | 74.7 ± 1.79 | 86.2 ± 3.95
10 | FTD [43] | 55.9 ± 0.71 | 42.9 ± 1.42 | 69.5 ± 1.25 | 72.5 ± 7.04
10 | TranSDet | 71.3 ± 1.05 | 53.4 ± 1.59 | 85.0 ± 1.87 | 83.3 ± 3.97
5 | PTD [63] | 50.1 ± 0.75 | 38.4 ± 2.16 | 61.4 ± 1.49 | 61.9 ± 3.70
5 | FTD [43] | 45.7 ± 1.26 | 35.1 ± 3.16 | 56.7 ± 5.63 | 58.7 ± 3.67
5 | TranSDet | 61.6 ± 0.72 | 44.8 ± 1.50 | 74.1 ± 1.19 | 70.1 ± 2.44
1 | PTD [63] | 27.1 ± 2.30 | 22.1 ± 2.58 | 34.2 ± 1.46 | 48.7 ± 4.18
1 | FTD [43] | 26.2 ± 1.12 | 21.5 ± 3.20 | 32.8 ± 1.36 | 41.7 ± 2.71
1 | TranSDet | 30.9 ± 1.16 | 26.4 ± 1.35 | 38.7 ± 1.42 | 46.4 ± 4.28
Table 8. Ablation on the proposed modules on the TT100K-Lite dataset (5% proportion). We used Faster R-CNN as the baseline model. DRA: dynamic resolution adaptation. SFA-FPN: FPN with shifted feature aggregation. AR: anchor relation module.

DRA | SFA-FPN | AR | AP50 | APS
× | × | × | 50.2 ± 0.98 | 40.4 ± 1.02
✓ | × | × | 55.9 ± 1.20 | 44.3 ± 1.49
× | ✓ | × | 56.4 ± 1.81 | 43.5 ± 0.23
× | × | ✓ | 59.2 ± 1.10 | 44.9 ± 2.15
✓ | ✓ | × | 59.2 ± 2.05 | 43.8 ± 2.44
✓ | ✓ | ✓ | 61.6 ± 0.72 | 44.8 ± 1.50
Table 9. Comparison with previous small-object detection methods on the TT100K-Lite dataset (10% proportion).

Method | AP50 | APS | APM | APL
FRCNN | 67.0 ± 1.15 | 47.4 ± 3.07 | 80.3 ± 2.24 | 81.0 ± 1.39
Carafe [26] | 66.5 ± 1.27 | 51.2 ± 1.16 | 78.4 ± 1.14 | 80.4 ± 5.34
FPG [27] | 35.1 ± 1.61 | 4.3 ± 2.04 | 50.0 ± 2.52 | 62.4 ± 1.61
TranSDet (FRCNN as baseline) | 71.3 ± 1.05 | 53.4 ± 1.59 | 85.0 ± 1.87 | 83.3 ± 3.97
Table 10. Comparison of transferring models trained on different datasets to the BUUISE-MO-Lite small-object dataset.

Pretrained Dataset | AP50 | APS | APM | APL
ILSVRC | 59.4 ± 0.95 | 51.3 ± 1.07 | 70.5 ± 3.48 | 69.5 ± 3.87
TT100K | 52.8 ± 0.49 | 48.1 ± 0.40 | 60.5 ± 1.33 | 59.3 ± 3.91
COCO | 62.4 ± 1.45 | 51.0 ± 1.50 | 66.5 ± 1.66 | 71.7 ± 2.04
COCO-adapted (Ours) | 65.0 ± 3.16 | 52.7 ± 2.25 | 65.9 ± 2.90 | 77.1 ± 2.87
Table 11. Model complexity and inference speed on the TT100K-Lite dataset. Inference speed was measured with PyTorch on a single NVIDIA TITAN Xp GPU.

Method | FLOPs (GFLOPs) | Params (M) | Inference Speed (images/s)
FRCNN | 206.89 | 41.35 | 5.8
FRCNN-TranSDet | 267.70 | 48.71 | 5.3
RetinaNet | 223.83 | 37.02 | 6.5
RetinaNet-TranSDet | 237.06 | 39.10 | 6.0
YOLOv3 | 194.79 | 61.76 | 7.3
YOLOv3-TranSDet | 202.47 | 63.00 | 7.1
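A minimal sketch of how the parameter counts and images-per-second throughput can be measured in PyTorch is given below; the torchvision Faster R-CNN model, the dummy input size, and the iteration counts are illustrative assumptions rather than the exact benchmarking setup, and the FLOPs figures would additionally require a profiling tool.

```python
import time
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# torchvision >= 0.13 API; an untrained model suffices for speed measurement.
model = fasterrcnn_resnet50_fpn(weights=None).eval().to(device)

# Parameter count in millions.
print(f"Params: {sum(p.numel() for p in model.parameters()) / 1e6:.2f} M")

# Throughput on a fixed-size dummy input (batch size 1).
dummy = [torch.randn(3, 800, 1333, device=device)]
n_warmup, n_iters = 5, 20
with torch.no_grad():
    for _ in range(n_warmup):      # warm-up iterations excluded from timing
        model(dummy)
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        model(dummy)
    if device.type == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
print(f"Inference speed: {n_iters / elapsed:.1f} images/second")
```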
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
