1. Introduction
Object detection, a foundational task in computer vision, has applications spanning sectors like security, military, transportation, and healthcare [1,2,3,4]. Advances in deep learning have ushered in innovative algorithms and enhanced optimization methods, propelling object detection forward. However, these advancements heavily rely on extensive training datasets. Insufficient data can culminate in overfitting and hinder model generalization [5,6].
The development of remote sensing technology has heightened interest in its object detection capabilities [7]. However, securing comprehensive data for intricate remote sensing scenarios and rare objects presents challenges. Gathering ample samples is time-intensive and manpower-reliant, with both sample quantity and quality critically influencing model performance [8,9].
Few-shot learning emerges as a promising solution to counteract the issues of diminished accuracy and generalization stemming from limited data. In few-shot object detection, data categories are typically divided into base classes and novel classes. Base classes are the non-few-shot categories, for which abundant training data are available. In contrast, novel classes are the few-shot categories, for which only a limited amount of training data is available. These novel classes are the primary focus of few-shot object detection, as they pose a significant challenge due to the scarcity of training examples. Few-shot object detection strives to harness minimal data from novel classes, enhancing detection capabilities for these classes [10]. Presently, the majority of few-shot object detection studies concentrate on natural scene datasets, such as VOC2007 [11], VOC2012 [12], and MS-COCO [13], with substantial successes recorded [14,15,16,17]. However, in the remote sensing domain, the adaptation of few-shot detection is still nascent. Remote sensing images typically exhibit intricate backgrounds [18], significant variations in target dimensions [19], substantial intraclass variance, and comparatively minor interclass variance [20]. The stylistic disparities between remote sensing and natural scene data mean that algorithms efficient in the latter might falter in the former. As illustrated in Figure 1, the top row displays detection outcomes from the Two-stage Fine-tuning Approach (TFA) [15], which is efficient in natural scenes, when applied to remote sensing datasets. In contrast, the bottom row presents results from the method introduced in this study on the identical test set. This visual comparison underscores the substantial performance variance of a single algorithm across different dataset types. Hence, it is imperative to explore few-shot object detection algorithms tailored to the intricacies of remote sensing images.
The two-stage detector Faster RCNN [21] serves as a foundational framework for few-shot object detection. A myriad of contemporary algorithms rooted in Faster RCNN consistently push the boundaries of detection accuracy [22,23,24]. While Faster RCNN employs a two-stage process for object classification and regression, securing high precision, its speed often lags behind real-world requirements. The introduction of You Only Look Once (YOLO) [25] marked a significant shift. In contrast to its two-stage counterparts, YOLO simultaneously outputs detected classes and their positions, offering a substantial speed boost. Its successors, namely YOLOv5 [26], YOLOX [27], and YOLOv6 [28], have adeptly straddled the balance between speed and accuracy. However, few-shot object detection remains underexplored for these one-stage detectors. Addressing this gap and harnessing one-stage detectors for few-shot challenges emerges as a pressing research frontier.
This study centers on few-shot object detection in remote sensing data, employing a one-stage detector to harmonize precision with real-time operational needs. Leveraging YOLOv5s as the foundational architecture, transfer learning facilitates the transfer of insights from base classes to novel ones. To bolster model generalization amid acute data paucity, a Segmentation Assistance (SA) module is introduced. This module harnesses an object's binary mask images as supervisory labels and adopts binary cross-entropy as the loss function. This approach enables the model to precisely localize foreground targets amidst intricate backgrounds, steering the detection network's attention toward the foreground region. To mitigate the performance degradation of base classes during fine-tuning, we propose a novel detection head termed the Triplet Head (Tri-Head). Building upon the concept of the decoupled detection head, our approach incorporates three distinct heads aimed at acquiring varied knowledge. The first head encompasses all categories, both base and novel, and serves as the final detection output. The second head is dedicated to acquiring comprehensive base-class knowledge, thereby aiding the first head in revising its detection results for the base classes. The third head focuses solely on learning novel-class knowledge, thereby bolstering the first head's ability to detect novel classes. Specifically, we adopt knowledge distillation to achieve this objective, utilizing a dual knowledge distillation mechanism to effectively bolster the detection capability of the output head across all categories.
Additionally, we introduce a refined classification loss function that dynamically adjusts the weights of various loss terms based on the predicted results. This approach enhances the model’s emphasis on challenging samples during training in order to address the difficulty of classification in remote sensing images.
The main contributions of this article can be summarized as follows:
- (1) We propose a new few-shot object detection framework for remote sensing images based on a one-stage detector with a two-stage training strategy, which simultaneously ensures detection accuracy and speed.
- (2) We propose a Segmentation Assistance module that enhances attention to foreground regions, without increasing inference delay, by using binary mask maps of foreground targets as additional supervision labels.
- (3) We propose a Triplet Head, a detection head that incorporates a dual distillation mechanism to strengthen the model's capacity to detect base classes while preserving its ability to detect novel classes, thereby addressing the challenge of forgetting base-class knowledge during fine-tuning.
- (4) We optimize the classification loss function to adaptively adjust the weights of different samples based on the predicted results during training, enabling difficult samples to receive more attention.
3. Proposed Method
3.1. Preliminary Knowledge
3.1.1. Problem Definition
Based on previous studies [23], the original dataset $D$ is partitioned into two distinct subsets: the base-class subset $D_{base}$ and the novel-class subset $D_{novel}$. $D_{base}$ embodies categories abundant in training data, often termed non-few-shot categories. In contrast, $D_{novel}$ features categories identified as few-shot, marked by their limited training samples. These subsets are related as $D_{base} \cup D_{novel} = D$ and $D_{base} \cap D_{novel} = \emptyset$. Both base- and novel-class data are used for training, with accuracy assessments for each subset carried out during inference.
3.1.2. Baseline
The majority of previous works [23,24,45] selected Faster RCNN as their baseline. This preference can be attributed to the early exploration of few-shot object detection on natural scene datasets, where the constraints on sample size often hampered the performance of one-stage detectors like YOLO. As a result, two-stage detectors emerged as the predominant focus. When the research direction shifted toward remote sensing datasets, many methods still adhered to this convention and adopted Faster RCNN as the reference model. Nonetheless, studies [46,47] have indicated the superior performance of YOLO over Faster RCNN, specifically for remote sensing datasets. Additionally, in terms of real-time application, Faster RCNN lags in speed, rendering it unsuitable for practical deployments [48]. Given these considerations, this study selected the s-version of YOLOv5 as the baseline.
YOLOv5 primarily comprises three components: the backbone, neck, and head. The backbone employs CSP-Darknet53 for feature extraction; its foundational convolutional block, CBS, consists of a Convolution, a Batch Normalization, and a SiLU activation function. The neck utilizes PANet, an enhancement over the traditional FPN, facilitating comprehensive feature integration across levels, which effectively addresses the issue of scale variation in remote sensing images. Lastly, the head, a staple of the YOLO series, employs a coupled structure for prediction. While this design minimizes the parameter count and boosts inference speed, it does not mitigate the inherent conflict between the classification and regression tasks, potentially compromising accuracy. Therefore, we use a decoupled head instead of a coupled head to enhance detection capabilities.
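For reference, the following is a minimal PyTorch sketch of the CBS block as described above (Convolution, Batch Normalization, SiLU); the class name, default arguments, and padding rule are illustrative rather than YOLOv5's exact source:

```python
import torch.nn as nn

class CBS(nn.Module):
    """Conv + BatchNorm + SiLU, the basic block of CSP-Darknet53
    described above. Defaults here are illustrative."""
    def __init__(self, c_in, c_out, kernel_size=3, stride=1):
        super().__init__()
        padding = kernel_size // 2  # "same" padding for odd kernels
        self.conv = nn.Conv2d(c_in, c_out, kernel_size, stride, padding, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```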
3.2. Overview Framework
As illustrated in Figure 2, the proposed model's comprehensive framework highlights three primary areas of advancement. First, a Segmentation Assistance module is introduced, designed to amplify the network's focus on foreground subjects. Second, a Triplet Head tailored for few-shot object detection networks is developed, which employs a dual distillation mechanism to mitigate the forgetting of base-class knowledge. Lastly, the classification loss function is refined to steer the network's emphasis toward challenging samples that are particularly difficult to classify.
3.3. Segmentation Assistance (SA)
Inspired by SuperYOLO's use of super-resolution to assist object detection, we introduce segmentation as a means to enhance few-shot object detection.
For a detector to demonstrate robust generalization capabilities, substantial training data are indispensable. However, with limited training data, the risk of model overfitting on the few-shot training set rises, undermining its accuracy on test data.
To address this, the proposed SA module, depicted in Figure 3, seeks to bolster the model's focus on foreground entities. First, the labels for SA must be acquired, as exemplified in Figure 4. Leveraging the bounding box labels used for object detection, we derive segmentation label maps for the foreground objects, ensuring that the number of segmentation labels matches the number of detection labels exactly, which preserves the integrity and accuracy of the labeling process. Once training data traverse the backbone and neck, feature maps at three distinct scales emerge. Using convolutional blocks and up-sampling, the feature maps from a given layer are integrated with those from the subsequent layer, facilitating comprehensive fusion. The fused feature map then undergoes successive convolution and up-sampling, culminating in a binary prediction map of the foreground target produced by a convolution with a 3 × 3 kernel. The disparity between this predicted binary map and its corresponding label gives rise to a loss, denoted as $L_{SA}$, computed using the binary cross-entropy loss function:

$$L_{SA} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right]$$

where $y_i$ is the label of sample $i$, either 0 or 1, and $p_i$ is the predicted probability.
The generated binary map is class-agnostic: irrespective of target categories, there are only two pixel labels, with a value of 255 for the foreground target and 0 for the background.
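As a concrete illustration, the sketch below shows how such class-agnostic mask labels can be rasterized from bounding box annotations and supervised with binary cross-entropy; the function names and the single-channel output layout are assumptions for illustration, not the paper's actual implementation:

```python
import torch
import torch.nn.functional as F

def boxes_to_mask(boxes, height, width):
    """Rasterize (x1, y1, x2, y2) pixel-coordinate boxes into a
    class-agnostic binary mask: 1 for foreground, 0 for background."""
    mask = torch.zeros(height, width)
    for x1, y1, x2, y2 in boxes:
        mask[int(y1):int(y2), int(x1):int(x2)] = 1.0
    return mask

# Example: a 400 x 400 image (NWPUv2 size) with two labeled objects.
label = boxes_to_mask([(30, 40, 120, 150), (200, 210, 260, 300)], 400, 400)

# `pred_logits` stands in for the SA branch's single-channel prediction map.
pred_logits = torch.randn(1, 1, 400, 400)
loss_sa = F.binary_cross_entropy_with_logits(pred_logits, label[None, None])
```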
Compared with the classification task, the regression task is relatively simple. With the assistance of SA, the model can focus on the area where the target is located in a complex background, enabling the detection network to learn more detailed target features.
Importantly, the introduced SA module serves as an auxiliary training branch, operational solely during the training stage. For inference, this module is removed, ensuring no increase in inference latency.
3.4. Triplet Head (Tri-Head)
The dual detection head, introduced by Wolf et al. [49], represents a significant advancement in the field of few-shot object detection. This approach decouples the detection tasks for base and novel classes, allocating a distinct detection head to each: one head is dedicated to detecting base classes, while the other focuses solely on identifying novel classes. This decoupled mechanism effectively preserves the detection accuracy of novel classes while simultaneously enhancing the precision of base-class detection. However, a notable challenge arises from the substantial number of parameters this method introduces, particularly when using parameter-heavy detection heads such as the decoupled head.
The concept of knowledge distillation, originally introduced by Hinton et al. [50], leverages intricate teacher networks to supervise and guide the learning of simpler student networks: the teacher's outputs are used to guide the student's outputs, enabling the student to learn more effectively. To alleviate the problem of forgetting base-class knowledge, we propose the Tri-Head with a dual-path distillation method, as shown in Figure 5.
On one hand, the detection head designed for recognizing base classes, denoted $H_{base}$, serves as a teacher, guiding the final output detection head $H_{all}$ to mitigate the forgetting of knowledge related to base classes. On the other hand, the detection head responsible for detecting novel classes, denoted $H_{novel}$, also acts as a teacher, instructing $H_{all}$ to enhance detection performance for the novel classes. The reason for using a dual distillation mechanism is that, during the distillation of $H_{base}$ onto $H_{all}$, although the forgetting of base-class knowledge is alleviated and the base-class detection ability of $H_{all}$ improves, the introduced base-class knowledge affects the novel classes, leading to a decrease in their detection performance. To this end, we introduce a second distillation path, using $H_{novel}$ to distill $H_{all}$, so as to alleviate the forgetting of base-class knowledge while maintaining the detection performance of novel classes. In this way, the two teacher detection heads $H_{base}$ and $H_{novel}$ effectively balance their guidance of $H_{all}$.
Each knowledge distillation loss function comprises three distinct components: a classification loss, a regression loss, and a confidence loss, all of which employ the mean squared error (MSE) loss function:

$$L_{MSE}(T, S) = \frac{1}{N}\sum_{i=1}^{N}\left(T_i - S_i\right)^2$$

where $N$ is the number of samples and $T_i$ and $S_i$ are the predicted results of the teacher head and the student head, respectively.
The two distillation loss functions are as follows:

$$L_{dist}^{base} = L_{MSE}\left(reg^{H_{base}}, reg^{H_{all}}\right) + L_{MSE}\left(cls_{base}^{H_{base}}, cls_{base}^{H_{all}}\right) + L_{MSE}\left(obj^{H_{base}}, obj^{H_{all}}\right)$$

$$L_{dist}^{novel} = L_{MSE}\left(reg^{H_{novel}}, reg^{H_{all}}\right) + L_{MSE}\left(cls_{novel}^{H_{novel}}, cls_{novel}^{H_{all}}\right) + L_{MSE}\left(obj^{H_{novel}}, obj^{H_{all}}\right)$$

where $reg$, $cls$, and $obj$ denote the regression, classification, and confidence results, respectively, and $cls_{base}$ and $cls_{novel}$ denote the classification results for the base classes and the novel classes, respectively.
Therefore, the final distillation loss function is as follows:

$$L_{dist} = L_{dist}^{base} + L_{dist}^{novel}$$
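A compact sketch of this dual-path distillation follows. The dictionary layout of head outputs ('reg', 'cls', 'obj') and the column-selection scheme are assumptions for illustration; the teachers are detached so that gradients flow only into the student head $H_{all}$:

```python
import torch.nn.functional as F

def distill_path(student, teacher, cls_cols):
    """One distillation path (MSE over regression, classification, and
    confidence outputs). `student` and `teacher` are dicts of raw head
    outputs; `cls_cols` indexes the student's class columns that this
    teacher is responsible for (base or novel classes)."""
    l_reg = F.mse_loss(student["reg"], teacher["reg"].detach())
    l_cls = F.mse_loss(student["cls"][..., cls_cols], teacher["cls"].detach())
    l_obj = F.mse_loss(student["obj"], teacher["obj"].detach())
    return l_reg + l_cls + l_obj

def tri_head_distill_loss(out_all, out_base, out_novel, base_cols, novel_cols):
    # L_dist = L_dist^base + L_dist^novel
    return (distill_path(out_all, out_base, base_cols)
            + distill_path(out_all, out_novel, novel_cols))
```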
3.5. Enhanced Classification Loss Function ($L_{cls}$)
The challenge of classification is particularly pronounced in few-shot object detection. Utilizing the conventional cross-entropy loss function often proves insufficient for few-shot classification due to the limited training samples, which hinders the development of a highly discriminative classifier. In scenarios where features of two categories are closely aligned, misclassifications can occur, such as mistaking a basketball court for a tennis court or misidentifying the background as the foreground.
An optimized version of the traditional cross-entropy classification loss function is introduced, termed $L_{cls}$:

$$L_{cls} = -\left[y\,\alpha^{(1-p)}\log(p) + (1-y)\,\beta^{p}\log(1-p)\right]$$

where $y$ represents the label, $p$ represents the predicted probability, and $\alpha$ and $\beta$ are adjustable factors.
The adjustable parameters $\alpha$ and $\beta$ are typically set to values greater than 1. Given that the predicted probability $p$ falls within the interval [0, 1], both coefficients $\alpha^{(1-p)}$ and $\beta^{p}$ exceed 1, leading to an augmented loss. When $y = 1$, indicating a positive sample, a lower predicted probability implies a higher likelihood of misclassification as a negative sample; this results in a larger loss, as deduced from the first term of $L_{cls}$, whereas a high predicted probability diminishes the loss. Conversely, when $y = 0$, indicating a negative sample, a higher predicted probability indicates a probable misclassification as a positive sample; this enlarges the loss, via the second term of $L_{cls}$, whereas a lower predicted probability yields a reduced loss.
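The following is a minimal sketch of this loss under the formula reconstructed above; the default values of alpha and beta are placeholders, not the paper's tuned settings:

```python
import torch

def enhanced_cls_loss(p, y, alpha=2.0, beta=2.0, eps=1e-7):
    """Cross-entropy terms re-weighted by alpha^(1-p) and beta^p so that
    confidently wrong predictions incur a larger penalty. alpha/beta
    defaults are illustrative assumptions."""
    p = p.clamp(eps, 1.0 - eps)  # avoid log(0)
    pos = y * (alpha ** (1.0 - p)) * torch.log(p)
    neg = (1.0 - y) * (beta ** p) * torch.log(1.0 - p)
    return -(pos + neg).mean()

# A hard positive (y=1, p=0.1) is penalized more than an easy one (p=0.9):
y = torch.tensor([1.0, 1.0])
p = torch.tensor([0.1, 0.9])
loss = enhanced_cls_loss(p, y)
```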
The confidence loss evaluates the presence of a target within a bounding box and is essentially a classification problem, so the same form of loss function as for classification is employed, denoted $L_{obj}$. For the regression task, the regression loss $L_{reg}$ adopted from the original YOLOv5 is utilized.
The total loss function is represented as follows:

$$L_{total} = L_{reg} + L_{cls} + L_{obj} + L_{SA} + L_{dist}$$
3.6. Training Strategy
Similar to prior few-shot object detection approaches utilizing transfer learning, a two-stage training strategy is employed.
The first stage, termed base training, uses all of the base-class data $D_{base}$.
In the second, fine-tuning stage, a balanced training dataset $D_{ft}$ is formed from a subset of the base-class data $D_{base}$ and all of the novel-class data $D_{novel}$. Weights derived from base training are loaded, the backbone network is frozen, and $H_{all}$ and the neck are fine-tuned on $D_{ft}$. After several epochs of training, the backbone, neck, and $H_{all}$ are frozen, and $H_{base}$ and $H_{novel}$ are fine-tuned on the base-class and novel-class data, respectively. Finally, we freeze the backbone, neck, $H_{base}$, and $H_{novel}$, fine-tuning only $H_{all}$ on $D_{ft}$; concurrently, the distillation process is initiated.
After the base training phase, the neck responds well to the base classes. However, if it is frozen during the subsequent fine-tuning stage, its ability to adapt to novel classes is hindered, leading to substandard performance on novel classes. Alternatively, in our Tri-Head architecture, if the neck remains unfrozen, the losses incurred by $H_{base}$ and $H_{novel}$ can disrupt its parameters, preventing it from achieving an optimal combination with $H_{all}$. To address this, we employ the fine-tuning strategy described above: initially, the neck is adapted to align with the parameters of $H_{all}$; subsequently, the neck parameters are fixed and $H_{base}$ and $H_{novel}$ are fine-tuned to harmonize with the fixed neck, creating a synergistic combination that concentrates on guiding the training of $H_{all}$.
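The staged freezing schedule can be expressed as in the sketch below; the submodule names (backbone, neck, head_all, head_base, head_novel) are illustrative assumptions about the model's structure, not the actual implementation:

```python
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

def apply_finetune_stage(model: nn.Module, stage: str) -> None:
    """Staged freezing schedule sketched from the strategy above."""
    set_trainable(model, False)      # start with everything frozen
    if stage == "neck_and_all":      # stage 2a: tune neck + H_all on D_ft
        set_trainable(model.neck, True)
        set_trainable(model.head_all, True)
    elif stage == "teachers":        # stage 2b: tune H_base and H_novel
        set_trainable(model.head_base, True)
        set_trainable(model.head_novel, True)
    elif stage == "distill":         # stage 2c: tune only H_all, distillation on
        set_trainable(model.head_all, True)
```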
It is worth noting that, given the issue of Incompletely Annotated Novel Objects mentioned by Zhang et al. [24], and in order to maintain the detection performance of base classes, we did not restrict the number of training samples for base classes. Instead, we utilized all base-class samples and the few-shot novel-class samples in the training images. As shown in Figure 6, under the 20-shot setting of the DIOR dataset, the number of base-class samples may be higher or lower than 20, but the maximum number of novel-class samples is 20. The first 15 classes in the figure belong to the base classes, while the last 5 belong to the novel classes.
3.7. Datasets and Evaluation Metrics
The NWPUv2 dataset [51] encompasses 1172 images, each measuring 400 × 400 pixels, distributed across ten categories: airplane, baseball diamond, basketball court, bridge, ground track field, harbor, ship, storage tank, tennis court, and vehicle. Adhering to the methodology of prior studies [23], airplane, baseball diamond, and tennis court are designated as novel classes, with the other seven as base classes. The sample counts for the novel classes are set at 3, 5, 10, and 20. Training involves both the training and validation subsets, while evaluation is conducted on the testing subset.
The DIOR dataset [52] is comprehensive, containing 23,463 images: 5862 for training, 5863 for validation, and 11,738 for testing. Each image, sized 800 × 800 pixels, spans 20 diverse categories: airplane, airport, baseball diamond, basketball court, bridge, chimney, expressway service area, expressway toll station, dam, golf course, ground track field, harbor, overpass, ship, stadium, storage tank, tennis court, train station, vehicle, and windmill. Adopting the categorizations of previous studies [23], four distinct dataset splits were created. In Split1, the novel classes are baseball diamond, basketball court, bridge, chimney, and ship. Split2 designates airplane, airport, expressway toll station, harbor, and ground track field as novel. Dam, golf course, storage tank, tennis court, and vehicle are the novel categories of Split3, and Split4 has expressway service area, overpass, stadium, train station, and windmill as its novel classes. In each split, the remaining 15 categories serve as base classes. Novel classes are represented with sample sizes of 3, 5, 10, and 20. Training uses the training and validation subsets, while the testing subset is reserved for evaluation.
Consistent with Zhang et al. [23], the chosen evaluation metric is the mean Average Precision (mAP), calculated with an IoU threshold of 0.5. The AP and mAP values are calculated as follows:

$$AP = \int_{0}^{1} P(R)\, dR, \qquad mAP = \frac{1}{C}\sum_{c=1}^{C} AP_c$$

where $P$ is precision, $R$ is recall, and $C$ is the number of classes.
Other evaluation indicators include FPS, FLOPs, and Params. FPS is the number of images processed per second; FLOPs is the number of floating-point operations; and Params is the number of parameters required by the model [53].
3.8. Implementation Details
The approach is implemented in PyTorch 1.9.1 and trained on an NVIDIA 2080 Ti GPU. During the base training phase, the learning rate is initialized at 0.01 and concludes at 0.001, adjusted by the OneCycleLR strategy across 80 epochs on NWPUv2 and 300 epochs on DIOR. In the fine-tuning phase, training begins with weights derived from base training; the learning rate starts at 0.001, tapering to 0.00001, and is modified using the OneCycleLR technique over 20,000 epochs on NWPUv2 and 10,000 epochs on DIOR. A consistent batch size of 8 is maintained in both phases.
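As a sketch, the fine-tuning schedule can be set up with PyTorch's built-in OneCycleLR as below; div_factor and final_div_factor are chosen here so the schedule ends near 1e-5 and are assumptions rather than reported hyperparameters:

```python
import torch

model = torch.nn.Linear(10, 2)  # stand-in for the detector
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-3,           # peak fine-tuning learning rate
    div_factor=10,         # warm-up starts at max_lr / 10
    final_div_factor=10,   # anneals to max_lr / (10 * 10) = 1e-5
    epochs=10000,          # fine-tuning epochs on DIOR
    steps_per_epoch=1,     # assumes one scheduler step per epoch
)
for _ in range(10000):
    optimizer.step()       # training step elided
    scheduler.step()
```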
5. Discussion
In Table 1 and Table 2, we compare our method with several algorithms. On the relatively easy NWPUv2 dataset, our approach outperforms the state-of-the-art G-FSDet by 17.26% under the 3-shot setting and 10.96% under the 20-shot setting. On the more challenging DIOR dataset, compared to G-FSDet on Split1, our method achieves an improvement of 16.36% under the 3-shot scenario and 14.75% under the 20-shot scenario. Compared to the latest ST-FSOD algorithm, our method demonstrates an enhancement of 3.96% under the 3-shot setting and 4.53% under the 20-shot setting. Notably, the improvement is more pronounced with fewer training samples, and the rate of improvement diminishes as the training data volume increases. We conclude that as the training data expand to a fully sufficient level, the disparity will become negligible and the gains from the few-shot learning strategy will vanish.
In Table 3, we conduct a comparative experiment on detection speed. Our one-stage detector exhibits significantly faster detection, more than three times faster than two-stage detector-based methods such as TFA and G-FSDet. The choice of a one-stage detector as our baseline stems from the observation that current research seldom considers detection speed, whereas real-time performance is often crucial in practical applications. Our findings reveal that the integration of YOLOv5 with transfer learning can achieve promising results on remote sensing datasets, underscoring the need for greater attention to few-shot object detection research based on one-stage detectors, which offer superior real-time performance.
Table 4 and Table 5 present the performance analysis of the individual modules on the two datasets. The inclusion of the SA module enhances overall performance by 4.9% and 1.3% on the NWPUv2 and DIOR datasets, respectively, with notable improvements for the novel classes. This enhancement is attributed to the SA module's ability to help the network focus on target objects amid complex remote sensing backgrounds, minimizing background feature interference, which is particularly valuable with limited training samples. The addition of the Tri-Head structure boosts base-class performance by 0.95% and 1.67% on NWPUv2 and DIOR, respectively, while maintaining or even improving novel-class performance. The two teacher heads in the Tri-Head concurrently guide the learning of the output head, ensuring a balance between the base and novel classes during fine-tuning and thereby avoiding bias toward one class at the expense of the other. G-FSDet employs parallel branches to learn from the base and novel classes separately, which can be effective but increases the parameter count. Incorporating the improved loss function further improves overall performance by 1.75% and 1.01% on NWPUv2 and DIOR, respectively, demonstrating that optimizing the classification loss function alone can yield certain gains. Nevertheless, misclassifications remain prevalent. G-FSDet utilizes metric learning to improve classification performance, but a large number of misclassifications are still inevitable, as shown in Figure 11, indicating that the classification task is indeed a major challenge in few-shot object detection.
During the experiments, we observed that the number of training epochs significantly impacts the final results, with both insufficient and excessive training adversely affecting performance, as also noted by the authors of ST-FSOD. Insufficient training leads to poor novel-class performance, while excessive training may cause overfitting. Additionally, due to the scarcity of training data, experimental results exhibit high variability, making it challenging to discern whether performance differences stem from the proposed methods or from inherent fluctuations. For now, we mitigate this uncertainty by running multiple experiments; this issue should also be taken into account in future research.
6. Conclusions
This research addressed the prevailing challenges of few-shot object detection in remote sensing, specifically low detection accuracy and real-time performance considerations. Using the one-stage YOLOv5 detector as a foundation, the study incorporated transfer learning techniques to tackle the scarcity of data. The following enhancements were introduced: (1) a one-stage detector framework that balances both speed and accuracy was presented; (2) an SA module was designed using mask labels of foreground targets to aid training, amplifying focus on primary targets without additional inference-time overhead; (3) the Tri-Head, a dual distillation structure, reduced the forgetting of base-class knowledge and alleviated the degradation of novel-class detection caused by the introduction of base-class knowledge distillation; (4) a refined classification loss function was instituted to bolster classification efficacy. The experimental results confirm that the proposed methodology significantly elevates detection accuracy for novel classes while preserving base-class knowledge. Concurrently, detection speed is enhanced, with the methodology exhibiting commendable outcomes on both datasets.
Future pursuits will be directed toward refining detection accuracy for diminutive targets and boosting the detector’s classification prowess, aiming to counteract false positives and misclassifications arising from target overlaps and smaller objects.