Article

Few-Shot Object Detection for Remote Sensing Imagery Using Segmentation Assistance and Triplet Head

by Jing Zhang 1,2,3,4,*, Zhaolong Hong 1, Xu Chen 1 and Yunsong Li 2,3

1 Hangzhou Institute of Technology, Xidian University, Hangzhou 311231, China
2 State Key Laboratory of Integrated Service Network, Xidian University, Xi’an 710071, China
3 School of Telecommunication Engineering, Xidian University, Xi’an 710071, China
4 Guangzhou Institute of Technology, Xidian University, Guangzhou 510700, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(19), 3630; https://doi.org/10.3390/rs16193630
Submission received: 29 July 2024 / Revised: 27 August 2024 / Accepted: 23 September 2024 / Published: 29 September 2024

Abstract

The emergence of few-shot object detection provides a new approach to address the challenge of poor generalization ability due to data scarcity. Currently, extensive research has been conducted on few-shot object detection in natural scene datasets, and notable progress has been made. However, in the realm of remote sensing, this technology is still lagging behind. Furthermore, many established methods rely on two-stage detectors, prioritizing accuracy over speed, which hinders real-time applications. Considering both detection accuracy and speed, in this paper, we propose a simple few-shot object detection method based on the one-stage detector YOLOv5 with transfer learning. First, we propose a Segmentation Assistance (SA) module to guide the network’s attention toward foreground targets. This module assists in training and enhances detection accuracy without increasing inference time. Second, we design a novel detection head called the Triplet Head (Tri-Head), which employs a dual distillation mechanism to mitigate the issue of forgetting base-class knowledge. Finally, we optimize the classification loss function to emphasize challenging samples. Evaluations on the NWPUv2 and DIOR datasets showcase the method’s superiority.

1. Introduction

Object detection, a foundational task in computer vision, has applications spanning sectors like security, military, transportation, and healthcare [1,2,3,4]. Advances in deep learning have ushered in innovative algorithms and enhanced optimization methods, propelling object detection forward. However, these advancements heavily rely on extensive training datasets. Insufficient data can culminate in overfitting and hinder model generalization [5,6].
The development of remote sensing technology has heightened interest in its object detection capabilities [7]. However, securing comprehensive data for intricate remote sensing scenarios and rare objects presents challenges. Gathering ample samples is time-intensive and manpower-reliant, with both sample quantity and quality critically influencing model performance [8,9].
Few-shot learning emerges as a promising solution to counteract the issues of diminished accuracy and generalization stemming from limited data. In the context of few-shot object detection, data categories are typically divided into base classes and novel classes. Base classes represent non-few-shot categories, which have abundant training data available. In contrast, novel classes are the few-shot categories, where only a limited amount of training data is available. These novel classes are the primary focus of few-shot object detection, as they pose a significant challenge due to the scarcity of training examples. Few-shot object detection strives to harness minimal data from novel classes, enhancing detection capabilities for these classes [10]. Presently, the majority of few-shot object detection studies concentrate on natural scene datasets, such as VOC2007 [11], VOC2012 [12], and MS-COCO [13], with substantial successes recorded [14,15,16,17]. However, in the remote sensing domain, the adaptation of few-shot detection is still nascent. Remote sensing images typically exhibit intricate backgrounds [18], significant variations in target dimensions [19], substantial intraclass variance, and comparatively minor interclass variance [20]. These stylistic disparities between remote sensing and natural scene data mean that algorithms efficient in the latter might falter in the former. As illustrated in Figure 1, the top row displays detection outcomes from the Two-stage Fine-tuning Approach (TFA) [15], which is efficient in natural scenes, when applied to remote sensing datasets. In contrast, the bottom row presents results from the method introduced in this study for the identical test set. This visual comparison underscores the substantial performance variance of a single algorithm across different dataset types. Hence, it is imperative to investigate few-shot object detection algorithms tailored to the intricacies of remote sensing images.
The two-stage detector, Faster RCNN [21], serves as a foundational framework for few-shot object detection. A myriad of contemporary algorithms, rooted in Faster RCNN, consistently push the boundaries of detection accuracy [22,23,24]. While Faster RCNN employs a two-stage process for object classification and regression, securing high precision, its speed often lags behind real-world requirements. The introduction of You Only Look Once (YOLO) [25] marked a significant shift. In contrast to the two-stage counterparts, YOLO simultaneously outputs detected classes and their positions, offering a substantial speed boost. Its successors, namely YOLOv5 [26], YOLOX [27], and YOLOv6 [28], have adeptly straddled the balance between speed and accuracy. However, the domain of few-shot object detection remains underexplored for these one-stage detectors. Addressing this gap and harnessing one-stage detectors for few-shot challenges emerges as a pressing research frontier.
This study centers on few-shot object detection within remote sensing data, employing a one-stage detector to harmonize precision with real-time operational needs. Leveraging YOLOv5s as the foundational architecture, transfer learning facilitates the transfer of insights from base classes to novel ones. To bolster model generalization amid acute data paucity, a Segmentation Assistance (SA) module is introduced. This module harnesses an object’s binary mask images as supervisory labels and adopts binary cross-entropy as the loss function. This approach enables the model to precisely localize foreground targets amidst intricate backgrounds, thus guiding the object detection network to prioritize its attention toward the foreground region. To mitigate the issue of performance degradation in base classes during fine-tuning, we propose a novel detection head termed the Triplet Head (Tri-Head). Building upon the foundational concept of the decoupled detection head, our approach incorporates three distinct heads aimed at acquiring varied knowledge. The first head is designed to encompass all categories, including both base and novel classes, serving as the ultimate detection output. The second head is dedicated to acquiring comprehensive base-class knowledge, thereby aiding the first head in revising the detection results of the base classes. Lastly, the third head focuses solely on learning novel-class knowledge, thereby bolstering the first head’s ability to detect novel classes. Specifically, we adopt the technique of knowledge distillation to achieve this objective, utilizing a dual knowledge distillation mechanism to effectively bolster the detection capability of the detection head across all categories.
Additionally, we introduce a refined classification loss function that dynamically adjusts the weights of various loss terms based on the predicted results. This approach enhances the model’s emphasis on challenging samples during training in order to address the difficulty of classification in remote sensing images.
The main contributions of this article can be summarized as follows:
(1)
We propose a new few-shot object detection framework for remote sensing images based on a one-stage detector with a two-stage training strategy, which simultaneously ensures detection accuracy and speed.
(2)
We propose a Segmentation Assistance module that enhances attention to foreground regions without increasing inference delay by using binary mask maps of foreground targets as additional supervised labels.
(3)
We propose a Triplet Head, a detection head that incorporates a dual distillation mechanism to strengthen the model’s capacity to detect base classes while preserving its ability to detect novel classes, thereby addressing the challenge of forgetting base-class knowledge during fine-tuning.
(4)
We optimize the classification loss function to adaptively adjust the weights of different samples based on the predicted results during the training process, enabling difficult samples to receive more attention.

2. Related Work

2.1. General Object Detection

Object detectors typically fall into two categories: two-stage detectors and one-stage detectors. Traditionally, two-stage detectors have prioritized accuracy, albeit at the cost of speed. In contrast, one-stage detectors, with their swift detection capabilities, have sometimes sacrificed accuracy. However, recent technological advancements have facilitated one-stage detectors to adeptly balance both attributes, positioning them as the contemporary norm.
Faster RCNN [21] operationalizes a two-step detection process. Initially, it generates region proposals via Region Proposal Network (RPN) and subsequently classifies and pinpoints these proposals. Owing to the multitude of candidate boxes generated by RPN, Faster RCNN’s accuracy remains robust. To circumvent the parameter-intensive nature of using fully connected layers for classification and regression, Dai et al. [29] introduced R-FCN, a fully convolutional network, which dramatically curtails parameter counts. Traditional algorithms, which employ fixed Intersection over Union (IoU) thresholds to discern between positive and negative samples, encounter challenges: exceedingly high thresholds result in scarce qualifying positive samples, while exceedingly low ones yield an overabundance of them, compromising quality. Cascade RCNN [30] addresses this by employing cascaded detection heads with varied IoU thresholds, optimizing detection outcomes.
YOLO [25], a seminal one-stage detector, initiates detection by partitioning an image into uniformly sized grids. Each grid anticipates several bounding boxes, detailing object-centric parameters such as center coordinates, dimensions, confidence scores, and category. Subsequently, the Non-Maximum Suppression (NMS) algorithm is deployed to retain the most confident bounding boxes. With the introduction of YOLOv3 [31], Feature Pyramid Network (FPN) [32] was integrated, enabling predictions across three diverse scales, enhancing the detection of variably sized objects. Addressing the imbalance between positive and negative sample detections, RetinaNet [33] introduced the Focal Loss mechanism. This approach accentuates challenging samples within the loss function, directing focus toward them. Both YOLOX [27] and YOLOv6 [28] adopted the decoupled head in lieu of the traditional coupled version, effectively resolving tensions between classification and regression tasks. Recognizing the perennial challenge posed by small objects, SuperYOLO [34] harnesses super-resolution techniques to bolster detection accuracy for diminutive subjects in remote sensing imagery. This augmentation markedly improves detection fidelity for such objects without prolonging inference durations.

2.2. Few-Shot Object Detection

Given the considerable challenges associated with collecting extensive training samples, especially for rare objects, few-shot object detection has emerged as a promising solution. Current strategies in this domain predominantly pivot around meta-learning and transfer learning techniques [35,36].
Meta-learning, at its core, seeks to understand the mechanics of task generalization. A trailblazing contribution in this space is Meta-YOLO [14], which deploys a reweighting module. This module transforms support set features into vectors, subsequently reweighting the meta-features to accentuate salient attributes. AFD-Net [37] maintains that classification and regression are distinct subtasks. The former subtask primarily focuses on providing rough locations of targets through classification, while the latter aims to estimate precise target states through refined bounding boxes. Consequently, an adaptive full-duplex network is proposed to decouple regression and classification in the processes of feature representation, model reweighting, and state estimation.
Transfer learning stands out as an intuitive and potent strategy for few-shot object detection, leveraging pretrained weights from base classes to enhance the model’s adaptability to novel classes. TFA [15], notable for its simplicity and effectiveness, trains normally during the base training phase and freezes all but the last layer during the fine-tuning phase. This fine-tuning process relies on a balanced dataset of novel and base classes, yielding commendable outcomes. The Retentive R-CNN [38] model incorporates a consistency loss, mitigating the erasure of base-class knowledge. Bi-path YOLO [39] marks a significant exploration into real-time few-shot object detection, utilizing YOLO as its foundation and incorporating transfer learning. While it excels in real-time processing, it lags behind two-stage methods in terms of accuracy. FSSP [40] and positive sample enhancement [41] represent further advancements within the YOLO framework for few-shot object detection. These methods are grounded on YOLOv3 and YOLOv4 [42], respectively, and employ a Siamese neural network design. The positive sample enhancement technique is deployed in the reinforcement branch, with the noteworthy feature of sharing weights between this branch and the primary trunk during the training phase.

2.3. Few-Shot Object Detection in Remote Sensing

Comparatively few studies have addressed few-shot object detection in the field of remote sensing, leaving considerable room for improvement. As in natural scenes, mainstream methods can be divided into meta-learning and transfer learning.
FSODM [22], building upon YOLOv3, leverages the same weighted-feature concept as Meta-YOLO and specializes in remote sensing images. TINet [43] introduces FPN and the Transformation-Invariant Network to tackle the issues of scale variation and directional change in remote sensing images. Zhang et al. [44] proposed a self-adaptive global similarity module that preserves internal context information and computes similarity mappings between objects in the support and query images. Additionally, they introduced a two-way foreground stimulator module that applies the similarity maps to the detailed embeddings of both the support and query images, fully utilizing support information to further enhance foreground objects and weaken unrelated samples.
CIR-FSD [45] introduced a context information refinement module, refining the inherent features and capitalizing on transfer learning, particularly tailored for remote sensing imagery in few-shot object detection. G-FSDet [23] presented an enhanced transfer learning framework tailored for remote sensing datasets. Drawing inspiration from TFA, it introduced a representation compensation module. Its detection head features a parallel design, freezing the base-class branch during fine-tuning and solely adjusting the novel-class branch, which minimizes knowledge attrition from the base classes. Zhang et al. [24] first discussed and addressed the issue of Incompletely Annotated Novel Objects in the remote sensing domain, which arises when input images contain multiple novel-class objects but only a subset of them is annotated. Unlabeled objects are treated as background during the training process. They integrate a self-training mechanism into the fine-tuning process, aiming to discover unlabeled novel-class objects and focus on them during training.

3. Proposed Method

3.1. Preliminary Knowledge

3.1.1. Problem Definition

Following previous studies [23], the original dataset $D$ is partitioned into two distinct subsets: the base-class subset $D_{base}$ and the novel-class subset $D_{novel}$. $D_{base}$ contains categories abundant in training data, often termed non-few-shot categories. In contrast, $D_{novel}$ contains the categories identified as few-shot, marked by their limited training samples. These subsets satisfy $D = D_{base} \cup D_{novel}$ and $D_{base} \cap D_{novel} = \emptyset$. Both base- and novel-class data are used for training, with accuracy assessed separately for each subset during inference.
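For concreteness, the sketch below illustrates this partition together with K-shot sampling of novel-class annotations. The annotation format, the split_annotations helper, and the hard-coded NWPUv2 novel classes are illustrative assumptions, not the authors' code.

```python
# Minimal sketch: partition a detection dataset into base and novel classes and
# draw a K-shot subset of novel-class annotations.
import random
from collections import defaultdict

NOVEL_CLASSES = {"airplane", "baseball diamond", "tennis court"}  # NWPUv2 split used in the paper

def split_annotations(annotations, k_shot, seed=0):
    """annotations: list of dicts like {"image": str, "category": str, "bbox": [x, y, w, h]}."""
    by_class = defaultdict(list)
    for ann in annotations:
        by_class[ann["category"]].append(ann)

    d_base = [a for c, anns in by_class.items() if c not in NOVEL_CLASSES for a in anns]
    rng = random.Random(seed)
    d_novel = []
    for c in NOVEL_CLASSES:                       # keep at most K annotated objects per novel class
        anns = by_class.get(c, [])
        d_novel.extend(rng.sample(anns, min(k_shot, len(anns))))
    return d_base, d_novel                        # D = D_base ∪ D_novel, D_base ∩ D_novel = ∅
```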

3.1.2. Baseline

The majority of previous works [23,24,45] selected Faster RCNN as their baseline. This preference can be attributed to the early exploration of few-shot object detection on natural scene datasets, where the constraints on sample size often hampered the performance of one-stage detectors like YOLO. As a result, two-stage detectors emerged as the predominant focus. When the research direction shifted toward remote sensing datasets, many methods still adhered to the convention and adopted Faster RCNN as the reference model. Nonetheless, studies [46,47] have indicated the superior performance of YOLO over Faster RCNN, specifically for remote sensing datasets. Additionally, in terms of real-time application, Faster RCNN lags in speed, rendering it unsuitable for practical deployments [48]. Given these considerations, this study selected YOLOv5’s s-version as the preferred baseline.
YOLOv5 primarily comprises three components: the backbone, neck, and head. The backbone employs CSP-Darknet53 for feature extraction, with its foundational convolutional block, CBS, consisting of a convolution, batch normalization, and a SiLU activation function. The neck utilizes PANet, an enhancement over the traditional FPN, facilitating comprehensive feature integration across different levels. This can effectively address the issue of scale variation in remote sensing images. Lastly, the head, a staple of the YOLO series, employs a coupled structure for prediction. While this design minimizes parameter count and boosts inference speed, it does not mitigate the inherent conflict between classification and regression tasks, potentially compromising accuracy. Therefore, we used a decoupled head instead of a coupled head to enhance detection capabilities.
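The sketch below shows the two building blocks mentioned above in PyTorch: a CBS block (convolution, batch normalization, SiLU) and a decoupled head with separate classification and regression/objectness branches. The channel widths and anchor counts are illustrative assumptions rather than the exact YOLOv5s configuration.

```python
# Minimal sketch of a CBS block and a decoupled detection head.
import torch
import torch.nn as nn

class CBS(nn.Module):
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class DecoupledHead(nn.Module):
    """Separate branches for classification and box/objectness prediction."""
    def __init__(self, c_in, num_classes, num_anchors=3):
        super().__init__()
        self.stem = CBS(c_in, c_in, k=1)
        self.cls_branch = nn.Sequential(CBS(c_in, c_in), nn.Conv2d(c_in, num_anchors * num_classes, 1))
        self.reg_branch = nn.Sequential(CBS(c_in, c_in), nn.Conv2d(c_in, num_anchors * 5, 1))  # 4 box + 1 obj

    def forward(self, x):
        x = self.stem(x)
        return self.cls_branch(x), self.reg_branch(x)

# e.g., feats = torch.randn(1, 128, 80, 80); cls, reg = DecoupledHead(128, 20)(feats)
```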

3.2. Overview Framework

As illustrated in Figure 2, the proposed model’s comprehensive framework highlights three primary areas of advancement. First, a Segmentation Assistance module is introduced, designed to amplify the network’s focus on foreground subjects. Additionally, a Triplet Head tailored for few-shot object detection networks is developed, which employs a dual distillation mechanism to mitigate the issue of forgetting base-class knowledge. Lastly, the classification loss function is refined to steer the network’s emphasis toward those challenging samples that are particularly difficult to classify.

3.3. Segmentation Assistance (SA)

We were inspired by SuperYOLO’s use of super resolution for object detection. This led us to introduce the concept of segmentation as a means to enhance few-shot object detection.
For a detector to demonstrate robust generalization capabilities, substantial training data are indispensable. However, with limited training data, the risk of model overfitting on the few-shot training set rises, undermining its accuracy on test data.
To address this, the proposed SA, depicted in Figure 3, seeks to bolster the model’s focus on foreground entities. First, the labels pertinent to SA must be acquired, as exemplified in Figure 4. By leveraging the bounding box labels used in object detection, we derive segmentation label maps specifically targeting foreground objects. Crucially, during this process, we ensure that the number of segmentation labels aligns exactly with the number of detection labels, preserving the integrity and accuracy of the labeling. Once training data traverse the backbone and neck, feature maps at three distinct scales emerge. By leveraging convolutional blocks and up-sampling, the feature maps from a given layer are integrated with those from the subsequent layer, facilitating a comprehensive fusion. Following this fusion, the unified feature map undergoes successive convolution and up-sampling operations. The culmination of this process is a binary prediction map of the foreground target, produced by a convolution with a 3 × 3 kernel. The disparity between this predicted binary map and its corresponding label gives rise to a loss, denoted as $L_{SA}$. This loss is computed using the binary cross-entropy loss function, as detailed in the following equation:

$$L_{SA} = -Y \log(P) - (1 - Y) \log(1 - P)$$

where $Y$ is the sample label, either 0 or 1, and $P$ is the predicted probability.
The generated binary map is class-agnostic, signifying that irrespective of target categories, there are only two pixel labels: pixel 255 for the foreground target and pixel 0 for the background.
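A minimal sketch of how such a class-agnostic mask label can be derived from the detection bounding boxes follows; the helper name and box format are illustrative.

```python
# Illustrative sketch: every pixel inside any bounding box becomes foreground (255),
# the rest stays background (0), regardless of category.
import numpy as np

def boxes_to_binary_mask(boxes_xyxy, height, width):
    """boxes_xyxy: iterable of (x1, y1, x2, y2) in pixel coordinates."""
    mask = np.zeros((height, width), dtype=np.uint8)
    for x1, y1, x2, y2 in boxes_xyxy:
        x1, y1 = max(0, int(x1)), max(0, int(y1))
        x2, y2 = min(width, int(round(x2))), min(height, int(round(y2)))
        mask[y1:y2, x1:x2] = 255          # foreground target
    return mask                            # background remains 0
```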
Compared with the classification task, the regression task is relatively simple. With the assistance of SA, the model can focus on the area where the target is located in a complex background, enabling the detection network to learn more detailed target features.
Importantly, the introduced SA module serves as an auxiliary training branch, operational solely during the training stage. For the inference stage, this module is removed, ensuring no augmentation in inference latency.
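The following is a minimal PyTorch sketch of an SA-style auxiliary branch under our reading of Figure 3; the channel counts, fusion order, and up-sampling factor are assumptions, and the branch is simply dropped at inference time as described above.

```python
# Sketch of an auxiliary segmentation branch: 1x1 convolutions reduce the three neck
# feature maps, coarser maps are up-sampled and added to finer ones, and a final 3x3
# convolution predicts a one-channel foreground map trained with binary cross-entropy.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentationAssistance(nn.Module):
    def __init__(self, channels=(128, 256, 512)):
        super().__init__()
        self.reduce = nn.ModuleList(nn.Conv2d(c, 64, 1) for c in channels)
        self.predict = nn.Conv2d(64, 1, 3, padding=1)     # 3x3 kernel -> binary map logits

    def forward(self, feats):                             # feats: [P3, P4, P5], fine to coarse
        fused = self.reduce[-1](feats[-1])
        for f, red in zip(reversed(feats[:-1]), reversed(self.reduce[:-1])):
            fused = F.interpolate(fused, size=f.shape[-2:], mode="nearest") + red(f)
        return self.predict(F.interpolate(fused, scale_factor=4, mode="nearest"))

def sa_loss(logits, mask_255):
    # mask_255: the {0, 255} label map, resized to the logits' resolution
    target = (mask_255 > 0).float()
    return F.binary_cross_entropy_with_logits(logits, target)
```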

3.4. Triplet Head (Tri-Head)

The dual detection head, introduced by Wolf et al. [49], represents a significant advancement in the field of few-shot object detection. This approach decouples the detection tasks for base and novel classes, allocating a distinct detection head to each of the two groups. Specifically, one head is dedicated to detecting base classes, while the other focuses solely on identifying novel classes. This decoupled mechanism effectively preserves the detection accuracy of novel classes while simultaneously enhancing the precision of base-class detection. However, a notable challenge arises from the substantial number of parameters inherent in this design, particularly when parameter-heavy detection heads, such as the decoupled head, are used.
The concept of knowledge distillation, originally introduced by Hinton et al. [50], leverages an intricate teacher network to supervise and guide the learning process of a simpler student network. The outputs of the trained teacher network are used to guide the outputs of the student network, enabling the student to learn more effectively. To alleviate the problem of forgetting base-class knowledge, we propose the Tri-Head and use a dual-path distillation method, as shown in Figure 5.
On one hand, the detection head $H_{base}$, specifically designed for recognizing base classes, serves as a teacher, guiding the final output detection head $H_{output}$ to mitigate the forgetting of knowledge related to base classes. On the other hand, the detection head $H_{novel}$, responsible for detecting novel classes, also acts as a teacher, instructing $H_{output}$ to enhance the detection performance for novel classes. The reason for using the dual distillation mechanism is that, although distilling $H_{base}$ into $H_{output}$ alleviates the forgetting of base-class knowledge and improves the base-class detection ability of $H_{output}$, the introduced base-class knowledge affects the novel classes, leading to a decrease in their detection performance. To this end, we introduce a second distillation path, using $H_{novel}$ to distill $H_{output}$, so as to alleviate the forgetting of base-class knowledge while maintaining the detection performance of the novel classes. In this way, the two teacher detection heads $H_{base}$ and $H_{novel}$ effectively balance their guidance of $H_{output}$.
Each knowledge distillation loss comprises three distinct components: classification loss, regression loss, and confidence loss. All three components employ the mean squared error (MSE) loss function:

$$MSE = \frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2$$

where $N$ is the number of samples and $x_i$ and $y_i$ are the predicted results of the teacher head and the student head, respectively.
The two distillation loss functions are as follows:

$$L_{Distillation-base} = MSE(H_{base}(box), H_{output}(box)) + MSE(H_{base}(cls), H_{output}(cls-base)) + MSE(H_{base}(obj), H_{output}(obj))$$

$$L_{Distillation-novel} = MSE(H_{novel}(box), H_{output}(box)) + MSE(H_{novel}(cls), H_{output}(cls-novel)) + MSE(H_{novel}(obj), H_{output}(obj))$$

where $H(box)$, $H(cls)$, and $H(obj)$ are the regression, classification, and confidence results, respectively, and $H(cls-base)$ and $H(cls-novel)$ are the classification results for the base classes and the novel classes, respectively.
Therefore, the final distillation loss function is as follows:

$$L_{Distillation} = L_{Distillation-base} + L_{Distillation-novel}$$
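A minimal sketch of these two distillation terms is given below, assuming each head returns a dictionary of raw box, objectness, and class predictions and that the output head's class channels can be sliced into base and novel columns; the dictionary layout and index arguments are assumptions rather than the authors' implementation.

```python
# Sketch of the dual distillation loss: MSE between each teacher head and the output head.
import torch.nn.functional as F

def dual_distillation_loss(h_base, h_novel, h_output, base_idx, novel_idx):
    """h_*: dicts with "box", "obj", "cls" tensors; *_idx: lists of class-channel indices.
    Teacher predictions are detached so gradients flow only into the output (student) head."""
    def distill(teacher, cls_idx):
        return (F.mse_loss(h_output["box"], teacher["box"].detach())
                + F.mse_loss(h_output["cls"][..., cls_idx], teacher["cls"].detach())
                + F.mse_loss(h_output["obj"], teacher["obj"].detach()))
    return distill(h_base, base_idx) + distill(h_novel, novel_idx)
```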

3.5. Enhanced Classification Loss Function ($L_{EnCls}$)

The challenge of classification is particularly pronounced in few-shot object detection. Utilizing the conventional cross-entropy loss function often proves insufficient for few-shot classification due to the limited training samples, which hinders the development of a highly discriminative classifier. In scenarios where features of two categories are closely aligned, misclassifications can occur, such as mistaking a basketball court for a tennis court or misidentifying the background as the foreground.
An optimized version of the traditional cross-entropy classification loss function is introduced and termed $L_{EnCls}$:

$$L_{EnCls} = -Y (2 - P)^{a} \log(P) - (1 - Y)(1 + P)^{b} \log(1 - P)$$

where $Y$ represents the label, $P$ represents the predicted probability, and $a$ and $b$ are adjustable factors.
The adjustable parameters $a$ and $b$ are set to positive values. Given that the predicted probability $P$ falls within the interval [0, 1], both coefficients $(2 - P)^{a}$ and $(1 + P)^{b}$ are at least 1, leading to an augmented loss. When $Y = 1$, indicating a positive sample, a lower predicted probability $P$ implies a higher likelihood of misclassification as a negative sample. This results in a larger loss, as deduced from the first term in $L_{EnCls}$. Conversely, when the predicted probability $P$ is high, the loss diminishes. When $Y = 0$, indicating a negative sample, a higher predicted probability $P$ suggests a probable misclassification as a positive sample. This makes the loss larger, as reflected in the second term of $L_{EnCls}$, whereas a lower predicted probability $P$ yields a reduced loss.
The confidence loss evaluates the presence of a target within a bounding box and is essentially a classification problem, so the same loss form is employed and denoted as $L_{EnObj}$. For the regression task, the regression loss of the original YOLOv5 is retained.
The total loss function is represented as follows:

$$L = L_{SA} + L_{EnCls} + L_{EnObj} + L_{Reg} + L_{Distillation}$$
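A minimal PyTorch sketch of $L_{EnCls}$ is given below, using the factors reported in Section 4.4.2 ($a = 0.5$, $b = 10$) as defaults; the clamping constant is an implementation detail we add to avoid $\log(0)$.

```python
# Sketch of the enhanced classification loss: binary cross-entropy terms re-weighted
# by (2 - P)^a and (1 + P)^b so that hard samples contribute more.
import torch

def enhanced_cls_loss(p, y, a=0.5, b=10.0, eps=1e-7):
    """p: predicted probabilities in [0, 1]; y: binary labels of the same shape."""
    p = p.clamp(eps, 1.0 - eps)
    pos = -y * (2.0 - p).pow(a) * torch.log(p)                  # positives: low P -> larger loss
    neg = -(1.0 - y) * (1.0 + p).pow(b) * torch.log(1.0 - p)    # negatives: high P -> larger loss
    return (pos + neg).mean()
```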

3.6. Training Strategy

Similar to prior few-shot object detection approaches utilizing transfer learning, a two-stage training strategy is employed.
The first stage, termed base training, uses all base-class data $D_{base}$.
In the second, fine-tuning stage, a balanced training dataset $D_{ft} = D_{novel} \cup D_{base-little}$ is formed from a small subset $D_{base-little}$ of the base-class data $D_{base}$ and all data from the novel classes $D_{novel}$. Starting from the weights obtained in base training, the backbone is frozen and $H_{output}$ and the neck are fine-tuned on $D_{ft}$. After several epochs of training, the backbone, neck, and $H_{output}$ are frozen, and $H_{base}$ and $H_{novel}$ are fine-tuned on $D_{base}$ and $D_{novel}$, respectively. Finally, the backbone, neck, and $H_{base}$ are frozen, and only $H_{novel}$ and $H_{output}$ are fine-tuned on $D_{ft}$; concurrently, the distillation process is initiated.
After the completion of the base training phase, the neck exhibits satisfactory responsiveness to base classes. However, if it is frozen during the subsequent fine-tuning stage, its ability to adapt to novel classes is hindered, leading to substandard performance on novel classes. Conversely, in our Tri-Head architecture, if the neck remains unfrozen throughout, the losses incurred by $H_{base}$ and $H_{novel}$ can disrupt its parameters, preventing it from achieving an optimal combination with $H_{output}$. To address this challenge, we employed the fine-tuning strategy described above. Initially, we adapt the neck to align with the parameters of $H_{output}$. Subsequently, we fix the neck parameters and fine-tune $H_{base}$ and $H_{novel}$ to harmonize with the fixed neck, creating a synergistic combination that concentrates on guiding the training of $H_{output}$.
It is worth noting that, given the issue of Incompletely Annotated Novel Objects mentioned by Zhang et al. [24], and in order to maintain the detection performance of base classes, we did not restrict the number of training samples for base classes. Instead, we utilized all base-class samples and few-shot novel-class samples in the training images. As shown in Figure 6, under the 20-shot setting of the DIOR dataset, the number of base-class samples may be higher or lower than 20, but the maximum number of novel-class samples is 20. The first 15 classes in the figure belong to base classes, while the last 5 classes belong to novel classes.
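A minimal sketch of the staged freezing schedule described above follows; the attribute names (model.backbone, model.neck, model.head_output, model.head_base, model.head_novel) are placeholders for the corresponding modules, not actual YOLOv5 attributes.

```python
# Sketch of the three fine-tuning stages via requires_grad toggling.
def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(model, stage):
    # freeze everything first, then enable the parts trained in this stage
    for m in (model.backbone, model.neck, model.head_output, model.head_base, model.head_novel):
        set_trainable(m, False)
    if stage == 1:      # fine-tune neck + H_output on D_ft
        set_trainable(model.neck, True); set_trainable(model.head_output, True)
    elif stage == 2:    # fine-tune H_base on D_base and H_novel on D_novel
        set_trainable(model.head_base, True); set_trainable(model.head_novel, True)
    elif stage == 3:    # fine-tune H_novel + H_output on D_ft, with distillation enabled
        set_trainable(model.head_novel, True); set_trainable(model.head_output, True)
```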

3.7. Datasets and Evaluation Metrics

The NWPUv2 dataset [51] encompasses 1172 images, each measuring 400 × 400 pixels, distributed across ten categories: airplane, baseball diamond, basketball court, bridge, ground track field, harbor, ship, storage tank, tennis court, and vehicle. Adhering to the methodology of prior studies [23], airplane, baseball diamond, and tennis court are designated as novel classes, and the other seven as base classes. The sample counts for these novel classes are set at 3, 5, 10, and 20. Training uses the training and validation subsets, while evaluation is conducted on the testing subset.
DIOR dataset [52] is comprehensive, containing 23,463 images distributed into 5862 for training, 5863 for validation and 11,738 for testing. Each image, sized at 800 × 800 pixels, spans 20 diverse categories, including airplane, airport, baseball diamond, basketball court, bridge, chimney, expressway service area, expressway toll station, dam, golf course, ground track field, harbor, overpass, ship, stadium, storage tank, tennis court, train station, vehicle, and windmill. Adopting the categorizations from previous studies [23], 4 distinct dataset splits were created. In Split1, novel classes include baseball diamond, basketball court, bridge, chimney, and ship. Split2 identifies airplane, airport, expressway toll station, harbor, and ground track field as novel. Dam, golf course, storage tank, tennis court, and vehicle are the novel categorizations for Split3, and Split4 has expressway service area, overpass, stadium, train station, and windmill as its novel classes. In each split scenario, the other 15 categories function as the base classes. Novel classes are represented with varying sample sizes, specifically 3, 5, 10, and 20. Training is conducted using the training and validation subsets, while the testing subset is reserved for evaluation purposes.
Consistent with Zhang et al. [23], the chosen evaluation metric is the mean Average Precision (mAP), calculated with an IoU threshold of 0.5. The $AP$ value is computed as

$$AP = \frac{1}{11} \sum_{R \in \{0, 0.1, \ldots, 1\}} P(R)$$
where P is precision and R is recall.
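As a reference, a small function computing this 11-point AP (using the usual interpolated form, in which the maximum precision at or beyond each recall level is taken) might look as follows.

```python
# Sketch of 11-point interpolated AP.
import numpy as np

def ap_11_point(recall, precision):
    """recall, precision: arrays ordered by descending detection confidence."""
    recall, precision = np.asarray(recall), np.asarray(precision)
    ap = 0.0
    for r in np.arange(0.0, 1.01, 0.1):        # recall levels 0, 0.1, ..., 1
        mask = recall >= r
        p = precision[mask].max() if mask.any() else 0.0
        ap += p / 11.0
    return ap
```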
Other evaluation indicators include FPS, FLOPs, and Params. FPS is the number of images processed per second; FLOPs refer to the number of floating-point operations, and Params represent the number of parameters required by the model [53].
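Params and FPS can be measured directly in PyTorch, as in the sketch below; FLOPs counting typically relies on an external profiler and is omitted here. The input size and iteration counts are illustrative.

```python
# Sketch of parameter counting and FPS measurement for a PyTorch model.
import time
import torch

def count_params(model):
    return sum(p.numel() for p in model.parameters())

@torch.no_grad()
def measure_fps(model, input_shape=(1, 3, 800, 800), n_iters=100, device="cuda"):
    model = model.eval().to(device)
    x = torch.randn(*input_shape, device=device)
    for _ in range(10):                      # warm-up iterations
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(n_iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return n_iters / (time.time() - t0)      # images processed per second
```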

3.8. Implementation Details

The approach is implemented in PyTorch 1.9.1, and training is performed on an NVIDIA 2080 Ti GPU. During the base training phase, the learning rate is initialized at 0.01 and decays to 0.001, adjusted via the OneCycleLR strategy across 80 epochs on NWPUv2 and 300 epochs on DIOR. In the fine-tuning phase, training begins with the weights derived from base training. The learning rate starts at 0.001 and tapers to 0.00001, again scheduled with OneCycleLR, over 20,000 epochs on NWPUv2 and 10,000 epochs on DIOR. A consistent batch size of 8 is used in both phases.
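For illustration, a fine-tuning schedule matching the reported learning-rate range could be configured as below; the optimizer choice, momentum value, and OneCycleLR division factors are assumptions, not the authors' exact configuration.

```python
# Sketch of a OneCycleLR schedule peaking at 1e-3 and ending near 1e-5.
import torch

def build_finetune_schedule(model, steps_per_epoch, epochs=10000):
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.937)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=1e-3, div_factor=10, final_div_factor=10,  # start 1e-4, peak 1e-3, end 1e-5
        steps_per_epoch=steps_per_epoch, epochs=epochs)
    return optimizer, scheduler
```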

4. Results

The performance was benchmarked against both classic and state-of-the-art algorithms on the NWPUv2 and DIOR datasets.

4.1. Results on the NWPUv2 Dataset

Table 1 presents a comparative analysis between the proposed approach and other methods on the NWPUv2 dataset. Traditional few-shot object detection methods such as TFA underperform, since TFA fine-tunes only the terminal fully connected layer during its refinement phase. While it fares decently on natural scene datasets like VOC2007 and VOC2012, its performance diminishes on remote sensing datasets. G-FSDet, employing Faster RCNN as its foundation, marks a considerable improvement over its counterparts, though its detection accuracy remains subpar. However, the introduced approach outperforms all other methods, setting a new performance benchmark.
Figure 7 presents the visual outcomes of the proposed approach, G-FSDet and YOLOv5, on the NWPUv2 dataset under the 3-shot setting. The first row shows the input images; the second row represents the ground truth; the third row displays the results from the original YOLOv5; the fourth row displays the results from G-FSDet, and the last row presents the results from our proposed method. Despite the limited data setting, the proposed method yields commendable results.
To further demonstrate the effectiveness of our method, we listed the detection results of each category in base classes and novel classes on the NWPUv2 dataset under the 3-shot setting, as shown in Figure 8 and Figure 9, respectively. We compared them with baseline YOLOv5. The confusion matrix depicted in Figure 10 serves as a visual representation of the probabilities associated with the correct and incorrect classifications of each category. The background class itself does not belong to a category in the dataset. In the confusion matrix, the background class is used to indicate how many targets are mistakenly detected as background (missed detections) or how much background is mistakenly detected as targets (false detections). Along the horizontal axis, the actual category is indicated, while the vertical axis signifies the predicted category. Upon examining the matrix, it becomes evident that within the base classes, the ‘bridge’ category exhibits a comparatively higher probability for misclassification. Similarly, among the novel classes, the ‘airplane’ and ‘tennis court’ categories also demonstrate a notable probability of being misclassified. We designated these samples, which pose a challenge in accurate classification, as difficult samples. As observed in Figure 8 and Figure 9, our proposed method demonstrated notable advancements in addressing the classification of difficult samples, exhibiting a distinct improvement over the baseline.

4.2. Results on the DIOR Dataset

Table 2 displays a comparison of the proposed approach with other established methods using the DIOR dataset. Notably, traditional algorithms such as Meta-YOLO and TFA lag in performance. Meanwhile, newer few-shot object detection techniques such as G-FSDet, which leverage transfer learning in remote sensing, offer a marked improvement over their classic counterparts. In most cases, the proposed technique outshines G-FSDet across both base- and novel-class detections. In comparison with the self-training method ST-FSOD, our approach consistently surpasses it in terms of detection accuracy for base classes. When it comes to detecting novel classes, our methodology exhibits a notable advantage in certain scenarios, though it trails slightly in others. This disparity could stem from the fact that ST-FSOD employs a semi-supervised learning paradigm, which incorporates additional unlabeled data into its training process, providing it with a broader and potentially more diverse set of information to leverage. The use of SA and the enhanced classification loss function bolsters the detection performance for novel classes. With the incorporation of Tri-Head, the approach exhibits superior base-class detection. In summation, this method not only enhances novel-class detection accuracy but also substantially mitigates the base-class knowledge-forgetting issue, setting a new benchmark in terms of performance.
Figure 11 presents the visual outcomes of the proposed approach, G-FSDet and YOLOv5, on the DIOR dataset under the 10-shot setting in Split1. The few-shot categories in Split1 are baseball diamond, basketball court, bridge, chimney, and ship. In the third row, examining the results of YOLOv5 reveals a significant number of false detections. In the fourth row, G-FSDet also experienced some erroneous detections. However, in the last row, our proposed method’s detection results demonstrate a substantial improvement in reducing these false detections.

4.3. Results of Detecting Speed on the DIOR Dataset

To evaluate the real-time performance of our proposed method, we conducted a comparative analysis with the established TFA algorithm and the recently updated G-FSDet algorithm, which are based on Faster RCNN. Furthermore, we also included the results of the original YOLOv5 in our comparative analysis. The comparative results are summarized in Table 3. These evaluations, summarized in the table, were uniformly conducted on an NVIDIA 2080 Ti GPU, and the reported metrics represent an average across multiple runs. Due to the scarcity of contemporary methods grounded in one-stage detectors, our speed comparison was necessarily limited to established two-stage detectors as well as the original YOLOv5 framework.
Ours-training denotes the model as trained, encompassing the SA module and the three detection heads of Tri-Head. In the inference stage, we employed the model designated as Ours-inference, which excludes the SA module and retains solely the final detection head of Tri-Head. Notably, the architecture of this inference model aligns precisely with the original YOLOv5 model, thereby ensuring that our approach remains identical to YOLOv5 in terms of parameter count, computational complexity, and inference latency. This underscores the efficiency of our methodology, as it effectively enhances detection performance without incurring any additional inference delay.

4.4. Ablation Studies

4.4.1. Ablation Study of Proposed Method

Ablation experiments were conducted on both the NWPUv2 and DIOR datasets to validate the efficacy of the introduced components. The results from the NWPUv2 dataset under the 3-shot setting are presented in Table 4, while the 10-shot results from the DIOR dataset are detailed in Table 5.
In the initial row of Table 4 and Table 5, we present the baseline performance of the original YOLOv5 model. Upon the integration of the SA module, a marked enhancement in the detection accuracy of novel classes was observed, with a concurrent minor improvement in the performance of base classes. Specifically, novel classes exhibited a significant improvement of 16% on the NWPUv2 dataset and a noteworthy 4.6% enhancement on the more intricate DIOR dataset. The data suggest that the SA module has minimal influence on base-class detection but considerably bolsters novel-class detection, highlighting the SA module’s capability to enhance focus on foreground targets in a few-shot scenario. Most significantly, as evident in Table 3, the SA module solely serves as an auxiliary during the training phase and has no impact during the inference phase. Consequently, this module can be omitted during inference, effectively enhancing performance without introducing any additional delay, thereby further affirming the superiority of this approach.
During the base training phase on the NWPUv2 dataset, the detection accuracy for base classes hovers around 96%. However, after the introduction of novel classes during the fine-tuning phase, this accuracy is slightly higher than 92%. Following the introduction of the Tri-Head mechanism, the accuracy of base classes soared to over 93%, and the performance of novel classes exhibited notable enhancements as well. On the DIOR dataset, the performance of base classes and novel classes exhibited improvements of 1.67% and 3%, respectively, indicating a substantial enhancement in detection accuracy. This significant advancement suggests that the three-headed structure and dual-distillation mechanism employed by Tri-Head not only mitigate the issue of knowledge forgetting in base classes but also enhance the detection capabilities of novel classes to a considerable degree. Furthermore, as demonstrated in Table 3, two of the three detection heads in Tri-Head function as teacher detection heads, solely contributing during the training phase. This allows for their removal during the inference stage, thus eliminating any potential increase in inference delay.
In Table 4, upon introducing the addition of L E n C l s , base-class performance and novel-class performance on the NWPUv2 dataset exhibited increments of 1.22% and 1.99%, respectively. In contrast, Table 5 reveals a marginal 0.03% decrement in base-class performance of the DIOR dataset, whereas novel-class performance experienced a significant improvement of 6.99%. Overall, L E n C l s can effectively improve the detection ability of novel classes while maintaining the performance improvement or slight decrease in the base classes. This is attributable to the assistance of L E n C l s in classification tasks, particularly in handling objects that are challenging to classify.

4.4.2. Impact of the Hyperparameters in Classification Loss

The proposed loss function L E n C l s has two hyperparameters, a and b . Their effects were evaluated by adjusting their values, with the results presented in Table 6 and Table 7. Taking into account the performance of both base classes and novel classes, we ultimately chose a value of 0.5 for a and 10 for b .

5. Discussion

In Table 1 and Table 2, we compare our method with several algorithms. On the relatively easier NWPUv2 dataset, our approach outperforms the state-of-the-art G-FSDet by 17.26% under the 3-shot setting and 10.96% under the 20-shot setting. On the more challenging DIOR dataset, compared to G-FSDet in Split1, our method achieves an improvement of 16.36% under the 3-shot scenario and 14.75% under the 20-shot scenario. When compared to the latest ST-FSOD algorithm, our method demonstrates an enhancement of 3.96% under the 3-shot setting and 4.53% under the 20-shot setting. Notably, the improvement is more pronounced with fewer training samples, and the rate of improvement diminishes as the training data volume increases. We conclude that as the training data expand to a fully sufficient level, the disparity will become negligible and the gains from the few-shot learning strategy will fade.
In Table 3, we conduct a comparative experiment on detection speed. Our one-stage detector exhibits significantly faster detection speed, more than three times faster than two-stage detector-based methods such as TFA and G-FSDet. The choice of a one-stage detector as our baseline stems from the observation that current research seldom considers detection speed, whereas real-time performance is often crucial in practical applications. Our findings reveal that the integration of YOLOv5 with transfer learning can achieve promising results on remote sensing datasets, underscoring the necessity for greater attention to be paid to research on few-shot object detection utilizing one-stage detectors that offer superior real-time performance.
Table 4 and Table 5 present the performance analysis of individual modules on two datasets. The inclusion of the SA module enhances the overall performance by 4.9% and 1.3% on the NWPUv2 and DIOR datasets, respectively, with notable improvements for the novel classes. This enhancement is attributed to the SA module’s ability to assist the network in focusing on the target objects amid complex remote sensing backgrounds, minimizing background feature interference, which is particularly valuable with limited training samples. The addition of the Tri-Head structure boosts base-class performance by 0.95% and 1.67% on NWPUv2 and DIOR, respectively, while maintaining or even improving novel-class performance. The two teacher heads in Tri-Head concurrently guide the learning of the output head, ensuring a balance between the base and novel classes during fine-tuning, thereby avoiding bias toward one class at the expense of the other. G-FSDet employs parallel branches to learn from the base and novel classes separately, which can be effective but increases the parameter count. Incorporating the improved loss function further improves the overall performance by 1.75% and 1.01% on NWPUv2 and DIOR, respectively, demonstrating that optimizing the classification loss function alone can yield certain gains. Nevertheless, misclassifications remain prevalent. G-FSDet utilizes metric learning to improve classification performance, but a large number of misclassifications are still inevitable, as shown in Figure 11, indicating that the classification task is indeed a major challenge in few-shot object detection.
During the experiments, we observed that the number of training epochs significantly impacts the final results, with both insufficient and excessive training adversely affecting performance, as also noted by the authors of ST-FSOD. Insufficient training leads to poor novel-class performance, while excessive training may cause overfitting. Additionally, owing to the scarcity of training data, experimental results exhibit high variability, making it challenging to discern whether performance differences stem from the proposed methods or from inherent fluctuations. We currently mitigate this uncertainty by running each experiment multiple times; this issue also needs to be taken into account in future research.

6. Conclusions

This research addressed the prevailing challenges of few-shot object detection in remote sensing, specifically low detection accuracy and real-time performance considerations. Using the one-stage YOLOv5 detector as a foundation, the study incorporated transfer learning techniques to tackle the scarcity of data. The following enhancements were introduced: (1) a one-stage detector framework that balances both speed and accuracy was presented; (2) an SA module was designed using mask labels of foreground targets to aid training, amplifying focus on primary targets without additional inference time overhead; (3) the Tri-Head, a dual distillation structure, reduced the forgetting of base-class knowledge and alleviated the performance degradation of novel-class detection caused by the introduction of base-class knowledge distillation; (4) a refined classification loss function was introduced to bolster classification efficacy. The experimental results confirm that the proposed methodology significantly elevates detection accuracy for novel classes while preserving the knowledge of base classes. Concurrently, there is an enhancement in detection speed, with the methodology exhibiting commendable outcomes on two datasets.
Future pursuits will be directed toward refining detection accuracy for diminutive targets and boosting the detector’s classification prowess, aiming to counteract false positives and misclassifications arising from target overlaps and smaller objects.

Author Contributions

Conceptualization, J.Z. and Z.H.; methodology, J.Z. and Z.H.; software, J.Z., Z.H., X.C. and Y.L.; validation, Z.H. and X.C.; formal analysis, J.Z.; investigation, Z.H.; resources, J.Z.; data curation, Z.H. and X.C.; writing—original draft preparation, Z.H.; writing—review and editing, J.Z. and Z.H.; visualization, Z.H. and X.C.; supervision, Y.L.; project administration, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by National Science Foundation of China under Grant 62371362.

Data Availability Statement

The experiments are evaluated on publicly available datasets. Instructions for accessing the datasets can be found in the corresponding published papers.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kaur, J.; Singh, W. Tools, techniques, datasets and application areas for object detection in an image: A review. Multimed. Tools Appl. 2022, 81, 38297–38351. [Google Scholar] [CrossRef] [PubMed]
  2. Zhang, J.; Xu, D.; Li, Y.; Zhao, L.; Su, R. FusionPillars: A 3D Object Detection Network with Cross-Fusion and Self-Fusion. Remote Sens. 2023, 15, 2692. [Google Scholar] [CrossRef]
  3. Shou, Y.; Meng, T.; Ai, W.; Xie, C.; Liu, H.; Wang, Y. Object Detection in Medical Images Based on Hierarchical Transformer and Mask Mechanism. Comput. Intell. Neurosci. 2022, 2022, 5863782. [Google Scholar] [CrossRef] [PubMed]
  4. Shi, Y.; Fan, Y.; Xu, S.; Gao, Y.; Gao, R. Object detection by attention-guided feature fusion network. Symmetry 2022, 14, 887. [Google Scholar] [CrossRef]
  5. Antonelli, S.; Avola, D.; Cinque, L.; Crisostomi, D.; Foresti, G.L.; Galasso, F.; Marini, M.R.; Mecca, A.; Pannone, D. Few-shot object detection: A survey. ACM Comput. Surv. 2022, 54, 1–37. [Google Scholar] [CrossRef]
  6. Chen, J.; Qin, D.; Hou, D.; Zhang, J.; Deng, M.; Sun, G. Multiscale Object Contrastive Learning-Derived Few-Shot Object Detection in VHR Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5635615. [Google Scholar] [CrossRef]
  7. Yu, N.; Ren, H.; Deng, T.; Fan, X. Stepwise Locating Bidirectional Pyramid Network for Object Detection in Remote Sensing Imagery. IEEE Geosci. Remote Sens. Lett. 2023, 20, 6001905. [Google Scholar] [CrossRef]
  8. Zhao, Z.; Tang, P.; Zhao, L.; Zhang, Z. Few-Shot Object Detection of Remote Sensing Images via Two-Stage Fine-Tuning. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8021805. [Google Scholar] [CrossRef]
  9. Zhang, S.; Song, F.; Liu, X.; Hao, X.; Liu, Y.; Lei, T.; Jiang, P. Text semantic fusion relation graph reasoning for few-shot object detection on remote sensing images. Remote Sens. 2023, 15, 1187. [Google Scholar] [CrossRef]
  10. Li, W.; Zhou, J.; Li, X.; Cao, Y.; Jin, G. Few-shot object detection on aerial imagery via deep metric learning and knowledge inheritance. Int. J. Appl. Earth Obs. Geoinf. 2023, 122, 103397. [Google Scholar] [CrossRef]
  11. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  12. Everingham, M.; Eslami, S.M.A.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The pascal visual object classes challenge: A retrospective. Int. J. Comput. Vis. 2015, 111, 98–136. [Google Scholar] [CrossRef]
  13. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
  14. Kang, B.; Liu, Z.; Wang, X.; Yu, F.; Feng, J.; Darrell, T. Few-Shot Object Detection via Feature Reweighting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 8420–8429. [Google Scholar]
  15. Wang, X.; Huang, T.E.; Darrell, T.; Gonzalez, J.E.; Yu, F. Frustratingly simple few-shot object detection. In Proceedings of the International Conference on Machine Learning (ICML), Virtual Event, 13–18 July 2020; pp. 9919–9928. [Google Scholar]
  16. Wu, J.; Liu, S.; Huang, D.; Wang, Y. Multi-scale positive sample refinement for few-shot object detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 456–472. [Google Scholar]
  17. Park, D.; Lee, J. Hierarchical attention network for few-shot object detection via meta-contrastive learning. arXiv 2022, arXiv:2208.07039. [Google Scholar]
  18. Li, Z.; Wang, Y.; Zhang, N.; Zhang, Y.; Zhao, Z.; Xu, D.; Ben, G.; Gao, Y. Deep Learning-Based Object Detection Techniques for Remote Sensing Images: A Survey. Remote Sens. 2022, 14, 2385. [Google Scholar] [CrossRef]
  19. Le Jeune, P.; Mokraoui, A. Improving Few-Shot Object Detection through a Performance Analysis on Aerial and Natural Images. In Proceedings of the European Signal Processing Conference (EUSIPCO), Belgrade, Serbia, 29 August–2 September 2022; pp. 513–517. [Google Scholar]
  20. Xia, G.S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981. [Google Scholar] [CrossRef]
  21. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  22. Li, X.; Deng, J.; Fang, Y. Few-Shot Object Detection on Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5601614. [Google Scholar] [CrossRef]
  23. Zhang, T.; Zhang, X.; Zhu, P.; Jia, X.; Tang, X.; Jiao, L. Generalized few-shot object detection in remote sensing images. ISPRS J. Photogramm. Remote Sens. 2023, 195, 53–364. [Google Scholar] [CrossRef]
  24. Zhang, F.; Shi, Y.; Xiong, Z.; Zhu, X.X. Few-Shot Object Detection in Remote Sensing: Lifting the Curse of Incompletely Annotated Novel Objects. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5603514. [Google Scholar] [CrossRef]
  25. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  26. Ultralytics. YOLOv5. Available online: https://github.com/ultralytics/yolov5 (accessed on 25 January 2024).
  27. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  28. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. Yolov6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  29. Dai, J.; Li, Y.; He, K.; Sun, J. R-fcn: Object detection via region-based fully convolutional networks. Adv. Neural Inf. Process. Syst. 2016, 198, 379–387. [Google Scholar]
  30. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving Into High Quality Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
  31. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  32. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  33. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  34. Zhang, J.; Lei, J.; Xie, W.; Fang, Z.; Li, Y.; Du, Q. SuperYOLO: Super Resolution Assisted Object Detection in Multimodal Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5605415. [Google Scholar] [CrossRef]
  35. Köhler, M.; Eisenbach, M.; Gross, H.-M. Few-Shot Object Detection: A Comprehensive Survey. IEEE Trans. Neural Networks Learn. Syst. 2023, 35, 11958–11978. [Google Scholar] [CrossRef] [PubMed]
  36. Huang, Q.; Zhang, H.; Xue, M.; Song, J.; Song, M. A survey of deep learning for low-shot object detection. arXiv 2021, arXiv:2112.02814. [Google Scholar] [CrossRef]
  37. Liu, L.; Ma, B.; Zhang, Y.; Yi, X.; Li, H. Afd-net: Adaptive fully-dual network for few-shot object detection. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, 20–24 October 2021; pp. 2549–2557. [Google Scholar]
  38. Fan, Z.; Ma, Y.; Li, Z.; Sun, J. Generalized few-shot object detection without forgetting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual Event, 19–25 June 2021; pp. 4525–4534. [Google Scholar]
  39. Xia, R.; Li, G.; Huang, Z.; Meng, H.; Pang, Y. Bi-path combination YOLO for real-time few-shot object detection. Pattern Recognit. Lett. 2023, 165, 91–97. [Google Scholar] [CrossRef]
  40. Xu, H.; Wang, X.; Shao, F.; Duan, B.; Zhang, P. Few-Shot Object Detection via Sample Processing. IEEE Access 2021, 9, 29207–29221. [Google Scholar] [CrossRef]
  41. Ouyang, Y.; Wang, X.; Hu, R.; Xu, H. Few-shot object detection based on positive-sample improvement. Def. Technol. 2022, 28, 74–86. [Google Scholar] [CrossRef]
  42. Bochkovskiy, A.; Wang, C.; Liao, H.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  43. Liu, N.; Xu, X.; Celik, T.; Gan, Z.; Li, H.-C. Transformation-Invariant Network for Few-Shot Object Detection in Remote-Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5625314. [Google Scholar] [CrossRef]
  44. Zhang, Y.; Zhang, B.; Wang, B. Few-Shot Object Detection With Self-Adaptive Global Similarity and Two-Way Foreground Stimulator in Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 7263–7276. [Google Scholar] [CrossRef]
  45. Wang, Y.; Xu, C.; Liu, C.; Li, Z. Context information refinement for few-shot object detection in remote sensing images. Remote Sens. 2022, 14, 3255. [Google Scholar] [CrossRef]
  46. Lan, J.; Zhang, C.; Lu, W.; Gu, N. Spatial-Transformer and Cross-Scale Fusion Network (STCS-Net) for Small Object Detection in Remote Sensing Images. J. Indian Soc. Remote Sens. 2023, 51, 1427–1439. [Google Scholar] [CrossRef]
  47. Cheng, Y.; Wang, W.; Zhang, W.; Yang, L.; Wang, J.; Ni, H.; Guan, T.; He, J.; Gu, Y.; Tran, N.N. A Multi-Feature Fusion and Attention Network for Multi-Scale Object Detection in Remote Sensing Images. Remote Sens. 2023, 15, 2096. [Google Scholar] [CrossRef]
  48. Amjoud, A.B.; Amrouch, M. Object Detection Using Deep Learning, CNNs and Vision Transformers: A Review. IEEE Access 2023, 11, 35479–35516. [Google Scholar] [CrossRef]
  49. Wolf, S.; Meier, J.; Sommer, L.; Beyerer, J. Double Head Predictor based Few-Shot Object Detection for Aerial Imagery. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, QC, Canada, 11–17 October 2021; pp. 721–731. [Google Scholar]
  50. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  51. Cheng, G.; Han, J.; Zhou, P.; Guo, L. Multi-class geospatial object detection and geographic image classification based on collection of part detector. ISPRS J. Photogramm. Remote Sens. 2014, 98, 119–132. [Google Scholar] [CrossRef]
  52. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  53. Wang, X.; Wang, A.; Yi, J.; Song, Y.; Chehri, A. Small Object Detection Based on Deep Learning for Remote Sensing: A Comprehensive Review. Remote Sens. 2023, 15, 3265. [Google Scholar] [CrossRef]
  54. Cheng, G.; Yan, B.; Shi, P.; Li, K.; Yao, X.; Guo, L.; Han, J. Prototype-CNN for Few-Shot Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5604610. [Google Scholar] [CrossRef]
Figure 1. Comparative visualization of the detection results of TFA (first row) and the proposed method (second row) on the NWPUv2 test dataset under the 3-shot setting. The TFA results contain many missed detections and false detections.
Figure 2. Overview of the proposed framework. In the base training stage, the entire model is trained on the base set. During the fine-tuning stage, the backbone is frozen, while the remaining parts follow the training strategy outlined in Section 3.6. SA enhances attention to foreground targets by predicting a binary map and computing L_SA against the ground-truth binary mask map. Tri-Head, a Triplet Head, adopts a dual knowledge distillation mechanism to mitigate forgetting of base-class knowledge. L_EnCls and L_EnObj are enhanced classification loss functions that emphasize challenging samples.
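For readers who wish to reproduce the two-stage transfer-learning schedule summarized in Figure 2, the snippet below illustrates, in PyTorch, one way to freeze a YOLOv5-style backbone before fine-tuning on the few-shot set. It is a minimal sketch under assumed names (a `backbone` parameter prefix and a combined `criterion` covering the detection, SA, and distillation terms); it is not the repository's actual API.

```python
import torch

def freeze_backbone(model: torch.nn.Module, prefix: str = "backbone") -> None:
    """Freeze backbone weights so that only the neck, the detection heads,
    and the auxiliary branches (SA, Tri-Head) are updated during fine-tuning."""
    for name, param in model.named_parameters():
        if name.startswith(prefix):      # assumed parameter-name prefix
            param.requires_grad = False

def finetune_step(model, batch, criterion, optimizer):
    """One fine-tuning iteration on the few-shot (base + novel) training set."""
    images, targets = batch
    preds = model(images)                # detection outputs plus auxiliary outputs
    loss = criterion(preds, targets)     # combined detection, SA, and distillation terms
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Only the parameters left trainable need to be handed to the optimizer, e.g. `torch.optim.SGD(filter(lambda p: p.requires_grad, model.parameters()), lr=...)`, with whatever hyperparameters the fine-tuning stage uses.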
Figure 3. The structure of SA.
Figure 4. The process of obtaining the labels required by SA, illustrated under the 3-shot setting of the NWPUv2 dataset.
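Since Figure 4 is only summarized here, the following sketch shows one plausible way to rasterize the few-shot bounding-box annotations into the binary mask map that supervises SA. The stride and the box-filling rule are assumptions for illustration; the paper's actual label-generation pipeline may differ, for example in how the mask is matched to the feature resolution.

```python
import numpy as np

def boxes_to_binary_mask(boxes, img_h, img_w, stride=8):
    """Rasterize bounding boxes (x1, y1, x2, y2, in pixels) into a binary
    foreground map at 1/stride of the input resolution (values 0 or 1)."""
    mask = np.zeros((img_h // stride, img_w // stride), dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        xs, ys = int(x1) // stride, int(y1) // stride
        xe, ye = int(np.ceil(x2 / stride)), int(np.ceil(y2 / stride))
        mask[ys:ye, xs:xe] = 1.0      # mark the box interior as foreground
    return mask

# Example: two annotated objects in a 640 x 640 training image.
target_mask = boxes_to_binary_mask([(32, 48, 120, 160), (400, 380, 520, 470)], 640, 640)
```

L_SA can then be computed as a per-pixel binary cross-entropy between the predicted binary map and this target.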
Figure 5. The structure of the Triplet Head. H_base, H_output, and H_novel refer to the base-class detection head, the final output head, and the novel-class detection head, respectively. L_Distillation-base and L_Distillation-novel are the base-class distillation loss and the novel-class distillation loss, respectively.
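The dual distillation in the Triplet Head is described here only at the level of Figure 5. A common way to realize such losses is temperature-scaled soft-label distillation [50] between the output head and the two auxiliary heads; the sketch below is a generic implementation under that assumption, and the temperature T and weights w_base and w_novel are illustrative values, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def distill_kl(student_logits, teacher_logits, T: float = 2.0):
    """Soft-label distillation [50]: KL divergence between temperature-softened
    teacher and student class distributions."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

def triplet_head_distillation(out_logits, base_logits, novel_logits,
                              w_base: float = 1.0, w_novel: float = 1.0):
    """Dual distillation: the final output head H_output learns from both the
    base-class head H_base and the novel-class head H_novel."""
    loss_base = distill_kl(out_logits, base_logits.detach())
    loss_novel = distill_kl(out_logits, novel_logits.detach())
    return w_base * loss_base + w_novel * loss_novel
```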
Figure 6. Training sample statistics on the DIOR dataset under the 20-shot setting in Split1.
Figure 7. Visualization of detection results of our method versus YOLOv5 and G-FSDet on the NWPUv2 dataset under the 3-shot setting.
Figure 8. Results of each category in base classes on the NWPUv2 dataset under the 3-shot setting.
Figure 9. Results of each category in novel classes on the NWPUv2 dataset under the 3-shot setting.
Figure 10. Confusion matrix on NWPUv2 under the 3-shot setting.
Figure 11. Visualization of detection results of our method versus YOLOv5 and G-FSDet on the DIOR dataset under the 10-shot setting in Split1.
Table 1. Experiment results on the NWPUv2 test dataset.
Methods | 3-Shot (Base/Novel/All) | 5-Shot (Base/Novel/All) | 10-Shot (Base/Novel/All) | 20-Shot (Base/Novel/All)
Meta-YOLO [14] | 83.13/15.35/62.80 | 82.78/16.24/62.82 | 83.89/24.00/65.92 | 82.80/27.16/66.11
P-CNN [54] | 82.84/41.80/70.53 | 82.89/49.17/72.79 | 83.05/63.29/78.11 | 83.59/66.83/78.55
TFA w/cos [15] | 89.35/8.80/65.19 | 89.60/9.49/64.65 | 89.95/9.26/65.74 | 89.62/10.83/65.98
G-FSDet [23] | 89.11/49.05/77.01 | 88.37/56.10/78.64 | 88.40/71.82/83.43 | 89.73/75.41/85.44
Ours | 94.10/81.67/90.30 | 95.39/83.27/91.70 | 95.49/92.77/94.70 | 96.01/91.80/94.80
Bold font indicates the best results.
Table 2. Experiment results on the DIOR test dataset.
Split | Methods | 3-Shot (Base/Novel/All) | 5-Shot (Base/Novel/All) | 10-Shot (Base/Novel/All) | 20-Shot (Base/Novel/All)
1 | Meta-YOLO [14] | 49.40/7.50/38.90 | 49.70/12.10/40.30 | 49.50/18.10/41.70 | 50.00/22.00/43.00
1 | P-CNN [54] | 47.00/18.00/39.80 | 48.40/22.80/42.00 | 50.90/27.60/45.10 | 52.20/29.60/46.80
1 | TFA w/cos [15] | 70.32/11.35/55.58 | 70.51/11.57/55.78 | 70.52/15.37/56.73 | 71.07/17.96/57.79
1 | G-FSDet [23] | 68.94/27.57/58.61 | 69.52/30.52/59.72 | 69.03/37.46/61.16 | 69.80/39.83/62.31
1 | ST-FSOD [24] | 73.50/41.90/65.60 | 73.30/45.70/66.40 | 72.60/50.00/66.95 | 73.30/53.70/68.40
1 | Ours | 78.33/37.68/68.20 | 78.61/44.32/70.00 | 79.86/45.71/71.30 | 79.03/48.88/71.50
2 | Meta-YOLO [14] | 48.50/4.80/37.60 | 46.80/7.00/36.90 | 46.40/9.00/37.10 | 43.50/14.10/36.20
2 | P-CNN [54] | 48.90/14.50/40.30 | 49.10/14.90/40.60 | 52.50/18.90/44.10 | 51.60/22.80/44.40
2 | TFA w/cos [15] | 70.75/5.77/54.51 | 70.79/8.19/55.14 | 69.93/8.71/54.63 | 70.02/12.18/55.56
2 | G-FSDet [23] | 69.20/14.13/55.43 | 69.25/15.84/55.87 | 68.71/20.70/56.70 | 68.18/22.69/56.86
2 | ST-FSOD [24] | 72.50/17.70/58.80 | 72.70/20.70/59.70 | 72.30/27.30/61.05 | 73.30/33.40/63.33
2 | Ours | 77.07/16.51/61.90 | 78.07/20.44/63.70 | 79.29/28.84/66.70 | 78.11/34.65/67.20
3 | Meta-YOLO [14] | 45.50/7.80/36.10 | 47.90/13.70/39.30 | 44.50/13.80/36.80 | 43.50/18.50/37.30
3 | P-CNN [54] | 49.50/16.50/41.30 | 49.90/18.80/42.10 | 52.10/23.30/44.90 | 53.10/28.80/47.00
3 | TFA w/cos [15] | 71.95/8.36/56.05 | 71.64/10.13/56.26 | 72.56/10.75/57.11 | 73.13/17.99/59.35
3 | G-FSDet [23] | 71.10/16.03/57.34 | 70.18/23.25/58.43 | 71.08/26.24/59.87 | 71.26/32.05/61.46
3 | ST-FSOD [24] | 75.20/20.90/61.63 | 75.60/26.00/63.20 | 75.70/31.30/64.60 | 75.50/34.60/65.28
3 | Ours | 75.92/25.12/63.20 | 76.39/26.31/63.90 | 77.03/31.56/65.70 | 76.66/37.88/67.00
4 | Meta-YOLO [14] | 48.20/3.70/37.10 | 48.50/6.80/38.10 | 45.70/7.20/36.10 | 44.40/12.20/36.40
4 | P-CNN [54] | 49.80/15.20/41.20 | 49.90/17.50/41.80 | 51.70/18.90/43.50 | 52.30/25.70/45.70
4 | TFA w/cos [15] | 68.57/10.42/54.03 | 68.85/14.29/55.21 | 68.58/14.35/55.03 | 68.86/12.01/54.65
4 | G-FSDet [23] | 69.01/16.74/55.95 | 67.96/21.03/56.30 | 68.55/25.84/57.87 | 67.73/31.78/58.75
4 | ST-FSOD [24] | 73.30/20.40/60.08 | 73.50/25.20/61.43 | 73.90/33.40/63.78 | 73.80/38.20/64.90
4 | Ours | 76.33/15.67/61.20 | 77.49/27.36/65.00 | 78.52/29.62/66.30 | 78.42/40.04/68.80
Bold font indicates the best results.
Table 3. Comparison of detection speed on the DIOR dataset.
Method | Params (M) | FLOPs (G) | Speed (FPS)
TFA [15] | 60.33 | 119.63 | 18.00
G-FSDet [23] | 74.23 | 133.52 | 17.00
YOLOv5 [26] | 13.62 | 28.00 | 74.07
Ours (training) | 29.13 | 72.30 | 51.28
Ours (inference) | 13.62 | 28.00 | 74.07
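The inference row of Table 3 matches plain YOLOv5 because SA and the auxiliary heads are used only during training. A rough way to measure throughput figures of this kind is sketched below; the exact batch size, input resolution, and hardware used for Table 3 are not reproduced here.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, img_size=640, warmup=20, iters=200, device="cuda"):
    """Approximate single-image FPS: warm-up passes followed by a timed loop."""
    model.eval().to(device)
    x = torch.randn(1, 3, img_size, img_size, device=device)
    for _ in range(warmup):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return iters / (time.time() - t0)
```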
Table 4. Ablation experiment results on the NWPUv2 test dataset under the 3-shot setting.
SA | Tri-Head | L_EnCls | Base | Novel | All
  |   |   | 92.26 | 70.40 | 85.60
√ |   |   | 93.27 | 81.70 | 89.80
  | √ |   | 93.14 | 70.83 | 86.40
  |   | √ | 93.39 | 71.80 | 87.10
√ | √ | √ | 94.10 | 81.67 | 90.30
√ indicates the use of this module.
Table 5. Ablation experiment results on the DIOR test dataset under the 10-shot setting in Split1.
SA | Tri-Head | L_EnCls | Base | Novel | All
  |   |   | 78.45 | 41.03 | 69.10
√ |   |   | 79.00 | 42.92 | 70.00
  | √ |   | 79.76 | 42.26 | 70.40
  |   | √ | 78.43 | 43.90 | 69.80
√ | √ | √ | 79.86 | 45.71 | 71.30
√ indicates the use of this module.
Table 6. Sensitivity analysis of hyperparameter b in L_EnCls on the DIOR test dataset under the 10-shot setting in Split1.
a | b | Base | Novel | All
0.5 | 13.0 | 78.46 | 42.99 | 69.60
0.5 | 11.0 | 78.41 | 43.54 | 69.70
0.5 | 10.0 | 78.43 | 43.90 | 69.80
0.5 | 9.0 | 78.43 | 43.60 | 69.70
0.5 | 7.0 | 78.17 | 43.00 | 69.40
Bold font indicates the best results.
Table 7. Sensitivity analysis of hyperparameter a in L_EnCls on the DIOR test dataset under the 10-shot setting in Split1.
a | b | Base | Novel | All
0.8 | 10.0 | 78.43 | 43.43 | 69.70
0.6 | 10.0 | 78.38 | 43.57 | 69.70
0.5 | 10.0 | 78.43 | 43.90 | 69.80
0.4 | 10.0 | 78.39 | 43.46 | 69.60
0.2 | 10.0 | 78.42 | 43.41 | 69.70
Bold font indicates the best results.
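Tables 6 and 7 indicate that a = 0.5 and b = 10.0 give the best trade-off for L_EnCls on this split. The exact formulation of L_EnCls is given earlier in the paper; purely as an illustration of how two such scalars can modulate the emphasis on hard samples, the sketch below applies a hardness-dependent weight to a binary cross-entropy term. This is an assumed form, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def enhanced_bce(pred_logits, targets, a: float = 0.5, b: float = 10.0):
    """Illustrative hard-sample-weighted BCE: samples whose predicted probability
    is far from the target receive an extra weight that grows with b, while a
    bounds the added emphasis. NOT the paper's exact L_EnCls."""
    bce = F.binary_cross_entropy_with_logits(pred_logits, targets, reduction="none")
    p = torch.sigmoid(pred_logits)
    hardness = torch.abs(targets - p)                 # 0 = easy, 1 = very hard
    weight = 1.0 + a * (1.0 - torch.exp(-b * hardness))
    return (weight * bce).mean()
```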
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
