1. Introduction
Object detection plays a crucial role in remote sensing applications [1,2,3,4], including land-use monitoring, maritime vessel identification, forest fire early warning, and disaster assessment [5,6,7]. However, conventional object detection methods typically rely on large-scale annotated datasets, which are costly to obtain for novel or rare objects in remote sensing imagery [8,9]. Moreover, certain categories may have extremely limited sample availability, making it difficult for models to learn and generalize effectively. Consequently, there has been a growing interest in few-shot object detection (FSOD) within the remote sensing community in recent years [10,11].
FSOD aims to enhance object detection performance while reducing dependence on large-scale annotations [12,13]. The key idea is to integrate object detection [14,15] with few-shot learning [16,17,18,19,20], enabling models to rapidly adapt to new categories using only a few labeled examples. Most FSOD approaches adopt a two-stage learning paradigm: first, pretraining on base classes to acquire generalized features, followed by fine-tuning on novel classes with limited data to improve recognition capability [21,22]. Based on different learning strategies, FSOD methods can be broadly classified into meta-learning [23,24,25] and fine-tuning approaches [26,27,28]. Meta-learning enhances generalization through task-level optimization, whereas fine-tuning keeps the pretrained backbone fixed and adjusts only specific parameters of the detector to refine detection performance. FSOD not only improves detection accuracy under data-scarce conditions but also adapts effectively to multi-sensor data, complex backgrounds, and multi-scale objects, making it a key research direction in remote sensing object detection.
Despite advancements, existing FSOD methods still face significant challenges in adapting to remote sensing images [25,28]. The primary issue lies in their inability to simultaneously learn both fine-grained local features and global structural representations. Due to the limited training samples in FSOD, novel category instances often cover only partial object regions, preventing the model from reconstructing a complete object shape during inference [24,25,26]. For instance, as illustrated in Figure 1, training samples of a novel category in remote sensing imagery may only include the tail of an aircraft or part of a ship, while at the testing stage the model is required to recognize the entire vessel or complete building. However, without complete object representations in training, the model tends to focus only on the seen regions, neglecting unseen parts, which leads to incomplete object detection.
Moreover, current FSOD methods suffer from unstable feature adaptation in the global feature space, causing novel categories to be misclassified as background. Given the scarcity of FSOD training samples, the model struggles to acquire sufficient feature support for novel categories, leading to a distribution mismatch between base and novel categories in the global feature space. This discrepancy makes it challenging to differentiate novel objects from the background, reducing detection performance. Additionally, as shown in the bottom part of Figure 1, FSOD training data often contain missing annotations or noisy bounding boxes. These issues distort the semantic supervision signal and lead to feature drift in the global feature space, where novel class instances shift closer to the background or base classes. This misalignment increases the difficulty of distinguishing novel objects, further degrading detection performance.
To overcome these challenges, we propose a novel FSOD framework that jointly enhances local feature learning and global feature adaptation, addressing both local feature incompleteness and global adaptation instability in novel category detection. This joint focus enables a more comprehensive understanding of novel categories under limited data conditions. Specifically, our framework incorporates two key components: the Extensible Local Feature Aggregator Module (ELFAM) for optimized local feature learning and Self-Guided Novel Adaptation (SGNA) for improved global feature adaptation. First, ELFAM enhances local feature learning through a local feature aggregation mechanism that dynamically models relationships among different local regions, allowing the model to capture more latent details even with limited samples. ELFAM also leverages co-occurrence relationships within base classes to automatically complete missing local information during training, thereby improving the structural integrity of novel objects and ensuring a more complete object representation during inference. Next, we introduce SGNA to enhance the global feature adaptation of novel categories. SGNA adopts a teacher-student collaborative optimization strategy in which pseudo novel proposals generated by the teacher model improve ground truth quality and refine the feature distribution of novel categories, ensuring more stable adaptation in the global feature space. Furthermore, we incorporate the Teacher-Guided Dual-Branch Head (TG-DH), which uses pseudo novel labels provided by the teacher model to optimize both the classification and bounding box regression losses, thereby enhancing detection accuracy and generalization for novel categories. By jointly optimizing local and global feature learning, our method captures fine-grained details of novel categories while ensuring stable adaptation in the global feature space, significantly improving few-shot object detection performance in remote sensing tasks. Extensive evaluations on multiple publicly available remote sensing benchmarks confirm the effectiveness of our approach under various few-shot settings, particularly in detecting novel categories with limited supervision. The core contributions of our framework are summarized as follows:
We propose a novel framework that simultaneously addresses structural incompleteness and feature misalignment in few-shot object detection for remote sensing imagery by enhancing local structure modeling and stabilizing global semantic adaptation.
The framework incorporates three innovative modules designed to enhance feature learning synergy: (1) an Extensible Local Feature Aggregator Module, which facilitates fine-grained structural representation through dynamic multi-scale feature aggregation; (2) a Self-Guided Novel Adaptation module, which enhances global feature alignment via teacher-student collaborative optimization; and (3) a Teacher-Guided Dual-Branch Head, which decouples base and novel class adaptation using stable pseudo-supervision to improve generalization.
We conduct extensive experiments on DIOR and iSAID, evaluating our proposed method under various few-shot settings. The results demonstrate that our method outperforms state-of-the-art approaches on evaluation metrics and achieves notable performance improvements.
The rest of this paper is organized as follows: Section 2 reviews the related work, with a focus on few-shot object detection (FSOD) and its application to remote sensing. Section 3 introduces our proposed framework, elaborating on the ELFAM and SGNA modules for enhancing local and global feature representations. Section 4 presents the experimental results, including ablation studies and comparisons on the DIOR and iSAID datasets. Finally, Section 5 concludes the paper by summarizing the key contributions and discussing future directions.
3. Method
In this section, we first provide an overview of the proposed framework in Section 3.1. Then, we introduce the formulation of FSOD in Section 3.2. Next, we present our proposed Extensible Local Feature Aggregator Module (ELFAM) to enhance local feature learning in Section 3.3. Subsequently, we describe our Self-Guided Novel Adaptation (SGNA) for optimizing global feature adaptation in Section 3.4. Finally, we present the Teacher-Guided Dual-Branch Head and the associated training and optimization process in Section 3.5.
3.1. Overview
The overall architecture of our few-shot object detection framework is illustrated in Figure 2. The framework follows a two-stage design, consisting of a base training stage for learning transferable representations from base categories and a fine-tuning stage for adapting the model to novel categories with limited labeled samples. In both stages, the Extensible Local Feature Aggregator Module (ELFAM) is employed to enhance the completeness of local feature representations by recursively aggregating multi-scale contextual information. During fine-tuning, a Self-Guided Novel Adaptation (SGNA) RPN is introduced to generate high-quality pseudo proposals via a teacher-student paradigm, thereby mitigating feature drift in the global semantic space. In parallel, a Teacher-Guided Dual-Branch Detection Head (TG-Dual BBox Head) supervises classification and regression by fusing pseudo and ground truth labels, enhancing detection stability and generalization on novel classes.
3.2. Problem Formulation
FSOD aims to reduce the dependency of conventional detectors on large-scale annotations by facilitating the recognition of novel categories from only a handful of labeled examples. Specifically, given a dataset $D = \{(x, y)\}$, where $x$ represents the input image and $y = (c, b)$ contains the class label $c$ and corresponding bounding box $b$, the FSOD task divides the dataset into base classes $C_{base}$ and novel classes $C_{novel}$, which correspond to the base dataset $D_{base}$ and the novel dataset $D_{novel}$, respectively. These two sets are non-overlapping, meaning $C_{base} \cap C_{novel} = \emptyset$, and novel class objects remain unseen during the base training stage.
In FSOD tasks, base classes typically have abundant labeled data, whereas novel classes contain only a few annotated samples, often ranging from 1 to 10 instances per class. Due to this extreme data imbalance, models are prone to overfitting when learning novel classes, leading to reduced generalization capability. This tendency primarily arises because the limited number of annotated instances in novel classes constrains the model's ability to capture intra-class variations, which often leads the detector to memorize the few available samples instead of learning generalizable features. The problem is further compounded by the strong inductive bias inherited from abundant base class training, which may dominate the adaptation process during fine-tuning. Moreover, novel class samples may only cover partial regions of an object, resulting in incomplete object shape information. This makes it challenging for the model to accurately recognize the entire structure of novel objects during inference. To address these challenges, most FSOD methods follow a two-stage fine-tuning paradigm [26,27,28,39], where a general object detection model is first pretrained on the base dataset and subsequently fine-tuned on the novel dataset with limited samples. The training pipeline can be formulated as follows:

$$M_{init} \xrightarrow{\;D_{base}\;} M_{base} \xrightarrow{\;D_{novel}\;} M_{novel},$$

where $M_{init}$ represents the initialized object detection model, $M_{base}$ is the detector trained on base classes, and $M_{novel}$ is the final FSOD model fine-tuned on novel categories. During the base training stage, the model learns generalizable object features from $D_{base}$ and acquires strong detection capabilities. In the fine-tuning stage, the model learns novel categories from $D_{novel}$, but due to the extremely limited data, it must leverage the knowledge from base categories to enhance the accuracy and generalization ability for novel class detection.
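For illustration, the two-stage pipeline can be expressed as a short PyTorch-style sketch. The sketch assumes a torchvision-style detector that returns a dictionary of losses and exposes a backbone attribute; the decision to freeze the backbone during fine-tuning and the iteration counts (taken from the schedules in Section 4.2) are illustrative rather than a faithful reproduction of our training code.

```python
import itertools
import torch

def train(model, optimizer, loader, iterations):
    """Generic detection training loop; the model is assumed to return a dict of losses."""
    data = itertools.cycle(loader)                 # repeat the loader as needed
    for _ in range(iterations):
        images, targets = next(data)
        loss_dict = model(images, targets)         # e.g., RPN + RoI classification/regression losses
        loss = sum(loss_dict.values())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model

def fsod_pipeline(model, base_loader, novel_loader):
    # Stage 1: base training on abundant base-class annotations (D_base).
    opt_base = torch.optim.AdamW(model.parameters(), weight_decay=0.01)
    train(model, opt_base, base_loader, iterations=80_000)
    # Stage 2: K-shot fine-tuning on D_novel; freeze the backbone and update
    # only the remaining detector parameters (an illustrative choice).
    for p in model.backbone.parameters():
        p.requires_grad_(False)
    trainable = [p for p in model.parameters() if p.requires_grad]
    opt_ft = torch.optim.AdamW(trainable, weight_decay=0.01)
    train(model, opt_ft, novel_loader, iterations=10_000)
    return model
```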
3.3. Extensible Local Feature Aggregator Module
To enhance local feature representation in few-shot object detection (FSOD), we propose the Extensible Local Feature Aggregator Module (ELFAM), which dynamically aggregates local feature information and progressively expands to construct a complete object representation. ELFAM operates on multi-scale feature maps and optimizes target features through an adaptive local feature aggregation mechanism.
(a) Multi-Scale Feature Expansion via Attention Aggregation
Given an input image $I$, the feature extractor generates a set of multi-scale feature maps $\{F_l\}_{l=1}^{L}$, where each $F_l$ represents the feature map at level $l$. As shown in Figure 3, ELFAM progressively expands the attention region, starting from a small set of keypoints and gradually expanding to cover the entire target region.
Unlike conventional attention mechanisms that operate within a single feature scale, ELFAM performs recursive and progressive attention expansion across multiple scales, guided by local context cues and cross-scale feature dependencies. This design enables the model to iteratively grow attention regions from semantically informative keypoints and adaptively integrate multi-scale information, thereby constructing a complete and structurally coherent object representation under few-shot conditions. Specifically, at the lowest scale $F_1$, an initial set of keypoints $A_0$ is selected, and ELFAM recursively expands the attention field from these local regions in various directions. The attention computation at each level depends on the previously captured regions:

$$A_t = \mathcal{E}\left(A_{t-1}, F_t\right),$$

where $\mathcal{E}(\cdot)$ represents a function that expands the attention region at scale $t$ based on the previous attention field $A_{t-1}$. At each layer, the expansion direction is adaptive, dynamically adjusting based on object geometry, local feature distribution, and historical attention weights. This ensures that the model captures both the coarse outline at lower resolutions and finer details at higher resolutions.
To further describe the multi-scale adaptive attention expansion, we define the attention expansion operation $\mathcal{E}(\cdot)$. Unlike conventional detectors that rely on fixed or heuristic keypoint proposals, ELFAM treats the initial local keypoints as learnable query positions on the multi-scale feature maps. These keypoints are jointly optimized during training through standard detection losses, including classification and regression supervision, enabling the network to discover semantically meaningful anchors for attention propagation. This learnable design allows the model to automatically refine keypoint locations, thereby mitigating the risk of error accumulation caused by suboptimal initialization. We enhance the attention expansion process by incorporating learnable spatial offsets $\Delta p_n$ [34,48], allowing each attention direction to flexibly shift the query position in the feature space. This enables the model to attend to semantically meaningful locations beyond the rigid neighborhood and improves its ability to recover complete object structures from partial observations. The updated attention operation is formulated as:

$$A_t(q_l) = \sum_{n=1}^{N} \mathrm{softmax}_n\!\left(q_l^{\top} W_K F_l(q_l + \Delta p_n)\right) W_V F_l(q_l + \Delta p_n) \;+\; \lambda \sum_{j \in \mathcal{S}(l)} \beta_j A_j,$$

where $q_l$ represents the query feature point at level $l$, serving as the initial reference for attention expansion; $\Delta p_n$ denotes the learnable spatial offset at direction $n$, allowing the attention to dynamically adjust its sampling position in the feature space; $W_K$ and $W_V$ denote the key and value transformations in the attention layer, which are applied after spatial shifting; $N$ defines the number of directions used at each scale, ensuring flexible local expansion; $\mathcal{S}(l)$ refers to neighboring scales providing additional contextual support; $\beta_j$ is a learnable cross-scale attention weight controlling the contribution from adjacent levels; and $\lambda$ is a balancing factor regulating the transition between local expansion and global feature integration.
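To make the expansion operation more concrete, the following PyTorch sketch implements a single expansion step in the spirit of deformable attention [34,48]: a query samples several offset locations on its feature map, attends over them, and blends in support from a neighboring scale. Module and parameter names (AttentionExpansionStep, num_dirs, and so on) are ours, and the code is a simplified illustration rather than the exact ELFAM implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionExpansionStep(nn.Module):
    """One ELFAM-style expansion step (illustrative sketch)."""
    def __init__(self, dim=256, num_dirs=8):
        super().__init__()
        self.num_dirs = num_dirs
        self.offsets = nn.Linear(dim, num_dirs * 2)   # learnable offsets, one 2D shift per direction
        self.key = nn.Linear(dim, dim)                # W_K, applied after spatial shifting
        self.value = nn.Linear(dim, dim)              # W_V
        self.beta = nn.Parameter(torch.tensor(0.5))   # cross-scale attention weight
        self.lam = nn.Parameter(torch.tensor(0.5))    # local/global balancing factor

    def forward(self, feat, q_pos, q_feat, cross_scale_feat):
        # feat: (B, C, H, W); q_pos: (B, 2) in [-1, 1]; q_feat, cross_scale_feat: (B, C)
        B, C, _, _ = feat.shape
        offs = self.offsets(q_feat).view(B, self.num_dirs, 2).tanh() * 0.1   # small learned shifts
        pos = (q_pos[:, None, :] + offs).clamp(-1, 1)                        # (B, N, 2) sample positions
        grid = pos.view(B, self.num_dirs, 1, 2)                              # grid_sample layout
        sampled = F.grid_sample(feat, grid, align_corners=False)             # (B, C, N, 1)
        sampled = sampled.squeeze(-1).transpose(1, 2)                        # (B, N, C)
        attn = torch.einsum('bc,bnc->bn', q_feat, self.key(sampled)) / C ** 0.5
        attn = attn.softmax(dim=-1)                                          # weights over the N directions
        local = torch.einsum('bn,bnc->bc', attn, self.value(sampled))        # expanded local feature
        return local + self.lam * self.beta * cross_scale_feat               # add neighboring-scale support
```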
(b) Local Feature Aggregation Mechanism
After expanding the attention region across multiple scales, the local feature aggregation mechanism ensures effective fusion of key features, leading to a complete object representation. The aggregated feature at each level is computed as:

$$\hat{A}_t = \gamma_t A_t + \sum_{j \in \mathcal{S}(t)} \omega_{j \to t} A_j,$$

where $\gamma_t$ is a learnable weight for scale $t$, $\mathcal{S}(t)$ denotes neighboring scales, and $\omega_{j \to t}$ is an adaptive weight controlling the contribution of scale $j$ to scale $t$. This aggregation method ensures that local information across scales is effectively combined, enhancing the completeness of novel object representations.

The final object representation is obtained by aggregating features across all scales:

$$F_{obj} = \sum_{t=1}^{L} \eta_t \hat{A}_t,$$

where $\eta_t$ is a learnable coefficient that adjusts the contribution of each scale. This multi-scale fusion approach effectively captures fine-grained information while preserving the complete object structure, enabling the model to balance local detail expression and global target consistency in few-shot object detection tasks. This enhancement significantly improves detection performance, particularly for remote sensing images with varying scales and orientations.
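A minimal sketch of this cross-scale fusion is shown below. The parameter-to-symbol mapping (gamma, omega, eta) is ours, and for simplicity the neighborhood $\mathcal{S}(t)$ is taken to be all scales rather than only adjacent ones.

```python
import torch
import torch.nn as nn

class ScaleAggregator(nn.Module):
    """Weighted fusion of per-scale aggregated features (illustrative sketch)."""
    def __init__(self, num_scales=4):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(num_scales))              # per-scale weights gamma_t
        self.omega = nn.Parameter(torch.eye(num_scales))               # cross-scale weights omega_{j->t}
        self.eta = nn.Parameter(torch.ones(num_scales) / num_scales)   # final fusion weights eta_t

    def forward(self, feats):
        # feats: list of per-scale aggregated features A_t, each of shape (B, C)
        stacked = torch.stack(feats, dim=1)                            # (B, T, C)
        # refined_t = gamma_t * A_t + sum_j omega_{j->t} * A_j  (sum over all scales here)
        refined = self.gamma[None, :, None] * stacked \
            + torch.einsum('jt,bjc->btc', self.omega, stacked)
        # F_obj = sum_t eta_t * refined_t
        return torch.einsum('t,btc->bc', self.eta, refined)
```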
3.4. Self-Guided Novel Adaptation RPN
To address the challenge of insufficient global feature representation for novel classes in few-shot object detection (FSOD), we propose a Self-Guided Novel Adaptation (SGNA) mechanism. SGNA employs a teacher–student collaborative framework with pseudo-label supervision to dynamically enhance the global distribution and stability of novel class features. As shown in Figure 4, the SGNA module consists of three proposal generation branches: a fixed Base RPN, a trainable Student RPN, and a Teacher RPN updated via Exponential Moving Average (EMA). The Base RPN preserves knowledge learned from base classes, the Student RPN is optimized under label supervision, and the Teacher RPN provides stable pseudo-labels to assist learning.
During each training iteration, the Teacher RPN receives feature maps and outputs a set of region proposals $\mathcal{P}_T = \{p_i\}$, where each proposal $p_i$ is associated with a confidence score $s_i$. High-quality pseudo novel proposals are selected by thresholding the scores using a predefined threshold $\tau$, forming the pseudo-label set:

$$\mathcal{G}_{pseudo} = \{\, p_i \in \mathcal{P}_T \mid s_i \geq \tau \,\}.$$

These pseudo proposals are then combined with the original ground truth annotations of novel classes $\mathcal{G}_{novel}$, resulting in an enhanced supervision label set:

$$\mathcal{G}_{enh} = \mathcal{G}_{novel} \cup \mathcal{G}_{pseudo}.$$

The fused label set $\mathcal{G}_{enh}$ is used to supervise the Student RPN. Meanwhile, the Base RPN continues to be supervised using the original ground truth annotations from base classes $\mathcal{G}_{base}$, in order to maintain detection capability on previously learned categories. The total detection loss is computed on both student and base predictions, including classification and bounding box regression losses:

$$\mathcal{L}_{RPN} = \mathcal{L}_{cls}^{S} + \mathcal{L}_{reg}^{S} + \mathcal{L}_{cls}^{B} + \mathcal{L}_{reg}^{B},$$

where $\mathcal{L}_{cls}$ is the cross-entropy loss, $\mathcal{L}_{reg}$ is the smooth L1 loss, and the superscripts $S$ and $B$ denote the student and base branches, respectively. This composite loss formulation ensures that the student branch benefits from enhanced pseudo-labeled supervision while the base branch retains stable learning of prior categories, achieving a balanced adaptation in the global feature space. This pseudo-label generation, label fusion, and supervision pipeline is explicitly visualized as three orange paths in our framework diagram, which denote the backward propagation of gradients from pseudo-labels to student parameters. This mechanism enables the model to leverage both annotated and pseudo-labeled data to improve its understanding of novel object structures and enhance global feature consistency.
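The pseudo-label selection and fusion step can be summarized by the short sketch below. Tensor layouts and the threshold value are placeholders; in the full pipeline the fused boxes would additionally carry the class and objectness targets required for RPN assignment.

```python
import torch

def select_pseudo_proposals(boxes, scores, tau=0.7):
    """Keep teacher proposals whose confidence exceeds the threshold tau (placeholder value)."""
    return boxes[scores >= tau]

def build_enhanced_labels(gt_novel_boxes, pseudo_boxes):
    """Fuse the few annotated novel boxes with high-confidence pseudo proposals."""
    return torch.cat([gt_novel_boxes, pseudo_boxes], dim=0)

# Example with dummy tensors:
teacher_boxes = torch.rand(100, 4)    # (x1, y1, x2, y2) proposals from the Teacher RPN
teacher_scores = torch.rand(100)      # objectness / confidence scores
gt_novel = torch.rand(3, 4)           # the few annotated novel-class boxes
pseudo = select_pseudo_proposals(teacher_boxes, teacher_scores)
enhanced = build_enhanced_labels(gt_novel, pseudo)   # supervises the Student RPN
```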
To ensure the stability of pseudo-label generation, the parameters of the Teacher RPN, denoted as $\theta_T$, are updated via an Exponential Moving Average of the Student RPN's parameters $\theta_S$:

$$\theta_T \leftarrow \alpha\, \theta_T + (1 - \alpha)\, \theta_S,$$

where $\alpha$ is the smoothing coefficient. Finally, the region proposals from the Student RPN and the Base RPN are merged to form the final proposal set for downstream classification and regression. By integrating stable structure, supervision enhancement, and self-guided optimization, SGNA effectively improves the global representation of novel classes and significantly boosts detection accuracy and generalization in FSOD for remote sensing tasks.
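The EMA update itself reduces to a parameter-wise interpolation between teacher and student weights; a generic sketch is given below (the momentum value is a placeholder):

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, alpha: float = 0.999):
    """theta_T <- alpha * theta_T + (1 - alpha) * theta_S."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(alpha).add_(p_s, alpha=1.0 - alpha)
```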
The training process of SGNA can be regarded as a self-guided pseudo-label optimization pipeline, where the teacher branch identifies potential novel object regions through high-confidence predictions to enrich and complete the supervisory signals. Meanwhile, the student branch is updated under this enhanced supervision and, in turn, refines the teacher parameters via EMA updates, enabling the model to progressively extract meaningful structural features from noisy pseudo-labels. This mechanism forms a closed-loop process that integrates pseudo-label generation, enhanced label construction, and supervised backpropagation, allowing the model to gradually narrow the distribution gap between base and novel categories and improve the separability and stability of novel features in the global semantic space. As illustrated in Figure 4, the three orange arrows depict the supervisory flow from pseudo-labels to enhanced supervision and finally to model optimization.
By introducing the SGNA module, the model maintains structural integrity and distributional stability of novel category features under extremely low annotation conditions. This effectively mitigates adaptation drift caused by data scarcity and provides more discriminative and generalizable feature representations for novel object detection.
3.5. Teacher-Guided Dual-Branch Head
While SGNA enhances the quality of region proposals for novel categories, the detection head still struggles to model complete object structures under extremely limited supervision. To further refine novel object representations and enhance structural consistency, we propose the Teacher-Guided Dual-Branch Head (TG-DH), which leverages stable predictions from a teacher branch to construct high-quality supervision for the student detection head.
As illustrated in Figure 5, TG-DH takes the Regions of Interest (RoIs) generated by the SGNA RPN and RoI Align, and feeds them into two parallel detection heads with identical architectures: a Student head and a Teacher head. The Teacher head produces refined pseudo-labels from stabilized predictions, which are then fused with the original ground truth annotations to form an enhanced label set $\tilde{\mathcal{G}}$. This enriched supervision set provides more complete and discriminative training signals for the Student head.
The Teacher head is updated via an Exponential Moving Average (EMA) of the Student head to maintain temporal stability:

$$\theta_{T}^{head} \leftarrow \alpha\, \theta_{T}^{head} + (1 - \alpha)\, \theta_{S}^{head},$$

where $\alpha$ is the EMA momentum. The final loss is computed on the Student head using the fused label set $\tilde{\mathcal{G}}$, combining classification and regression losses:

$$\mathcal{L}_{TG\text{-}DH} = \mathcal{L}_{cls}\big(\tilde{\mathcal{G}}\big) + \mathcal{L}_{reg}\big(\tilde{\mathcal{G}}\big).$$
By continuously updating the Teacher head with stabilized knowledge and using its predictions to refine supervision, TG-DH progressively enhances the structural completeness and localization accuracy of novel objects. Used in conjunction with SGNA, it refines RoI-level representations on top of the improved region proposals, forming a progressive optimization path from coarse proposal-level adaptation to fine-grained RoI-level detection. This joint design improves both structural fidelity and localization robustness for novel categories under low-data regimes.
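For completeness, supervising the Student head with the fused label set reduces to standard cross-entropy and smooth L1 terms over the assigned RoIs; the following is a simplified sketch under our own naming, not the exact loss implementation.

```python
import torch
import torch.nn.functional as F

def tgdh_student_loss(cls_logits, box_deltas, fused_labels, fused_box_targets):
    """Student-head loss on targets derived from the fused (GT + teacher pseudo) label set.

    cls_logits:        (R, num_classes) classification scores for R RoIs
    box_deltas:        (R, 4) predicted box regression offsets
    fused_labels:      (R,) class indices assigned from the fused label set (0 = background)
    fused_box_targets: (R, 4) regression targets derived from the fused boxes
    """
    loss_cls = F.cross_entropy(cls_logits, fused_labels)
    fg = fused_labels > 0                                   # regress only foreground RoIs
    if fg.any():
        loss_reg = F.smooth_l1_loss(box_deltas[fg], fused_box_targets[fg])
    else:
        loss_reg = box_deltas.sum() * 0.0                   # keeps the graph valid with no foreground
    return loss_cls + loss_reg
```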
4. Experiments
This section reports detection results and evaluation insights obtained on benchmark remote sensing datasets under different few-shot settings. We analyze the detection performance, compare against state-of-the-art methods, and provide ablation studies to validate the effectiveness of each component in our framework.
4.1. Datasets and Evaluation Protocol
We conduct evaluations on two widely used remote sensing datasets, DIOR [49] and iSAID [50], to assess the effectiveness of our method under diverse few-shot detection scenarios. These datasets differ significantly in terms of spatial resolution, object density, scene complexity, and category granularity, thus providing comprehensive coverage of real-world remote sensing environments.
DIOR contains 23,463 optical remote sensing images with 20 annotated object categories, including both human-made and natural targets (e.g., airplanes, ships, bridges, and storage tanks). The images have varying spatial resolutions ranging from 0.5 m to 30 m and exhibit complex backgrounds. We adopt two standard few-shot settings: the first uses 5 manually selected novel categories (e.g., airplane, baseball field) with the remaining as base classes [51], while the second introduces four random base-novel splits to evaluate generalization robustness across different category combinations [52].
iSAID consists of 2806 large-scale satellite images with 655,451 densely annotated object instances across 15 categories. The dataset includes fine-grained objects such as small vehicles, ships, and storage tanks in densely populated urban and port areas. The original images are divided into tiles with 25% overlap, balancing memory efficiency and spatial continuity. We follow the standard FSOD setting by adopting three base-novel class splits and evaluating under 10-shot, 50-shot, and 100-shot configurations [46].
Both datasets are evaluated under a two-phase FSOD setting, where the model is first trained on base classes and then adapted to novel categories with limited samples. Performance is evaluated using Average Precision (AP), computed as the area under the precision–recall (PR) curve. Precision and recall are calculated across confidence thresholds by matching predicted boxes to ground truth boxes with an Intersection over Union (IoU) threshold of 0.5.
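As a reference for this protocol, the per-class AP can be computed from score-ranked detections as the area under the precision–recall curve. The minimal sketch below assumes detections have already been matched to ground truth at IoU ≥ 0.5 and is not the exact evaluation code used for the benchmarks.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP for one class: rank detections by confidence and integrate precision over recall."""
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_tp, dtype=float)[order]            # 1 if matched to a GT box at IoU >= 0.5
    fp = 1.0 - tp
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / max(num_gt, 1)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-12)
    recall = np.concatenate(([0.0], recall))              # prepend the recall = 0 point
    precision = np.concatenate(([1.0], precision))
    return float(np.sum((recall[1:] - recall[:-1]) * precision[1:]))
```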
4.2. Experiment Setting and Implementation Details
Our framework is built upon the Faster R-CNN [5] detector with a ResNet-101 [53] backbone pretrained on ImageNet. To extract hierarchical and multi-resolution features, we incorporate a Feature Pyramid Network (FPN) [54], enabling effective object representation across different spatial scales. The proposed Extensible Local Feature Aggregator Module is integrated on top of the multi-scale feature maps and comprises four stacked Transformer encoder layers. Each encoder layer utilizes a 256-dimensional embedding and performs attention aggregation from recursively expanded keypoint regions. These keypoints are implemented as learnable query positions, which are adaptively optimized during training through standard detection loss supervision.
For region proposal generation during the fine-tuning stage, the Self-Guided Novel Adaptation RPN adopts a dual-branch structure consisting of two identical convolutional subnetworks. These branches serve as the student and teacher, respectively, and are trained under a knowledge distillation scheme. The teacher branch is updated using an exponential moving average (EMA) of the student weights and provides pseudo-labels for novel class proposals based on high-confidence predictions. Additionally, the Teacher-Guided Dual-Branch Head also follows a student–teacher configuration. Both branches independently process the RoI-aligned features and consist of a fully connected classification head and a bounding box regression head. Pseudo-labels from the teacher branch are fused with available ground truth labels to supervise the student branch, enhancing novel class adaptation while preserving base class knowledge. Optimization is performed using the AdamW optimizer with a weight decay coefficient of 0.01 and a step-wise schedule that decays the initial learning rate by a factor of 0.1 at predefined iteration milestones.
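This optimizer configuration maps directly onto standard PyTorch components. In the sketch below, the learning rate and milestone values are placeholders (the milestones mirror the iSAID schedule described in the next paragraph), and the linear layer merely stands in for the detector's trainable parameters.

```python
import torch

model = torch.nn.Linear(256, 21)   # stand-in for the detector's trainable parameters

# AdamW with the weight decay reported above; the learning rate is a placeholder.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

# Step-wise decay by a factor of 0.1 at predefined iteration milestones.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[40_000, 60_000], gamma=0.1)
```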
Training follows a two-stage FSOD pipeline consisting of base training and few-shot fine-tuning. For the base stage, models are trained on DIOR and iSAID for 10 k, 40 k, and 80 k iterations, respectively, with learning rate decays scheduled at 24 k and 32 k iterations on DIOR, and at 40 k and 60 k on iSAID. The fine-tuning phase is conducted for 10 k iterations on both datasets. All input images are uniformly resized to a fixed resolution and augmented through multi-scale resizing (scale factors in [0.5, 2.0]), random horizontal flipping, and fixed-angle rotations (90°, 180°, 270°) to improve generalization. Batch sizes are configured as 16 during base training and 8 during fine-tuning. To stabilize pseudo-label generation in SGNA, we apply an Exponential Moving Average (EMA) update [55] with momentum coefficient $\alpha$ to maintain a temporally smoothed teacher model. During pseudo-label filtering, only candidate boxes with confidence scores above the threshold $\tau$ are retained to supervise the student branch.
Our implementation is built upon PyTorch version 1.13.0 and leverages the EarthNets [56] and MMDetection [57] toolkits. Additional details and code will be made publicly available for reproducibility.
4.3. Quantitative Results
We evaluate the performance of our method on the DIOR and iSAID datasets, as summarized in Table 1, Table 2 and Table 3. The results show that our method achieves superior performance across different few-shot settings, particularly for novel classes.
As shown in Table 1, our method achieves superior performance on the DIOR dataset, surpassing all baselines under the 3-, 5-, 10-, and 20-shot settings. For example, at 20-shot, it reaches 65.2% AP on novel classes, which is 3.9 points higher than the best-performing baseline ST-FSOD. Table 2 further demonstrates consistent superiority across different base-novel class splits. Under the challenging 3-shot and 5-shot settings, our method achieves significant gains, obtaining 32.1 AP on split3 with 3-shot and 34.6 AP with 5-shot.
On the iSAID dataset (Table 3), our approach delivers clear improvements under all shots. Compared to ST-FSOD, it achieves average gains of 5.5 AP under 10-shot, 4.6 AP under 50-shot, and 1.9 AP under 100-shot, reflecting strong generalization capability in dense and complex aerial scenes. While our method consistently outperforms existing baselines on novel classes across all few-shot settings, we also observe a slight decrease in base class performance compared to ST-FSOD under certain configurations. This phenomenon reflects a common challenge in few-shot transfer learning: adapting to novel categories with limited supervision can perturb the learned feature space, leading to marginal degradation of base class representations. In our case, the use of pseudo-labels for novel classes may introduce subtle shifts in feature distributions that interfere with base class stability. Despite this, our method maintains a favorable balance between generalization and retention, ultimately achieving superior overall detection accuracy.
These results validate the superiority of our method compared to existing few-shot detection approaches, particularly under scenarios with incomplete annotations and limited supervision. The improvements across different shot settings can be attributed to the joint enhancement of local structural modeling and global semantic adaptation in our framework. Specifically, the incorporation of multi-scale recursive attention aggregation enables more complete object reconstruction from sparse samples, while the pseudo-label-driven feature alignment strategy mitigates feature drift and improves novel-category discrimination. This comprehensive design allows our method to achieve better generalization and higher detection accuracy across diverse novel classes in remote sensing imagery.
4.4. Qualitative Results
We present qualitative comparisons of detection results on the DIOR and iSAID datasets, as illustrated in Figure 6 and Figure 7. These visualizations provide clear evidence of the effectiveness of the proposed method:
Our method improves structural completeness. The ELFAM module progressively expands attention from local keypoints across multiple feature scales, allowing the model to reconstruct full object shapes even when the training data provides only partial visual cues, whereas traditional detectors often produce fragmented or partial boxes. In the third column of the novel class results in Figure 6, the airplane is often misclassified as background by standard few-shot detectors due to limited supervision, whereas our method successfully recovers its complete structure with precise localization.
It enhances object recall in challenging contexts such as dense urban areas and mountainous regions by recovering small or ambiguous targets that are frequently missed by conventional detectors. The fourth and fifth columns of the novel class results in Figure 6 show dense urban and mountainous scenes where conventional detectors miss numerous small or background-blended targets. Our method recovers more valid objects such as buildings and industrial facilities, indicating strong robustness to visual sparsity and background interference.
Our method maintains stable detection performance for base classes even after fine-tuning on novel categories. As shown in Figure 6 and Figure 7, it still achieves high-quality detection for base class targets. This confirms that our candidate separation strategy effectively mitigates negative transfer during fine-tuning, enabling the model to adapt to novel categories without compromising base class performance.
4.5. Ablation Studies
An ablation study is performed on the first split [51] of the DIOR dataset to assess the individual contribution of each component within the proposed framework to overall detection performance. The quantitative results are summarized in Table 4, while the qualitative comparisons are illustrated in Figure 8 and Figure 9. Based on these observations, we summarize the following conclusions:
When using only the basic fine-tuning strategy, the model suffers from insufficient supervision for novel classes, leading to low recall and frequent missed detections. As observed in Figure 8 (second row), novel class instances like windmill and trainstation are often missed or partially detected. Meanwhile, base class performance remains relatively stable but still fluctuates due to interference from unstable novel class adaptation.
With the addition of the ELFAM and SGNA modules, the model achieves significant improvements in structural completeness and semantic consistency. ELFAM enhances the ability to reconstruct full object shapes by aggregating local features across scales, which is especially beneficial for partially annotated or small targets. SGNA introduces high-confidence pseudo-labels and teacher–student optimization, helping to suppress background confusion and improve feature alignment. As shown in the middle rows of Figure 8, both recall and localization quality improve notably.
As shown in the fifth row of Table 4, applying the TG-DH module individually yields notable improvements in detection performance on novel classes, attributed to the enhanced supervision provided by the teacher-guided pseudo-labels. However, this improvement is accompanied by a considerable decline in base class accuracy, suggesting that directly reinforcing novel class feature representations may disrupt the original feature distributions learned for base classes, leading to negative transfer. In contrast, the sixth row of Table 4 indicates that incorporating SGNA alongside TG-DH substantially alleviates this issue by promoting more stable global feature adaptation across both base and novel categories. These results demonstrate the complementary effects of SGNA and TG-DH in promoting global semantic consistency and improving feature adaptation under few-shot object detection settings.
When all modules are integrated, the model produces the most accurate and complete detections across novel categories, while maintaining stable base class performance. As demonstrated in the final row of Figure 8, predictions exhibit clearer boundaries, fewer false positives, and stronger generalization across diverse object types, validating the effectiveness and complementarity of the proposed components.
To provide further insights into the detection behavior, Figure 9 presents the class-wise precision–recall curves on the DIOR and iSAID datasets. On DIOR, high-AP categories such as airplane and tenniscourt exhibit sharp and stable precision–recall profiles, indicating strong discriminative feature learning. In contrast, more challenging categories like trainstation and windmill show lower precision and earlier recall degradation, reflecting the inherent difficulty in detecting small or densely distributed objects. On iSAID, despite overall lower APs caused by increased scene complexity, our method maintains stable detection performance across novel classes, demonstrating robustness under few-shot conditions.
To further validate the effectiveness of our framework in addressing practical challenges in few-shot object detection, we provide additional qualitative comparisons against a representative baseline method in Figure 10. From left to right, the examples span different novel categories in the DIOR dataset. In the first three columns, the baseline produces bounding boxes that are inaccurately positioned or excessively large, reflecting typical feature misalignment. In contrast, our method yields tighter and more precisely localized detections by leveraging the SGNA and TG-DH modules, which promote refined cross-scale alignment and stable semantic adaptation. In the last two columns, the baseline fails to detect several valid targets, highlighting the issue of structurally incomplete predictions caused by sparse or partial supervision. Our approach effectively reconstructs complete object structures by recursively expanding from semantically informative keypoints via the ELFAM module. These results further demonstrate our framework's capability in mitigating both structural incompleteness and feature misalignment under few-shot settings.
4.6. Hyperparameters Sensitivity
A sensitivity analysis is carried out to investigate the effect of two key hyperparameters in the proposed framework: the confidence threshold $\tau$ used for pseudo-label filtering in the SGNA module, and the momentum coefficient $\alpha$ employed in the EMA [55] update of the teacher model. The results on split1 of the DIOR benchmark are summarized in Table 5 and Table 6, respectively, and show stable performance across settings. The selected confidence threshold $\tau$ offers the best trade-off between novel and base classes, and the chosen EMA momentum $\alpha$ achieves strong overall performance across most settings, providing a balanced trade-off between novel and base classes. Other settings tend to favor base class stability, albeit with slightly reduced novel class detection accuracy in some cases. We therefore conclude that our method is robust to hyperparameter variation and does not require extensive tuning to achieve strong and consistent performance.
4.7. Computational Cost Analysis
To analyze the computational efficiency of each proposed component, we examine both training and inference behaviors. Specifically, we measure the average time consumed per iteration during the fine-tuning phase, as well as the inference throughput, quantified by the number of images processed per second. These measurements help quantify the computational cost associated with different module configurations. The complete results are summarized in Table 7.
As shown, ELFAM introduces only marginal overhead, increasing training time slightly from 0.125 to 0.207 seconds per image while maintaining a high inference speed of 15.6 FPS. This is expected, as ELFAM mainly involves localized attention computations over multi-scale features, which are computationally lightweight and efficiently parallelizable.
In contrast, SGNA introduces a more noticeable increase in training time (up to 0.287 seconds per image), due to the inclusion of a teacher–student collaborative path. However, since the number of fine-tuning iterations in FSOD is relatively small compared to the base training phase, the overhead remains acceptable. The decrease in inference speed (down to 8.2 FPS) is also moderate and primarily caused by the additional pseudo-label generation and filtering procedures. These operations can be further optimized with CUDA-based acceleration.
When combining ELFAM and SGNA, the training time rises to 0.378 seconds per image and the inference speed drops to 7.5 FPS. Adding the TG-DH detection head contributes slightly more overhead (0.39 seconds per image, 7.6 FPS), but this is offset by the overall performance gains observed in detection quality.
Overall, the added cost introduced by our modules remains within a practical range and can be effectively traded off for the observed improvements in detection completeness and robustness.