1. Introduction
Object detection in remote sensing imagery aims to identify and localize targets such as airplanes, ships, vehicles, and buildings from aerial or satellite images [1]. This task plays a critical role in various real-world applications, including military surveillance, urban planning, disaster monitoring, and environmental assessment. Compared to natural images, remote sensing images present unique challenges such as large variations in scale, complex backgrounds, high object density, and significant inter-class similarity [2,3]. While deep learning-based detectors, especially convolutional neural networks (CNNs), have achieved remarkable progress in recent years, their success often depends on large-scale annotated datasets. However, acquiring such annotations in the remote sensing domain is time-consuming, labor-intensive, and sometimes infeasible [4,5].
Few-shot learning provides a principled paradigm to alleviate label scarcity by enabling models to generalize from only a few labeled examples per class. In remote sensing, this capability is particularly critical for time-sensitive and dynamically evolving scenarios, such as rapid post-disaster assessment or military reconnaissance, where novel object categories (e.g., new target types, camouflaged equipment, emerging infrastructure) must be recognized with very limited supervision [5,6,7]. In such cases, conventional fully supervised detectors trained on static, closed-set benchmarks struggle to adapt, motivating a dedicated line of research on few-shot object detection in remote sensing imagery (RS-FSOD) [8,9]. Compared with few-shot object detection in natural images, RS-FSOD must additionally contend with larger scale ranges, more complex and heterogeneous backgrounds (urban, rural, maritime, mountainous), resolution-induced domain shifts, and extremely high target density in very high-resolution (VHR) imagery [10]. These factors make RS-FSOD not merely a direct extension of generic FSOD, but an independent and more challenging research problem.
Few-shot object detection (FSOD) extends few-shot learning by jointly solving classification and bounding-box localization [8,11]. Existing FSOD approaches can be broadly grouped into meta-learning-based and transfer learning-based methods [8]. On the meta-learning side, MetaYOLO [12] learns class-agnostic feature importance and dynamically updates detector weights from a few support examples, while Meta R-CNN [13] infers class-specific attention vectors from support images and uses them to modulate RoI features. FsDetView [14] further introduces a class encoder and query encoder whose outputs are fused by channel-wise multiplication, difference, and concatenation, and Prototype-CNN [15] employs class prototypes to guide proposal generation in remote sensing scenes. However, these meta-learning methods mostly inject class-conditional cues at a late RoI stage or via predominantly linear fusion, and rarely perform explicit, support-conditioned multi-scale allocation in the proposal process, which can limit recall under cluttered, scale-diverse remote-sensing scenes.
Transfer learning-based methods instead adapt a pre-trained detector to novel classes with limited annotations. Chen et al. [16] combine SSD and Faster R-CNN with background suppression regularization to improve data efficiency, while TFA [17] fine-tunes only the final layers with instance-level feature normalization to reduce intra-class variance and catastrophic forgetting. For remote sensing images, Zhao et al. [18] introduce an involution-based backbone with path aggregation and shape bias, and Zhang et al. [19] design a framework that couples metric-based loss with representation compensation and knowledge distillation. Despite these advances, most transfer-learning pipelines remain support-agnostic at the multi-scale feature selection stage: they do not explicitly use class-specific cues to steer the feature pyramid or RPN toward appropriate scales for novel categories, which is particularly problematic for small or densely distributed targets in remote sensing imagery.
From the perspective of how support information is fused into the query stream, FSOD methods have evolved along a relatively clear trajectory. Early work adopted shallow, almost plug-in fusion strategies: support and query features were concatenated, added or multiplied channel-wise, and then passed to a linear detection head [12,17]. These designs are easy to optimize, but the support signal only weakly modulates the query features and tends to under-utilize the limited support set, especially when novel categories are visually similar, heavily cluttered, or small in scale as in remote sensing scenes. Prototype-based conditioning was introduced next. Meta R-CNN [13] learns class-specific attention vectors from support images to reweight RoI features, FsDetView [14] fuses a class encoder and a query encoder via multiple elementwise branches, and prototype-guided RPNs [15] bias proposal generation using learned class prototypes. These approaches push support–query interaction deeper into the network, but fusion remains mostly linear and is often applied only at a late RoI stage.
More recently, non-linear and higher-order fusion mechanisms have begun to attract attention. Cross-attention modules [20] condition one feature stream on another via query–key–value interactions, enabling adaptive, context-aware modulation of query features by the support set. Bilinear and second-order pooling modules [21] explicitly model multiplicative channel relationships and richer support–query statistics, and have been explored to enhance fine-grained discrimination in FSOD. However, naive applications of cross-attention and bilinear fusion are typically parameter-intensive and data-hungry, which is problematic in RS-FSOD where annotations are scarce and domain shifts are pronounced. Moreover, many existing architectures still inject support information only at the RoI head, leaving the Region Proposal Network (RPN) and multi-scale feature pyramid largely support-agnostic, which is problematic when remote sensing targets are small, densely distributed, and heavily scale-dependent. In this trajectory, our proposed framework can be viewed as a lightweight, remote-sensing–oriented continuation.
Meta-learning is designed specifically for scenarios with extremely limited samples [22]. Rather than being categorically “better” than transfer learning, meta-learning offers a complementary route that emphasizes fast adaptation from few examples, while transfer learning preserves base-class knowledge via careful fine-tuning. Despite recent advances, meta-learning-based RS-FSOD still faces three key issues. First, current methods struggle to effectively utilize support set information to enhance the RPN module, and the feature fusion between support and query images remains suboptimal [9]. Second, large scale variations and strong intra-class diversity in remote sensing imagery are not well addressed, leading to reduced detection robustness. While stronger backbones and path aggregation with shape bias [18] improve multi-scale robustness, these components are largely support-agnostic and thus do not explicitly encode class-conditioned scale features for novel categories. Third, the fusion between RoI features and support features is often overly simplistic, limiting the model’s ability to capture their complex relationships and adapt to novel classes. Common heads rely on cosine similarity or single-branch elementwise fusion (e.g., multiplication, difference, or concatenation as in [14]; lightly fine-tuned linear heads as in [17]), which underuse complementary signals.
To address the above three issues, we propose an Enhanced Feature Fusion Module (EFFM) that integrates class-specific prior information from support features into the query features, thereby enhancing the query’s responsiveness to support instances while preserving the structural information of the query image. The EFFM consists of two sub-modules: the Multi-scale Fine-grained Support Feature Extraction Module (MFSFEM) and the Feature Allocation Fusion Mechanism (FAFM). MFSFEM is responsible for extracting rich and detailed support features across multiple scales, aiming to tackle the challenges of significant intra-class variations and small inter-class differences in remote sensing imagery. FAFM then adaptively allocates and fuses these extracted features into the query representation, ensuring more precise alignment and interaction between support and query features.
On top of this, we further introduce a Non-Linear Fusion Module (NLFM), which improves upon conventional RoI feature fusion strategies in meta-learning. NLFM enhances the model’s representational capacity and better captures complex semantic relationships, allowing it to more effectively handle the intricate and heterogeneous features present in remote sensing images.
The main contributions of this paper can be summarized as follows: (1) We propose an Enhanced Feature Fusion Module composed of MFSFEM and FAFM, which jointly extract and inject fine-grained, multi-scale support features into query representations for better detection of novel objects. (2) We introduce a Non-Linear Fusion Module that models complex interactions between support and query features, improving feature discriminability and generalization under few-shot settings. (3) We build a unified meta-learning detection framework that integrates EFFM and NLFM in both proposal and classification stages. Experiments on NWPU VHR-10, iSAID and DIOR demonstrate significant performance gains over state-of-the-art methods.
2. Materials and Methods
In this section, we provide a detailed description of our meta-learning-based few-shot object detection method, which is designed based on the Faster R-CNN [23] framework. The core components of our model include the backbone, RPN, EFFM, NLFM, and RCNN head, as shown in Figure 1. EFFM is designed to embed class-specific information from support features into query features, enhancing the model’s ability to detect novel objects. It is composed of MFSFEM and FAFM. MFSFEM extracts fine-grained support features at multiple scales (P3–P5) using learnable query vectors and cross-attention. FAFM then fuses these features into the corresponding query feature maps based on similarity, producing support-aware query representations. This feature-level integration guides the RPN to generate proposals more closely aligned with support instances. NLFM enhances the interaction between RoI features and support features through a more expressive fusion strategy. By replacing traditional simple fusion methods, NLFM improves feature alignment and boosts the model’s ability to recognize novel objects in complex remote sensing scenes.
2.1. EFFM
Few-shot object detection based on meta-learning follows a dual-branch framework [26], where both support and query sets are used during training and testing. The Enhanced Feature Fusion Module is a key component designed to bridge these two branches and consists of two parts: the Multi-scale Fine-grained Support Feature Extraction Module and the Feature Allocation Fusion Mechanism. The first module extracts fine-grained, multi-scale semantic features from support instances, while the second module integrates these features into the query image representation. Compared to using raw query features alone, this fusion allows the RPN to generate proposals more effectively by focusing on regions relevant to the support instances, rather than searching blindly across the entire query image.
2.1.1. Multi-Scale Fine-Grained Support Feature Extraction Module
We follow the FPN [27] architecture to obtain multi-scale feature maps, which helps address the large scale variation commonly found in remote sensing imagery. FPN is a feature fusion structure that combines top-down and lateral connections, allowing high-level semantic information extracted from the backbone to be propagated to higher-resolution feature maps. While the original FPN produces feature maps from P2 to P5, in our design we utilize only P3 to P5 for subsequent processing.
As shown in Figure 2, the Multi-scale Fine-grained Support Feature Extraction Module takes the support instance and extracts its multi-scale features through the backbone and FPN, resulting in feature maps at levels P2 to P5, of which P3 to P5 are used. To capture fine-grained details at different scales, we introduce level feature queries: three learnable query vectors assigned to each support class during training. These vectors correspond to the P3, P4, and P5 feature maps and are used to extract fine-grained semantic representations from each scale. During training, the query vectors are progressively refined to specialize in capturing the most informative features at their respective levels. Similar to the object queries used in DETR [28], we employ a cross-attention mechanism [20] to extract features from the P3, P4, and P5 feature maps. Each level feature query attends to its corresponding feature map, allowing the model to selectively focus on informative regions and obtain fine-grained support representations at multiple scales.
The fine-grained support feature extraction process is performed separately at multiple scales using a cross-attention mechanism between the level feature queries and the corresponding support feature maps. For a given level $l \in \{3, 4, 5\}$, let the support feature map be denoted as:
$$F_s^l \in \mathbb{R}^{H_l \times W_l \times d}$$
where $H_l$ and $W_l$ are the height and width of the feature map at level $l$, and $d$ is the channel dimension. We flatten the spatial dimensions, yielding:
$$\tilde{F}_s^l \in \mathbb{R}^{N_l \times d}, \quad N_l = H_l W_l$$
Each level is assigned a learnable level feature query vector:
$$q^l \in \mathbb{R}^{1 \times d}$$
To compute the attention between the query and the support feature map, we first project the support features to the same latent space as the query using a linear transformation:
$$K^l = \tilde{F}_s^l W_K$$
where $W_K \in \mathbb{R}^{d \times d}$ is a learnable weight matrix. We then compute the attention weights:
$$A^l = \mathrm{softmax}\left(\frac{q^l (K^l)^{\top}}{\sqrt{d}}\right) \in \mathbb{R}^{1 \times N_l}$$
The softmax is applied along the spatial dimension $N_l$, and the resulting attention weights are used to aggregate the support features:
$$s^l = A^l \tilde{F}_s^l \in \mathbb{R}^{1 \times d}$$
Here, $s^l$ is the fine-grained support feature vector at level $l$, which captures the most relevant information from the spatial regions of the support feature map in relation to the level feature query.
During training, the level feature queries are optimized to adaptively focus on discriminative regions across scales, enabling the extraction of rich semantic cues from support instances. These vectors serve as learnable class-conditioned extractors that help encode subtle visual patterns critical for few-shot recognition in remote sensing scenarios.
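The attention pooling described in this subsection can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the scaling by the square root of the channel dimension, the aggregation of unprojected support features, and the names `level_query_pool`, `project`, and `w_key` are hypothetical choices made for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def project(vec, weight):
    """Row vector (d,) times matrix (d x d_out)."""
    return [sum(v * weight[i][j] for i, v in enumerate(vec))
            for j in range(len(weight[0]))]

def level_query_pool(query, support_feats, w_key):
    """Pool a flattened support feature map (N x d) with one level query (d,).

    Returns the pooled support vector and the attention weights.
    """
    d = len(query)
    keys = [project(f, w_key) for f in support_feats]
    # Assumption: scaled dot-product scores between the query and each key.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    attn = softmax(scores)  # normalised over the N spatial positions
    pooled = [sum(a * f[c] for a, f in zip(attn, support_feats))
              for c in range(d)]
    return pooled, attn

# Toy example: 3 spatial positions, d = 2, identity projection.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
pooled, attn = level_query_pool([2.0, 0.0], feats, identity)
```

In the toy example the query points along the first channel, so the two positions with a non-zero first channel receive equal and larger weights than the remaining one. In a real detector the query and projection would be learned tensors, one query per level and class.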
2.1.2. Feature Allocation Fusion Mechanism
To integrate support features into the query representation, we design FAFM based on cross-attention. The fusion is performed separately for each FPN level $l \in \{3, 4, 5\}$. Taking level $l$ as an example, as shown in Figure 3, each class in the support set is represented by a fine-grained support feature vector extracted at this level, and these vectors are stacked as:
$$S^l \in \mathbb{R}^{c \times d}$$
where $c$ is the number of support classes, and $d$ is the channel dimension. Let the query feature map at level $l$ be flattened as:
$$Q^l \in \mathbb{R}^{N_l \times d}, \quad N_l = H_l W_l$$
To compute similarity between query features and class-level support features, we project both into a shared latent space using a linear layer $W_p$:
$$\hat{Q}^l = Q^l W_p, \quad \hat{S}^l = S^l W_p$$
Here, we use a shared projection matrix $W_p \in \mathbb{R}^{d \times d}$ for both query and support features in order to reduce the number of parameters and improve training efficiency, which is especially important under the low-data regime of few-shot learning. Then we compute the cross-attention weights:
$$A^l = \mathrm{softmax}\left(\frac{\hat{Q}^l (\hat{S}^l)^{\top}}{\sqrt{d}}\right) \in \mathbb{R}^{N_l \times c}$$
The support features are weighted and fused into the query feature map:
$$\tilde{Q}^l = Q^l + \alpha A^l S^l$$
where $\alpha$ is a hyperparameter that controls the strength of feature fusion. In our implementation, $\alpha$ is treated as a manually tunable scalar rather than a learnable parameter, allowing for flexible adjustment based on validation performance or prior knowledge.
This process injects class-aware semantic information into the query features at each scale, allowing the model to focus more effectively on support-relevant regions. This performs a class-conditioned projection of the query toward the subspace spanned by the support features at the same scale, suppressing background and off-class responses. It yields better-aligned, more discriminative maps (lower within-class variance, larger between-class margins), while the residual path preserves training stability.
By applying FAFM across levels P3, P4, and P5, the resulting query feature maps become semantically aligned with the support set. This cross-image feature guidance enhances the performance of RPN, leading to more accurate candidate regions that are conditioned on the target support classes.
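The residual, similarity-weighted fusion of FAFM can be illustrated with the following sketch for a single FPN level. It is written under assumptions that the text does not fix: the softmax is taken over the support classes, the scores are scaled by the square root of the latent dimension, and the unprojected support vectors are the ones added back to the query. The names `fafm_fuse` and `w_shared` are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def project(vec, weight):
    """Row vector (d,) times matrix (d x d_out)."""
    return [sum(v * weight[i][j] for i, v in enumerate(vec))
            for j in range(len(weight[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fafm_fuse(query_feats, class_feats, w_shared, alpha):
    """Fuse class-level support vectors (c x d) into flattened query features (N x d)."""
    d_latent = len(w_shared[0])
    # The same projection is applied to support and query features.
    proj_support = [project(s, w_shared) for s in class_feats]
    fused = []
    for q in query_feats:
        proj_q = project(q, w_shared)
        scores = [dot(proj_q, ps) / math.sqrt(d_latent) for ps in proj_support]
        attn = softmax(scores)  # assumption: normalised over the c classes
        mix = [sum(a * s[k] for a, s in zip(attn, class_feats))
               for k in range(len(q))]
        # Residual path: the original query feature is kept and the
        # support mixture is added with strength alpha.
        fused.append([qk + alpha * mk for qk, mk in zip(q, mix)])
    return fused

identity = [[1.0, 0.0], [0.0, 1.0]]
queries = [[1.0, 0.0], [0.0, 1.0]]
classes = [[4.0, 0.0], [0.0, 4.0]]
fused = fafm_fuse(queries, classes, identity, alpha=0.5)
```

Setting `alpha` to zero returns the query features unchanged, which shows how the fusion strength acts as a simple dial between the raw and the support-aware representation.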
2.2. NLFM
To enhance the model’s ability to discriminate between complex object features in remote sensing images, we design a Non-Linear Fusion Module (NLFM) to effectively integrate support information into the RoI features.
After proposal generation by RPN, each query RoI is aligned using RoI Align, followed by spatial pooling to produce a fixed-size feature vector, denoted as $r \in \mathbb{R}^{d}$, where $d$ is the channel dimension.
Meanwhile, for each class in the support set, we extract class-specific semantic representations by applying global average pooling to the P5-level feature maps. For a support image belonging to class $i$, the resulting support vector is denoted as $s_i \in \mathbb{R}^{d}$.
To better address the challenges of high intra-class variation and limited training data in few-shot detection, we extend the traditional feature fusion strategies by introducing a Non-Linear Fusion Module, as shown in Figure 4. Unlike prior methods that rely on simple element-wise multiplication to combine RoI and support features, our approach applies a richer set of transformations to enhance semantic alignment and category specificity.
For a given RoI feature $r$ and a class-specific support vector $s_i$, we construct the fused representation $f_i$ as follows:
$$f_i = \mathrm{Concat}\big(\phi_1(r \odot s_i),\ \phi_2(r - s_i),\ \phi_3(r)\big)$$
Here, $\odot$ denotes element-wise multiplication, and each $\phi_k$ represents a nonlinear transformation function defined as:
$$\phi_k(x) = \mathrm{ReLU}\big(\mathrm{LN}(W_k x + b_k)\big)$$
where $W_k$ and $b_k$ are learnable parameters for each transformation branch, LN denotes layer normalization, and ReLU [29] introduces nonlinearity.
This formulation enables the model to simultaneously learn class-aware interactions from the multiplicative term $r \odot s_i$, emphasize discriminative differences through the subtraction term $r - s_i$, and preserve spatial and contextual cues via the transformed original RoI feature. These three complementary signals contribute to a more robust and expressive fused representation under limited supervision.
The resulting vector $f_i$ is then forwarded to the RCNN [23] classification and regression heads for category prediction and bounding box refinement.
Through this nonlinear and learnable design, NLFM enhances the model’s ability to extract discriminative features under limited supervision, especially when detecting visually similar objects or small targets in complex scenes.
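A three-branch fusion of this kind can be sketched as follows. The sketch assumes that the branch outputs are concatenated, that each branch applies a linear layer, then layer normalization without a learnable affine term, then ReLU, and that every branch has its own weights. The function names and the identity weights in the toy example are illustrative only.

```python
import math

def linear(x, weight, bias):
    """Affine map; weight is d_out x d_in, bias has length d_out."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def layer_norm(x, eps=1e-5):
    """Layer normalization without learnable scale or shift (assumption)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def relu(x):
    return [max(0.0, v) for v in x]

def branch(x, weight, bias):
    """One transformation branch: Linear -> LN -> ReLU."""
    return relu(layer_norm(linear(x, weight, bias)))

def nlfm_fuse(roi, support, params):
    """Fuse one RoI vector with one class support vector.

    params holds three (weight, bias) pairs, one per branch.
    """
    product = [r * s for r, s in zip(roi, support)]      # class-aware interaction
    difference = [r - s for r, s in zip(roi, support)]   # discriminative difference
    fused = []
    for inp, (weight, bias) in zip((product, difference, roi), params):
        fused.extend(branch(inp, weight, bias))          # assumption: concatenation
    return fused

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zero_bias = [0.0, 0.0, 0.0]
params = [(identity, zero_bias)] * 3
fused = nlfm_fuse([1.0, 2.0, 3.0], [3.0, 2.0, 1.0], params)
```

With three branches of output width 3, the fused vector has length 9 and would be passed on to the classification and regression heads.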
2.3. Loss Function
The overall loss function of our model is inherited from Meta R-CNN [13], consisting of three components: the RPN loss, the RCNN detection loss, and a meta-learning-based loss that enhances the discriminability of support features.
The total loss is defined as:
$$L_{total} = L_{rpn} + L_{rcnn} + \lambda L_{meta}$$
where $L_{rpn}$ denotes the objectness classification and bounding box regression loss from the Region Proposal Network, $L_{rcnn}$ represents the standard classification and regression losses computed from the fused RoI features, and $L_{meta}$ is a meta-level loss designed to enhance the discriminability of support vectors by minimizing intra-class variance and maximizing inter-class separation. The coefficient $\lambda$ is an empirically chosen hyperparameter that controls the relative importance of the meta loss with respect to the standard detection losses.
The RPN loss [30], $L_{rpn}$, is composed of two parts: a binary cross-entropy classification loss [31] for objectness prediction and a Smooth $L_1$ loss [32] for bounding box regression. It is defined as:
$$L_{rpn} = \frac{1}{N_{cls}} \sum_{i} \mathrm{BCE}(p_i, p_i^{*}) + \frac{1}{N_{reg}} \sum_{i} \mathbb{1}[p_i^{*} = 1]\, \mathrm{Smooth}_{L_1}(t_i, t_i^{*})$$
where $p_i$ is the predicted objectness score for anchor $i$, and $p_i^{*}$ is the ground-truth label. BCE stands for the binary cross entropy function, and $\mathrm{Smooth}_{L_1}$ denotes the smooth L1 loss. The predicted and ground-truth bounding box regression targets are denoted by $t_i$ and $t_i^{*}$, respectively. The indicator function $\mathbb{1}[p_i^{*} = 1]$ ensures that the regression loss is only applied to positive anchors. $N_{cls}$ and $N_{reg}$ are the total numbers of anchors used for normalization.
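The two terms of the RPN loss can be written out as a short sketch. The smooth L1 threshold `beta=1.0`, the summation over the four box coordinates, and passing the two normalizers in explicitly as `n_cls` and `n_reg` are assumptions made for illustration; the function names are hypothetical.

```python
import math

def bce(p, y, eps=1e-12):
    """Binary cross-entropy for one predicted probability p and label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 summed over box coordinates (beta is an assumed threshold)."""
    total = 0.0
    for a, b in zip(pred, target):
        diff = abs(a - b)
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total

def rpn_loss(scores, labels, deltas, target_deltas, n_cls, n_reg):
    """Objectness BCE over all anchors plus smooth L1 over positive anchors."""
    cls_term = sum(bce(p, y) for p, y in zip(scores, labels)) / n_cls
    reg_term = sum(
        smooth_l1(t, t_star)
        for y, t, t_star in zip(labels, deltas, target_deltas)
        if y == 1  # indicator: only positive anchors contribute
    ) / n_reg
    return cls_term + reg_term

# Two anchors: one positive with a regression error, one negative
# whose (large) regression error must be ignored.
loss = rpn_loss(
    scores=[0.5, 0.5], labels=[1, 0],
    deltas=[[2.0, 0.0, 0.0, 0.0], [9.0, 9.0, 9.0, 9.0]],
    target_deltas=[[0.0] * 4, [0.0] * 4],
    n_cls=2, n_reg=2,
)
```

The negative anchor contributes to the classification term only, which is the role of the indicator function in the definition.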
The RCNN loss, $L_{rcnn}$, includes multi-class classification and bounding box regression on the fused RoI features. It is formulated as:
$$L_{rcnn} = \frac{1}{N} \sum_{j} \mathrm{CE}(p_j, c_j^{*}) + \frac{1}{N_{pos}} \sum_{j \in \mathrm{pos}} \mathrm{Smooth}_{L_1}(b_j, b_j^{*})$$
where $p_j$ denotes the predicted class probabilities for the $j$-th RoI, $c_j^{*}$ is the ground truth class label, $b_j$ and $b_j^{*}$ are the predicted and true bounding box parameters respectively, CE stands for the cross entropy loss, and $\mathrm{Smooth}_{L_1}$ denotes the smooth L1 loss. $N$ is the total number of RoIs in the mini-batch used for classification loss normalization, and $N_{pos}$ is the number of positive RoIs used for bounding box regression loss normalization.
The meta loss, $L_{meta}$, is introduced to explicitly enforce discriminative support feature representations. Each support feature vector $s_i$ is passed through a linear classifier $W_{meta}$ to produce logits:
$$z_i = W_{meta}\, s_i$$
which are then compared to the one-hot encoded ground truth label $y_i$ using cross-entropy loss:
$$L_{meta} = \mathrm{CE}\big(\mathrm{softmax}(z_i),\ y_i\big)$$
Together, these losses jointly optimize the model, ensuring accurate object proposals, precise detection, and robust, discriminative support feature learning for few-shot detection scenarios.
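As a rough illustration of how the meta loss and the weighted total might be computed, consider the sketch below. It assumes that the meta loss is averaged over the support vectors, that the linear classifier has a bias term, and that labels are given as class indices (equivalent to one-hot targets); none of these details is fixed by the text, and all function names are hypothetical.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one logit vector and an integer class label."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def meta_loss(support_vectors, labels, weight, bias):
    """Mean cross-entropy of a linear classifier applied to support vectors.

    weight is num_classes x d, bias has length num_classes (assumed).
    """
    losses = []
    for vec, label in zip(support_vectors, labels):
        logits = [sum(w * v for w, v in zip(row, vec)) + b
                  for row, b in zip(weight, bias)]
        losses.append(cross_entropy(logits, label))
    return sum(losses) / len(losses)

def total_loss(l_rpn, l_rcnn, l_meta, lam):
    """Sum of detection losses plus the meta loss weighted by lam."""
    return l_rpn + l_rcnn + lam * l_meta

# With an all-zero classifier every class is equally likely,
# so the loss equals log(num_classes).
zero_weight = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
zero_bias = [0.0, 0.0, 0.0]
l_meta = meta_loss([[1.0, 2.0], [3.0, 4.0]], [0, 2], zero_weight, zero_bias)
combined = total_loss(1.0, 2.0, 4.0, lam=0.5)
```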
2.4. Evaluation Criteria
To comprehensively evaluate the performance of our proposed few-shot object detection model in remote sensing imagery, we adopt a set of standard evaluation metrics widely used in the object detection literature. These metrics assess detection accuracy, localization precision, and the robustness of the model under limited supervision.
- (1) Intersection over Union (IoU) [33]: IoU is used to determine whether a predicted bounding box correctly matches a ground truth object. It is the area of the intersection of the two boxes divided by the area of their union.
- (2) Average Precision (AP) [34]: AP measures the area under the precision-recall curve for each class $c$, and is defined as:
$$AP_c = \int_0^1 P_c(r)\, dr$$
where $P_c(r)$ denotes the precision at recall level $r$ for class $c$.
- (3) Precision, Recall, and F1-Score: These metrics provide detailed insight into the detection performance.
- (4) Proposal Recall (Recall@N): To evaluate the quality of region proposals generated by the Region Proposal Network (RPN), we adopt the Recall@N metric, which quantifies the ability of the top-$N$ proposals to cover ground-truth objects with sufficient overlap. It is defined as:
$$\mathrm{Recall@}N = \frac{1}{|G|} \sum_{g \in G} \mathbb{1}\Big[\max_{p \in P_N} \mathrm{IoU}(g, p) \ge \tau\Big]$$
where $G$ is the set of ground-truth bounding boxes with total number $|G|$, $P_N$ denotes the top-$N$ proposals generated by the RPN, $\mathrm{IoU}(g, p)$ represents the Intersection-over-Union between ground-truth box $g$ and proposal $p$, $\tau$ is the IoU threshold (typically set to 0.5), and $\mathbb{1}[\cdot]$ is the indicator function that equals 1 if the condition holds, and 0 otherwise.
- (5) Inference Efficiency: We also report inference time per image and model parameter size to assess practical deployment viability, especially for large-scale remote sensing tasks.
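The IoU, AP, and Recall@N criteria listed above can be sketched as follows. Boxes are assumed to be axis-aligned tuples `(x1, y1, x2, y2)`, proposals are assumed to be already sorted by score, and AP is computed with the common all-point interpolation (monotone precision envelope), which is one of several conventions and may differ from the evaluation code used for the reported numbers.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def recall_at_n(gt_boxes, ranked_proposals, n, tau=0.5):
    """Fraction of ground-truth boxes covered by any of the top-n proposals."""
    top = ranked_proposals[:n]
    covered = sum(1 for g in gt_boxes if any(iou(g, p) >= tau for p in top))
    return covered / len(gt_boxes) if gt_boxes else 0.0

def average_precision(scored_hits, num_gt):
    """Area under the precision-recall curve for one class.

    scored_hits is a list of (confidence, is_true_positive) pairs.
    """
    ranked = sorted(scored_hits, key=lambda item: -item[0])
    tp, precisions, recalls = 0, [], []
    for rank, (_, hit) in enumerate(ranked, start=1):
        tp += 1 if hit else 0
        precisions.append(tp / rank)
        recalls.append(tp / num_gt)
    # Monotone envelope: precision at recall r is the best precision
    # achieved at any recall >= r.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

gts = [(0, 0, 2, 2), (10, 10, 12, 12)]
proposals = [(0, 0, 2, 2), (50, 50, 52, 52), (10, 10, 12, 12)]
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
```

In this toy case the second ground-truth box is only covered by the third-ranked proposal, so Recall@2 is lower than Recall@3.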
3. Results
In this section, we provide a detailed overview of our experimental results. We begin by introducing the datasets utilized in our study, highlighting their main features and outlining the evaluation metrics applied. Following this, we showcase the performance of our meta-learning-based few-shot object detection approach on these datasets, benchmarking it against leading state-of-the-art methods. Additionally, to assess the individual contributions of each component within our model, we conduct ablation experiments that quantitatively measure their impact on the overall detection effectiveness.
3.1. Data Set
We conducted experiments on three remote sensing object detection datasets:
- (1) NWPU VHR-10 [35]: The NWPU VHR-10 dataset is a widely used benchmark for object detection in very high-resolution remote sensing images. It contains 800 optical images acquired from various urban and rural areas, with spatial resolutions ranging from 0.5 to 2 m. The dataset covers 10 common object categories such as airplanes, ships, storage tanks, baseball diamonds, and vehicles. Each object is annotated with horizontal bounding boxes, providing accurate localization information for detection tasks. NWPU VHR-10 poses challenges including significant variations in object scale, orientation, and complex backgrounds, which reflect real-world scenarios in aerial imagery analysis. Due to its diversity and moderate size, NWPU VHR-10 serves as a valuable benchmark for evaluating and comparing object detection algorithms designed for remote sensing imagery.
- (2) iSAID [36]: In remote sensing object detection, iSAID serves as a large, challenging benchmark with 2806 high-resolution images and 655,451 annotated objects spanning 15 common categories. It exhibits dense layouts, strong scale variation, arbitrary orientations, class imbalance, and cluttered backgrounds. Many targets are tiny and visually ambiguous, so effective detectors must leverage multi-scale features and context to localize them reliably. These characteristics make iSAID well suited for evaluating few-shot methods, stressing robustness to small objects, crowding, and appearance diversity without dataset-specific tuning.
- (3) DIOR [37]: DIOR (object DetectIon in Optical Remote sensing images) is a large-scale, publicly available dataset specifically designed for object detection in remote sensing imagery. It contains over 23,000 images captured from diverse geographic regions, covering both urban and rural environments to ensure a wide range of scenarios and conditions. The dataset comprises 20 object categories frequently encountered in aerial and satellite images, including airplanes, ships, vehicles, storage tanks, and more. Each object instance is annotated with a horizontal bounding box. DIOR presents significant challenges such as substantial scale variations, dense and cluttered backgrounds, and high intra-class diversity, making it a comprehensive benchmark for evaluating the robustness and accuracy of object detection methods in remote sensing applications.
3.2. Experimental Setting
Following common conventions adopted by numerous state-of-the-art few-shot object detection methods, we divide the NWPU VHR-10, iSAID and DIOR datasets into base and novel categories to evaluate the generalization capability of our model. The base classes are used for meta-training, where the model learns general object detection knowledge, while the novel classes are reserved for meta-testing to assess few-shot adaptation. This split ensures that the model is tested on categories it has never seen during base training, aligning with the standard protocol for few-shot object detection benchmarks.
Table 1, Table 2 and Table 3 show the specific category splits. For a given split, the classes listed under Novel are held out during meta base training, while the remaining classes (base classes) are used to train the detector. Novel classes are used only in the meta-fine-tuning phase. Different splits are used to reduce category selection bias.
In our experiments, we adopt a ResNet-101 backbone with FPN structure for image feature extraction. The backbone is initialized with weights pre-trained on the ImageNet dataset, which provides strong low-level and high-level feature representations for downstream detection tasks. This configuration helps the model to effectively capture the scale variance and semantic diversity of remote sensing targets.
We implement our model using the MMFewShot [38] detection framework, which offers modular, reproducible, and scalable implementations for few-shot object detection. The training and inference processes are conducted on a single NVIDIA GeForce RTX 4090 GPU with 24 GB of memory.
To mitigate overfitting in the few-shot regime, we employ a comprehensive data augmentation strategy during both meta-training and meta-fine-tuning. This strategy includes random horizontal and vertical flipping and random rotation to simulate variations in object orientation and viewpoint. To increase robustness against changes in illumination and sensor characteristics, we apply mild random color jittering to brightness, contrast, and saturation. Furthermore, random scaling followed by cropping to the original input size is used to enhance the model’s invariance to object scale. All transformations are applied independently to each image in an episode, with geometric transformations applied consistently to both the image and its corresponding bounding box annotations for support samples. This pipeline significantly increases the effective diversity of the limited training data, helping the model learn more robust and generalizable features.
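Applying a geometric transformation consistently to an image and its boxes can be illustrated with flipping. The sketch below assumes boxes in `(x1, y1, x2, y2)` form with continuous coordinates in the range from 0 to the image width or height, and represents the image as a list of pixel rows; the helper names are hypothetical and not taken from any augmentation library.

```python
def hflip_image(rows):
    """Mirror an image (list of pixel rows) left to right."""
    return [list(reversed(row)) for row in rows]

def hflip_boxes(boxes, image_width):
    """Mirror boxes left to right so they still cover the same objects."""
    return [(image_width - x2, y1, image_width - x1, y2)
            for x1, y1, x2, y2 in boxes]

def vflip_image(rows):
    """Mirror an image top to bottom."""
    return [list(row) for row in reversed(rows)]

def vflip_boxes(boxes, image_height):
    """Mirror boxes top to bottom."""
    return [(x1, image_height - y2, x2, image_height - y1)
            for x1, y1, x2, y2 in boxes]

image = [[1, 2, 3], [4, 5, 6]]
boxes = [(10, 20, 30, 40)]
flipped_image = hflip_image(image)
flipped_boxes = hflip_boxes(boxes, image_width=100)
```

Flipping twice restores both the pixels and the annotations, which is a quick sanity check for any image and box transform pair.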
During the meta-training phase, we train the model for a total of 20,000 iterations using episodic training with a batch size of 15 tasks per iteration. We use the SGD optimizer, setting the initial learning rate to 0.0005, the momentum coefficient to 0.9, and a weight decay of 0.0001 to mitigate overfitting.
In the meta-testing phase, the model is adapted using 300 iterations with a reduced learning rate of 0.0001 to ensure stable learning under the few-shot setting. This two-stage learning framework enables the model to generalize well to novel object categories with limited training samples, while retaining discriminative capabilities on base classes.
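For quick reference, the training schedule stated above can be collected into a plain configuration dictionary. This is only a summary of the reported values with hypothetical key names; it is not an MMFewShot configuration file, and settings that the text does not state (for example, the optimizer used during fine-tuning) are deliberately left out.

```python
# Values copied from the text; key names are illustrative only.
META_TRAINING = {
    "iterations": 20000,
    "tasks_per_iteration": 15,
    "optimizer": "SGD",
    "learning_rate": 0.0005,
    "momentum": 0.9,
    "weight_decay": 0.0001,
}

META_FINE_TUNING = {
    "iterations": 300,
    "learning_rate": 0.0001,
}

# Ratio between the base-training and fine-tuning learning rates.
lr_reduction = META_TRAINING["learning_rate"] / META_FINE_TUNING["learning_rate"]
```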
3.3. Results on NWPU VHR-10 Dataset
In this subsection, we present the few-shot detection results on the NWPU VHR-10 [35] dataset under different shot settings (3-shot, 5-shot, 10-shot, and 20-shot) across two splits. We compare our proposed method with several state-of-the-art approaches including TFA [17], Meta R-CNN [13], PCNN [15], FsDetView [14], G-FSDet [19], FCT [39] and FS-DETR [40]. The evaluation metric is nAP, the mean average precision over the novel classes, which directly measures performance on novel categories in few-shot settings. As shown in Table 4, our method outperforms the baselines in most settings, particularly in low-shot regimes, demonstrating strong generalization to novel categories.
Across both splits, our method leads in most shot settings, and the trends align with how each approach fuses support information. TFA, which relies on minimal fine-tuning without explicit support conditioning, lags far behind in all regimes. Meta R-CNN and FsDetView inject support only at the RoI stage and predominantly through linear modulation or element-wise mixing; this late, linear fusion is brittle for look-alike categories in remote sensing. PCNN improves proposals via a single prototype, but this under-represents the strong intra-class diversity typical of aerial targets. G-FSDet offers stronger proposal priors, yet still lacks the combination of early, class-conditioned, multi-scale guidance and expressive RoI fusion. In contrast, our pipeline first uses EFFM to steer the RPN toward class-relevant regions across scales and then applies NLFM to preserve complementary views before mixing. This two-stage alignment explains the data: in the most challenging 3-shot setting we improve over the strongest baseline by about 9 points on Split 1 and 8 points on Split 2; at 10-shot we are on par with the best method on Split 1 but hold a double-digit margin on Split 2; and at 20-shot we retain a clear lead on both splits. These results indicate that early support-aware proposals plus non-linear RoI fusion reduce within-class variance and increase between-class separability, yielding consistent gains from low to high shot regimes.
The visual comparison below illustrates the detection results of our model alongside G-FSDet [19] and PCNN [15]. G-FSDet [19] fails to detect the novel class tennis court, while PCNN [15] misses several instances of the small object airplane. Both models exhibit missed detections for the small object class vehicle. In contrast, our model demonstrates superior detection performance, successfully identifying both base and novel classes with higher accuracy and completeness.
Figure 5 illustrates two recurring effects that are consistent with our design and with the quantitative results in
Table 4. First, our detector tends to recover small and densely distributed targets (e.g., ships/vehicles) that competing methods miss; this matches EFFM’s role of injecting class-conditioned, multi-scale information into the proposal stage and explains the larger margins at 3-shot on both splits and the clear lead at 20-shot. Second, compared with competing methods, our predictions exhibit tighter localization and higher-confidence true positives under the same NMS settings. Boxes from our model align more closely with object extents, and correct detections appear with stronger confidence, whereas baselines more often yield looser boxes and borderline scores. This is consistent with NLFM’s multi-branch, non-linear RoI fusion which produces more discriminative features and better confidence calibration and complements EFFM’s proposal improvements, aligning with the nAP trends in
Table 4.
Table 5 presents the per-class detection accuracy of our proposed model on the NWPU VHR-10 [
35] dataset under 3-shot, 5-shot, 10-shot, and 20-shot settings, averaged over three independent runs with random sampling. The results show that detection accuracy generally improves as the number of shots increases, indicating the model’s ability to effectively learn from limited annotations. Notably, even in low-shot scenarios, the model achieves strong performance on classes such as airplane, ship, and harbor, which tend to have consistent shapes and distinct visual cues. In contrast, categories like vehicle and bridge, which exhibit higher intra-class variability, smaller object sizes, or more complex backgrounds, remain more challenging. However, as more training examples are introduced, the model is better able to capture these variations and enhance its detection performance across both base and novel classes, highlighting its robust generalization ability in few-shot detection tasks.
3.4. Results on iSAID Dataset
In this subsection, we report the model’s few-shot detection performance on the iSAID dataset across two evaluation splits. Compared with NWPU VHR-10, iSAID is more difficult: scenes are more diverse and cluttered, and objects exhibit stronger scale variation and orientation changes. Despite this increased complexity, our method attains the best performance across all reported splits and shot settings on iSAID.
As shown in Table 6, novel AP rises with the number of shots for all methods, with the most pronounced gain typically from 5 to 10 shots and a smaller but consistent improvement from 10 to 20 shots. Our method attains the best result in every split configuration. In Split 1, the advantage over the strongest baseline (G-FSDet) remains stable at roughly six to seven points across 3, 5, 10, and 20 shots, indicating steady benefits as supervision increases. In Split 2, the margin widens as the shot count increases, indicating that additional exemplars help our method handle greater scene and appearance variability more effectively.
Figure 6 shows a visual comparison between our model and the baselines. Both G-FSDet and P-CNN struggle on small targets, and neither reliably detects the ship instances. G-FSDet further produces a false positive on the harbor class, while P-CNN entirely misses tennis court and harbor. For the objects that are detected, both baselines produce boxes that deviate noticeably from the ground truth. By contrast, our model recovers all objects present, places tighter boxes that closely match object extents, and assigns higher confidence to true positives, reflecting more reliable localization and scoring.
3.5. Results on DIOR Dataset
A performance comparison between our method and other advanced few-shot object detectors on the DIOR [37] dataset is presented in Table 7. Compared to the NWPU VHR-10 and iSAID datasets, performance on DIOR is generally lower for all methods due to its larger category set, more complex scenes, and higher intra-class variance. Despite these challenges, our model consistently achieves the best results across all four data splits and shot settings, confirming its strong generalization ability.
On DIOR, the split-wise trends mirror how each method handles support information and cluttered scenes. On Split 1, where most approaches improve with more shots, our performance keeps pace with the strongest baseline at 3-shot and then pulls clear as shots increase, indicating that our RoI fusion continues to benefit from additional support examples rather than plateauing. On Split 2, which mixes look-alike categories and textured backgrounds, our model leads all competitors, indicating that early, class-conditioned, multi-scale guidance recovers small and dense targets while the non-linear RoI fusion sharpens localization and confidence in ambiguous scenes. On Split 3, the gap is the largest: as the number of shots increases, our performance keeps improving while the stronger baselines level off. This suggests that methods adding support only at the RoI stage with mostly linear fusion (Meta R-CNN, FsDetView) or using a single prototype (P-CNN) do not capture the variation within DIOR classes. On the more difficult Split 4, our method also stays ahead at all shot counts. By contrast, TFA, which fine-tunes without explicit support information, trails in every setting.
When compared to the NWPU VHR-10 [
35] dataset results, our model demonstrates more stable improvements on DIOR [
37] across all splits, particularly under lower-shot conditions. This suggests that the design of our meta-learning framework including support-guided proposal enhancement and feature fusion not only improves detection accuracy but also enhances robustness in highly diverse scenarios.
Overall, these results on DIOR [
37] confirm that our method provides superior performance and generalizability across different remote sensing benchmarks, especially when data is scarce and category distributions are more complex.
The visualization results of our model compared with P-CNN [
15] and G-FSDet [
19] on the DIOR dataset under the 10-shot setting are shown in
Figure 7. From the comparison, we observe that P-CNN [
15] fails to detect the bridge category entirely, while both P-CNN [
15] and G-FSDet [
19] struggle with small objects such as ship, either missing them or localizing them imprecisely. Furthermore, G-FSDet [
19] fails to detect ground track field, and although P-CNN [
15] manages to produce detections for airport and ground track field, the results suffer from low accuracy and confidence.
These issues largely stem from the models’ limited ability to capture class-specific cues from only a few examples, especially when faced with visually complex backgrounds or subtle object appearances. In contrast, our method shows clear improvements: bridge is correctly detected despite its variable structure; ship and other small objects are localized more precisely; and ground track field and airport are recognized with stronger spatial accuracy. These advantages arise from how our model enhances the interaction between the support examples and query image by allowing the model to focus more on category-relevant regions and suppress irrelevant distractions. This targeted focus is especially beneficial for novel categories with ambiguous boundaries or small sizes, explaining our model’s consistently better visual performance.
Table 8 presents the accuracy of our proposed method for each category in the DIOR [
37] dataset. The results are averaged over three independent runs with random sampling and cover both base and novel classes under 3-shot, 5-shot, 10-shot, and 20-shot settings. Among the base classes, categories like Airplane, Windmill, and Basketball court achieve consistently high accuracy across all shot settings, indicating the model’s ability to generalize well when sufficient visual consistency exists. In contrast, categories such as Bridge and Harbor show relatively lower performance, likely due to structural ambiguity and background clutter in remote sensing imagery.
For novel classes, performance is understandably lower due to limited training samples. Nonetheless, the accuracy of classes like Tennis court improves significantly with more shots, reaching 80.13% in the 10-shot setting. On the other hand, Dam and Vehicle remain challenging, especially under low-shot conditions, suggesting their visual variability and small object size pose difficulties for few-shot generalization. Overall, the table highlights the model’s strong performance on a wide range of categories and illustrates both its strengths and the remaining challenges in handling hard-to-detect novel objects.
4. Discussion
To comprehensively assess the effectiveness of our proposed few-shot object detection framework, this section presents an in-depth discussion of its key design choices, sensitivity to hyperparameters, and computational efficiency. We begin with ablation studies to isolate and evaluate the contributions of core modules, particularly EFFM and NLFM, in order to understand how each component influences the detection performance across different shot settings. We then analyze the impact of varying the meta loss weight to determine the most effective configuration and to assess the model’s robustness to this hyperparameter. Finally, we compare our model’s parameter size and inference speed against other state-of-the-art few-shot detectors to illustrate its practical trade-off between accuracy and computational complexity. These analyses together provide a holistic understanding of the model’s strengths, limitations, and applicability in real-world scenarios.
4.1. Ablation Study
To further evaluate the effectiveness of EFFM, we conduct an ablation experiment focusing on the quality of region proposals generated by the RPN. Specifically, we compare the Recall@N performance of the baseline RPN and RPN augmented with EFFM under different few-shot settings (3, 5, 10, and 20 shots). The metric Recall@N measures the percentage of ground-truth objects that are covered by the top-N proposals (with IoU ≥ 0.5), providing insight into the proposal network’s ability to recall relevant regions for downstream detection.
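The Recall@N definition above can be made concrete with a minimal per-image sketch. The function names, the `(x1, y1, x2, y2)` box format, and the toy data are assumptions for illustration; in practice the counts would be accumulated over all test images before dividing.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at_n(proposals, scores, gt_boxes, n, iou_thr=0.5):
    """Fraction of ground-truth boxes matched by any of the top-n proposals."""
    if not gt_boxes:
        return 0.0
    # Rank proposals by objectness score and keep the n best.
    order = sorted(range(len(proposals)), key=lambda i: scores[i], reverse=True)
    top = [proposals[i] for i in order[:n]]
    covered = sum(
        1 for gt in gt_boxes if any(iou(p, gt) >= iou_thr for p in top)
    )
    return covered / len(gt_boxes)


# Toy example: two ground-truth objects and three scored proposals.
gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
props = [(0, 0, 10, 9), (50, 50, 60, 60), (21, 21, 30, 30)]
scores = [0.9, 0.8, 0.3]
print(recall_at_n(props, scores, gt, n=2))  # 0.5: only the first object is covered
print(recall_at_n(props, scores, gt, n=3))  # 1.0: both objects are covered
```

The toy case shows why recall grows with N: the proposal that covers the second object is ranked third, so it only counts once N reaches 3.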
As shown in
Table 9, introducing EFFM consistently improves RPN recall across all values of N and shot settings. For instance, in the 3-shot setting, Recall@100 improves from 42.1% to 55.7%, while Recall@1000 improves from 71.0% to 81.6%, indicating a significant enhancement in proposal coverage despite the scarcity of training data. This trend continues as the number of shots increases: under the 20-shot setting, the RPN with EFFM achieves 67.6% Recall@100 and 90.5% Recall@1000, outperforming the baseline RPN by large margins.
These improvements can be attributed to the ability of EFFM to incorporate support-aware contextual information into the proposal generation process. By fusing fine-grained support features into the query branch at an early stage, EFFM strengthens the RPN’s sensitivity to class-relevant regions, leading to higher-quality proposals even with limited annotations.
To visualize this trend more clearly, we plot the Recall@N curves under each shot setting in
Figure 8. The curves consistently show that the RPN with EFFM achieves higher recall across all N values compared to the baseline. Notably, the performance gap is more pronounced in lower shot scenarios, which demonstrates that EFFM is particularly effective in boosting region proposal quality under extreme data scarcity—one of the core challenges in few-shot object detection.
These results confirm that the EFFM module significantly enhances the RPN’s ability to propose relevant candidate regions, laying a stronger foundation for downstream classification and regression stages.
To assess why the proposed NLFM is needed beyond simple RoI–support combinations, we ablate its components at 10-shot on NWPU VHR-10. The results are presented in
Table 10. The baseline uses only the concatenation path. Adding an element-wise multiplication branch raises nAP from 75.28 to 77.75, indicating that explicit class-consistent matching helps. Introducing the subtraction branch yields a further increase to 78.31, showing that the similarity and difference paths are complementary. The largest gain appears when we apply the lightweight non-linear projection after concatenation of the three branches, reaching 81.64. This suggests that merely stacking linear operators underuses the complementary signals, and non-linearity is important to mix and reweight branch information, producing more discriminative RoI features under few-shot conditions.
To assess the individual and combined contributions of EFFM and NLFM, we conduct ablation studies under different few-shot settings on the NWPU VHR-10 dataset. The results are presented in
Table 11.
The baseline model without EFFM or NLFM achieves relatively modest performance, with a 3-shot nAP of 54.32% and a 20-shot nAP of 76.50%. Introducing EFFM alone yields a substantial boost across all settings, with improvements of +4.61, +4.76, +5.42, and +6.77 percentage points at 3, 5, 10, and 20 shots, respectively. This demonstrates the strong ability of EFFM to incorporate class-specific support cues into the proposal generation and feature extraction pipeline, including in low-data regimes.
When only NLFM is enabled, we observe notable but slightly smaller gains compared to EFFM: +3.33 percentage points at 3-shot and +5.91 percentage points at 20-shot. This suggests that NLFM enhances the model’s representation power through non-linear transformations, helping it better distinguish subtle differences between object instances with limited examples.
When both EFFM and NLFM are enabled, the model achieves the highest nAP across all shot settings, reaching 61.46% (3-shot) and 87.64% (20-shot). This indicates that EFFM and NLFM are complementary: while EFFM improves support–query alignment at the feature fusion stage, NLFM boosts the discriminability of feature representations at a deeper semantic level. Their joint effect leads to consistently superior performance, verifying the effectiveness of the full architecture.
Moreover, the two modules together deliver larger gains over the baseline than either module alone, +7.14 percentage points at 3-shot and +11.14 percentage points at 20-shot, so the benefit of support-based guidance and expressive features holds from the data-scarce end of the range to the higher-shot end. This reinforces our model’s design motivation: to maximize information utilization and enhance generalization in data-limited detection scenarios.
We also visualize how EFFM reshapes the query features in
Figure 9. Before EFFM, the responses are already object-dominated, but they spill into surrounding background and nearby structures are activated, blurring object boundaries. After applying EFFM’s class-conditioned cross-attention, these spurious background activations are substantially reduced while the peaks over true objects are preserved and sharpened.
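For readers unfamiliar with cross-attention between a query feature map and class-specific support features, a generic single-head version can be sketched as below. This is not the EFFM implementation: it omits learned projections and multi-scale handling, and the residual addition and all names are assumptions made for illustration.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query_feats, support_feats):
    """Each query location attends over the support locations of one class.

    query_feats: list of C-dim vectors (flattened query feature map).
    support_feats: list of C-dim vectors (flattened support features).
    Returns the query features with attended support context added residually.
    """
    scale = math.sqrt(len(query_feats[0]))
    out = []
    for q in query_feats:
        # Scaled dot-product similarity between this location and each support vector.
        logits = [sum(a * b for a, b in zip(q, s)) / scale for s in support_feats]
        weights = softmax(logits)
        context = [
            sum(w * s[c] for w, s in zip(weights, support_feats))
            for c in range(len(q))
        ]
        out.append([qi + ci for qi, ci in zip(q, context)])
    return out


# Toy usage: two query locations, one support vector of the target class.
query = [[1.0, 0.0], [0.0, 1.0]]
support = [[1.0, 0.0]]
print(cross_attend(query, support))  # [[2.0, 0.0], [1.0, 1.0]]
```

Because the support vectors belong to a single class, the context added at each query location is specific to that class, which is the sense in which the attention is class-conditioned.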
4.2. Hyperparameters: Meta Loss Weight and Fusion Interpolation Factor
To further investigate the influence of key hyperparameters in our model, we conduct controlled ablation experiments on the weight coefficient of the meta loss and the interpolation factor used in the support–query fusion process. Specifically, the meta loss weight balances the contribution of the meta-level supervision to the overall optimization, and thus plays a crucial role in enforcing discriminative support feature representations. The interpolation factor, on the other hand, controls the extent to which the fused query features retain information from the original query versus the adapted support cues. Proper tuning of these hyperparameters is essential to achieving optimal few-shot detection performance. In the following experiments, we systematically vary both to analyze their individual effects under different shot settings.
Table 12 presents the impact of varying the fusion interpolation factor on few-shot detection performance across different shot settings. This parameter controls the balance between the original query feature and the support-enhanced feature in the fusion stage. As observed from the results, a moderate value consistently yields the best performance under all shot conditions, achieving the highest nAP scores.
When the interpolation factor is set too low, the fused features rely heavily on the original query features and lack sufficient support-driven adaptation, which limits performance, especially under low-shot conditions. Conversely, when it approaches 1.0, the model over-relies on support information, which may introduce noise or reduce robustness due to the variability and sparsity of support examples. This results in a performance drop, particularly for 3-shot and 5-shot cases.
These results suggest that properly tuning the fusion balance is crucial: too little support information leads to under-utilization of guidance, while too much causes overfitting or loss of discriminative query structure. The optimal setting effectively captures support-based guidance while preserving the structural integrity of the query features, thereby enhancing few-shot generalization.
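The role of the interpolation factor can be illustrated with a minimal sketch, assuming the fusion is a convex combination in which a factor of 0 keeps the original query feature and a factor of 1 keeps only the support-enhanced feature. The function name and toy vectors are hypothetical, and the exact fusion formula used in the model may differ.

```python
def interpolate_features(query, support_enhanced, factor):
    """Blend query and support-enhanced features element-wise.

    factor = 0 keeps the original query; factor = 1 keeps only the
    support-enhanced feature (assumed convex-combination form).
    """
    if not 0.0 <= factor <= 1.0:
        raise ValueError("factor must lie in [0, 1]")
    return [
        (1.0 - factor) * q + factor * s
        for q, s in zip(query, support_enhanced)
    ]


query = [1.0, 0.0, 2.0]
support_enhanced = [0.0, 1.0, 4.0]
for factor in (0.0, 0.5, 1.0):
    print(factor, interpolate_features(query, support_enhanced, factor))
# 0.0 [1.0, 0.0, 2.0]
# 0.5 [0.5, 0.5, 3.0]
# 1.0 [0.0, 1.0, 4.0]
```

The two extremes correspond to the failure modes described above: at 0 the support guidance is ignored, and at 1 the query structure is discarded.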
We also experimented with making the interpolation factor a learnable parameter initialized at 0.6, allowing it to adapt during training. While this approach achieved comparable final performance, it introduced training instability in early epochs and increased convergence time by approximately 15%. Additionally, the learned values varied significantly across different category splits, reducing reproducibility. The fixed value provided more predictable behavior and faster convergence without sacrificing performance.
The ablation study in Table 13 investigates the influence of the meta loss weight on few-shot detection performance. When the weight is 0, i.e., the meta loss is not applied, the model exhibits significantly lower performance across all shot settings. This demonstrates the critical role of the meta-learning objective in enhancing the generalization ability of the model to novel classes.
As the meta loss weight increases from 0 to 0.05, the detection accuracy steadily improves, with the best results achieved at 0.05. This suggests that a moderate contribution from the meta loss effectively guides the support feature representation to become more discriminative, thus benefiting downstream object detection.
However, further increasing the weight to 0.1 and 0.2 leads to a slight performance drop. This is likely due to the model overemphasizing the meta loss at the expense of the primary detection objectives, resulting in suboptimal training dynamics. These results indicate that careful tuning of the meta loss weight is essential, and that a value of 0.05 offers the best trade-off between the classification and regression objectives and the meta-level supervision.
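How the weight shifts the balance between objectives can be seen in a small sketch, assuming the meta loss is added to the detection losses as a weighted term. The additive form and the loss values are hypothetical; any RPN losses would enter alongside the classification and regression terms in the same way.

```python
def total_loss(cls_loss, reg_loss, meta_loss, meta_weight):
    """Detection losses plus a weighted meta loss (assumed additive form)."""
    return cls_loss + reg_loss + meta_weight * meta_loss


# Hypothetical loss values, used only to show how the weight shifts the balance.
cls_loss, reg_loss, meta_loss = 0.8, 0.4, 2.0
for meta_weight in (0.0, 0.05, 0.1, 0.2):
    total = total_loss(cls_loss, reg_loss, meta_loss, meta_weight)
    share = meta_weight * meta_loss / total
    print(f"weight={meta_weight:<4} total={total:.2f} meta share={share:.1%}")
```

With these toy values, the share of the total loss contributed by the meta term grows from zero to a quarter as the weight goes from 0 to 0.2, which illustrates how a large weight can start to compete with the primary detection objectives.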
To evaluate the sensitivity and robustness of our hyperparameter choices, we systematically varied the meta loss weight and the interpolation factor around their selected values. Performance remained stable within ±0.2 variations, with nAP fluctuations within ±1.5% across all shot settings on NWPU VHR-10, indicating low sensitivity to minor tuning. The same hyperparameters transfer to iSAID and DIOR without adjustment, preserving competitive performance and demonstrating cross-dataset robustness.
4.3. Analysis of Failure Cases
Figure 10 presents three representative failure cases from our experiments on the NWPU VHR-10 dataset under the 10-shot setting. In Case 1, two small ship instances are missed in a dense harbor scene. The red circles highlight these undetected ships. Case 2 shows a missed small airplane that exhibits low contrast against the runway background. Case 3 demonstrates an interesting failure pattern where, among three bridge instances present in the image, only the curved bridge with atypical geometry is missed while the two normal-shaped bridges are correctly detected.
These failure patterns can be analyzed in relation to our model’s architectural design and the inherent challenges of few-shot remote sensing object detection. The missed small targets in Cases 1 and 2 primarily stem from limitations in our MFSFEM. Although MFSFEM extracts support features at multiple scales, extremely small objects occupying only a few pixels in the feature maps may not generate sufficiently strong activations for the level feature queries to capture. This is particularly challenging when such small objects appear in cluttered backgrounds, as the attention mechanism may prioritize more prominent visual features. Additionally, FAFM may allocate insufficient weight to these subtle patterns when integrating support information into query features, especially when the support examples do not adequately represent the full range of scale variations.
Case 3 reveals a different limitation related to our model’s ability to handle significant intra-class shape variations. While our NLFM effectively captures semantic similarity and difference between support and query features, it may struggle with extreme geometric transformations not well-represented in the few support examples. The curved bridge represents an atypical instance that differs substantially from the support examples used during meta training. Our model’s dependence on support prototypes for feature extraction and fusion means that objects deviating significantly from these prototypes may not be adequately recognized.
These failure cases highlight several directions for future improvement. First, enhancing the attention mechanisms in MFSFEM to better capture subtle features of extremely small objects could improve detection of targets like the missed ships and airplane. Second, incorporating more diverse support examples during training, possibly through data augmentation strategies that simulate various object scales and shapes, could increase the model’s robustness to intra-class variations. Third, exploring adaptive thresholding mechanisms in the RPN and classification heads could help recover borderline detections that currently fall below the confidence threshold.
4.4. Complexity and Inference Time
In few-shot object detection, especially in remote sensing applications, it is critical not only to achieve high detection accuracy but also to maintain computational efficiency for practical deployment. Therefore, we evaluate our model’s complexity and inference speed, comparing it with several state-of-the-art few-shot detectors to understand the trade-offs between accuracy and efficiency.
Table 14 reports the number of parameters, inference speed (frames per second, FPS), and inference time per image measured on an NVIDIA RTX 4090 GPU for our method and several competitive baselines.
From the results, our model maintains a moderate parameter size compared to other methods, slightly larger than P-CNN [
15] and TFA [
16] but significantly smaller than FsDetView [
14]. The inclusion of EFFM and NLFM introduces additional computation; however, the inference speed remains competitive at 15.5 FPS, corresponding to an average inference time of 64.5 ms per image.
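The relationship between the reported throughput and latency is a simple reciprocal, and the sketch below shows both the conversion and a generic way to time a model. The helper names and the `run_once` callable are hypothetical. On a GPU, the device would also need to be synchronized before reading the clock (for example with `torch.cuda.synchronize()` in PyTorch) so that asynchronous kernels are included in the measurement.

```python
import time


def fps_to_ms(fps):
    """Average per-image latency in milliseconds for a given throughput."""
    return 1000.0 / fps


def measure_ms_per_image(run_once, images, warmup=2):
    """Average wall-clock latency of run_once over images, in milliseconds."""
    for image in images[:warmup]:
        run_once(image)  # warm-up calls are excluded from the timing
    start = time.perf_counter()
    for image in images:
        run_once(image)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(images)


print(round(fps_to_ms(15.5), 1))  # 64.5
```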
While G-FSDet [
19] achieves the fastest inference speed at 21 FPS, it comes at the cost of a larger model size. In contrast, P-CNN [
15] and TFA [
16] have fewer parameters but suffer from notably slower inference speeds, limiting their practical applicability in scenarios requiring real-time or near-real-time detection.
We measure the average inference time per query image on the NWPU VHR-10 dataset using a single NVIDIA RTX 4090 GPU, varying the number of support shots K from 1 to 20. The inference time scales sub-linearly with the support set size, as shown in
Table 15. This favorable efficiency stems from the synergistic effect of support feature caching and linear-complexity fusion operations. Once extracted, support features are cached and reused across multiple query images, while the EFFM and NLFM modules are designed with linear computational complexity relative to K. Consequently, the model maintains practical inference latency even with larger support sets, enabling the use of richer contextual information without prohibitive computational burden.
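The caching idea described above can be sketched as follows, assuming support features are stored per class and reused for every query image. The class name, the `extract_fn` callable, and the toy extractor are hypothetical stand-ins; the real model would cache backbone feature maps rather than two-element lists.

```python
class SupportFeatureCache:
    """Compute per-class support features once and reuse them for every query."""

    def __init__(self, extract_fn):
        self._extract_fn = extract_fn  # hypothetical support feature extractor
        self._cache = {}
        self.num_extractions = 0

    def get(self, class_name, support_images):
        if class_name not in self._cache:
            # Extraction cost is paid once per class: K extractor calls.
            self._cache[class_name] = [self._extract_fn(img) for img in support_images]
            self.num_extractions += len(support_images)
        return self._cache[class_name]


def toy_extractor(image):
    """Stand-in for a backbone: returns a 2-d feature from a list of pixels."""
    return [sum(image) / len(image), max(image)]


cache = SupportFeatureCache(toy_extractor)
support = {"ship": [[1, 2, 3], [3, 4, 5]]}  # K = 2 toy support images
for _query in range(100):                   # 100 query images
    ship_feats = cache.get("ship", support["ship"])
print(ship_feats, cache.num_extractions)    # [[2.0, 3], [4.0, 5]] 2
```

In this toy run the extractor is called K = 2 times in total rather than 2 × 100 times, so the per-query cost that still depends on K is only the fusion step.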
Our model strikes a good balance between accuracy and computational efficiency, delivering faster inference than many baselines while maintaining a manageable model size. This balance is crucial for deploying few-shot object detectors in remote sensing systems where both precision and timely response are essential.
In summary, despite the increased complexity introduced by our model’s novel modules, the achieved inference speed and parameter efficiency demonstrate its suitability for real-world applications with reasonable computational resources.
5. Conclusions
In this work, we have proposed a novel meta-learning-based few-shot object detection framework tailored for remote sensing imagery. By introducing EFFM and NLFM, our method effectively leverages support set information to enhance the representation and discrimination of query features. Moreover, the incorporation of a meta-level loss further enforces discriminative support features, improving the overall detection robustness under extremely limited annotated data scenarios.
Extensive experiments on benchmark remote sensing datasets, including NWPU VHR-10, iSAID and DIOR, demonstrate that our approach consistently outperforms state-of-the-art few-shot detection methods across various shot settings. Our ablation studies confirm the individual contributions of each proposed module, highlighting their complementary benefits. Additionally, we provide comprehensive analyses of model complexity and inference speed, verifying that our method achieves a favorable balance between accuracy and computational efficiency.
However, despite these promising results, our model still has some limitations. The Enhanced Feature Fusion Module introduces additional computational overhead due to its internal attention mechanisms, which can limit real-time deployment in resource-constrained environments. Furthermore, while the meta-level loss improves feature discriminability, tuning its weighting hyperparameter requires careful empirical validation, which may restrict the model’s adaptability across diverse datasets. Lastly, our framework currently focuses on single-image detection and does not address temporal or multi-view information that could further improve robustness in remote sensing applications.
Future work will focus on optimizing the computational efficiency of the feature fusion modules, exploring adaptive hyperparameter tuning strategies, and extending the framework to incorporate spatiotemporal cues and multi-sensor fusion. These directions aim to enhance both the practicality and performance of few-shot remote sensing object detection in real-world scenarios.