MSFFDet: A Meta-Learning-Based Support-Guided Feature Fusion Detector for Few-Shot Remote Sensing Detection

Qi, Haoxiang; Zhao, Wenzhe; Zhang, Ting; Zhou, Guangyao

doi:10.3390/app16020917

Open Access Article

by Haoxiang Qi ^1,2,3,4, Wenzhe Zhao ^1,2,3,4,*, Ting Zhang ^1,2,3 and Guangyao Zhou ^1,2,3

¹ The Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
² Key Laboratory of Spatial Information Processing and Application System Technology, Chinese Academy of Sciences, Beijing 100190, China
³ Key Laboratory of Target Cognition and Application Technology (TCAT), Chinese Academy of Sciences, Beijing 100190, China
⁴ School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 101408, China
* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(2), 917; https://doi.org/10.3390/app16020917

Submission received: 27 November 2025 / Revised: 22 December 2025 / Accepted: 12 January 2026 / Published: 15 January 2026

(This article belongs to the Special Issue AI in Object Detection)

Abstract

Few-shot object detection in remote sensing imagery faces significant challenges, including limited labeled samples, complex scene backgrounds, and subtle inter-class differences. To tackle these issues, we design a novel detection framework that effectively transfers supervision from a few annotated support examples to the query domain. We introduce a feature enhancement mechanism that injects fine-grained support cues into the query representation, helping the model focus on relevant regions and suppress background noise. This allows the model to generate more accurate proposals and perform robust classification, especially for visually confusing or small objects. Additionally, our method enhances feature interaction between support and query images through a nonlinear combination strategy, which captures both semantic similarity and discriminative differences. The proposed framework is fully end-to-end and jointly optimizes the feature fusion and detection processes. Experiments on three challenging benchmarks, NWPU VHR-10, iSAID and DIOR, demonstrate that our method consistently achieves state-of-the-art results under different few-shot settings and category splits. Compared with other advanced methods, it yields superior performance, highlighting its strong generalization ability in low-data remote sensing scenarios.

Keywords: object detection; few-shot learning; remote sensing

1. Introduction

Object detection in remote sensing imagery aims to identify and localize targets such as airplanes, ships, vehicles, and buildings from aerial or satellite images [1]. This task plays a critical role in various real-world applications, including military surveillance, urban planning, disaster monitoring, and environmental assessment. Compared to natural images, remote sensing images present unique challenges such as large variations in scale, complex backgrounds, high object density, and significant inter-class similarity [2,3]. While deep learning-based detectors, especially convolutional neural networks (CNNs), have achieved remarkable progress in recent years, their success often depends on large-scale annotated datasets. However, acquiring such annotations in the remote sensing domain is time-consuming, labor-intensive, and sometimes infeasible [4,5].

Few-shot learning provides a principled paradigm to alleviate label scarcity by enabling models to generalize from only a few labeled examples per class. In remote sensing, this capability is particularly critical for time-sensitive and dynamically evolving scenarios, such as rapid post-disaster assessment or military reconnaissance, where novel object categories (e.g., new target types, camouflaged equipment, emerging infrastructure) must be recognized with very limited supervision [5,6,7]. In such cases, conventional fully supervised detectors trained on static, closed-set benchmarks struggle to adapt, motivating a dedicated line of research on few-shot object detection in remote sensing imagery (RS-FSOD) [8,9]. Compared with few-shot object detection in natural images, RS-FSOD must additionally contend with larger scale ranges, more complex and heterogeneous backgrounds (urban, rural, maritime, mountainous), resolution-induced domain shifts, and extremely high target density in very high-resolution (VHR) imagery [10]. These factors make RS-FSOD not merely a direct extension of generic FSOD but a distinct and more challenging research problem.

Few-shot object detection (FSOD) extends few-shot learning by jointly solving classification and bounding-box localization [8,11]. Existing FSOD approaches can be broadly grouped into meta-learning-based and transfer learning-based methods [8]. On the meta-learning side, MetaYOLO [12] learns class-agnostic feature importance and dynamically updates detector weights from a few support examples, while Meta R-CNN [13] infers class-specific attention vectors from support images and uses them to modulate RoI features. FsDetView [14] further introduces a class encoder and query encoder whose outputs are fused by channel-wise multiplication, difference, and concatenation, and Prototype-CNN [15] employs class prototypes to guide proposal generation in remote sensing scenes. However, these meta-learning methods mostly inject class-conditional cues at a late RoI stage or via predominantly linear fusion, and rarely perform explicit, support-conditioned multi-scale allocation in the proposal process, which can limit recall under cluttered, scale-diverse remote-sensing scenes.

Transfer learning-based methods instead adapt a pre-trained detector to novel classes with limited annotations. Chen et al. [16] combine SSD and Faster R-CNN with background suppression regularization to improve data efficiency, while TFA [17] fine-tunes only the final layers with instance-level feature normalization to reduce intra-class variance and catastrophic forgetting. For remote sensing images, Zhao et al. [18] introduce an involution-based backbone with path aggregation and shape bias, and Zhang et al. [19] design a framework that couples metric-based loss with representation compensation and knowledge distillation. Despite these advances, most transfer-learning pipelines remain support-agnostic at the multi-scale feature selection stage: they do not explicitly use class-specific cues to steer the feature pyramid or RPN toward appropriate scales for novel categories, which is particularly problematic for small or densely distributed targets in remote sensing imagery.

From the perspective of how support information is fused into the query stream, FSOD methods have evolved along a relatively clear trajectory. Early work adopted shallow, almost plug-in fusion strategies: support and query features were concatenated, added or multiplied channel-wise, and then passed to a linear detection head [12,17]. These designs are easy to optimize, but the support signal only weakly modulates the query features and tends to under-utilize the limited support set, especially when novel categories are visually similar, heavily cluttered, or small in scale, as in remote sensing scenes. Subsequently, prototype-based conditioning was introduced: Meta R-CNN [13] learns class-specific attention vectors from support images to reweight RoI features, FsDetView [14] fuses a class encoder and a query encoder via multiple elementwise branches, and prototype-guided RPNs [15] bias proposal generation using learned class prototypes. These approaches push support–query interaction deeper into the network, but fusion remains mostly linear and is often applied only at a late RoI stage.

More recently, non-linear and higher-order fusion mechanisms have begun to attract attention. Cross-attention modules [20] condition one feature stream on another via query–key–value interactions, enabling adaptive, context-aware modulation of query features by the support set. Bilinear and second-order pooling modules [21] explicitly model multiplicative channel relationships and richer support–query statistics, and have been explored to enhance fine-grained discrimination in FSOD. However, naive applications of cross-attention and bilinear fusion are typically parameter-intensive and data-hungry, which is problematic in RS-FSOD where annotations are scarce and domain shifts are pronounced. Moreover, many existing architectures still inject support information only at the RoI head, leaving the Region Proposal Network (RPN) and multi-scale feature pyramid largely support-agnostic, which is problematic when remote sensing targets are small, densely distributed, and heavily scale-dependent. In this trajectory, our proposed framework can be viewed as a lightweight, remote-sensing–oriented continuation.

Meta-learning is designed specifically for scenarios with extremely limited samples [22]. Rather than being categorically “better” than transfer learning, meta-learning offers a complementary route that emphasizes fast adaptation from few examples, while transfer learning preserves base-class knowledge via careful fine-tuning. Despite recent advances, meta-learning-based RS-FSOD still faces three key issues. First, current methods struggle to effectively utilize support set information to enhance the RPN module, and the feature fusion between support and query images remains suboptimal [9]. Second, large scale variations and strong intra-class diversity in remote sensing imagery are not well addressed, leading to reduced detection robustness. While stronger backbones and path aggregation with shape bias [18] improve multi-scale robustness, these components are largely support-agnostic and thus do not explicitly encode class-conditioned scale features for novel categories. Third, the fusion between RoI features and support features is often overly simplistic, limiting the model’s ability to capture their complex relationships and adapt to novel classes. Common heads rely on cosine similarity or single-branch elementwise fusion (e.g., multiplication, difference, or concatenation as in [14]; lightly fine-tuned linear heads as in [17]), which underuse complementary signals.

To address the above three issues, we propose an Enhanced Feature Fusion Module (EFFM) that integrates class-specific prior information from support features into the query features, thereby enhancing the query’s responsiveness to support instances while preserving the structural information of the query image. The EFFM consists of two sub-modules: the Multi-scale Fine-grained Support Feature Extraction Module (MFSFEM) and the Feature Allocation Fusion Mechanism (FAFM). MFSFEM is responsible for extracting rich and detailed support features across multiple scales, aiming to tackle the challenges of significant intra-class variations and small inter-class differences in remote sensing imagery. FAFM then adaptively allocates and fuses these extracted features into the query representation, ensuring more precise alignment and interaction between support and query features.

On top of this, we further introduce a Non-Linear Fusion Module (NLFM), which improves upon conventional RoI feature fusion strategies in meta-learning. NLFM enhances the model’s representational capacity and better captures complex semantic relationships, allowing it to more effectively handle the intricate and heterogeneous features present in remote sensing images.

The main contributions of this paper can be summarized as follows: (1) We propose an Enhanced Feature Fusion Module composed of MFSFEM and FAFM, which jointly extract and inject fine-grained, multi-scale support features into query representations for better detection of novel objects. (2) We introduce a Non-Linear Fusion Module that models complex interactions between support and query features, improving feature discriminability and generalization under few-shot settings. (3) We build a unified meta-learning detection framework that integrates EFFM and NLFM in both proposal and classification stages. Experiments on NWPU VHR-10, iSAID and DIOR demonstrate significant performance gains over state-of-the-art methods.

2. Materials and Methods

In this section, we provide a detailed description of our meta-learning-based few-shot object detection method, which is designed based on the Faster R-CNN [23] framework. The core components of our model include the backbone, RPN, EFFM, NLFM, and RCNN head as shown in Figure 1. EFFM is designed to embed class-specific information from support features into query features, enhancing the model’s ability to detect novel objects. It is composed of MFSFEM and FAFM. MFSFEM extracts fine-grained support features at multiple scales (P3–P5) using learnable query vectors and cross-attention. FAFM then fuses these features into the corresponding query feature maps based on similarity, producing support-aware query representations. This feature-level integration guides the RPN to generate proposals more closely aligned with support instances. NLFM enhances the interaction between RoI features and support features through a more expressive fusion strategy. By replacing traditional simple fusion methods, NLFM improves feature alignment and boosts the model’s ability to recognize novel objects in complex remote sensing scenes.

2.1. EFFM

Few-shot object detection based on meta-learning follows a dual-branch framework [26], where both support and query sets are used during training and testing. The Enhanced Feature Fusion Module is a key component designed to bridge these two branches and consists of two parts: the Multi-scale Fine-grained Support Feature Extraction Module and the Feature Allocation Fusion Mechanism. The first module extracts fine-grained, multi-scale semantic features from support instances, while the second module integrates these features into the query image representation. Compared to using raw query features alone, this fusion allows the RPN to generate proposals more effectively by focusing on regions relevant to the support instances, rather than searching blindly across the entire query image.

2.1.1. Multi-Scale Fine-Grained Support Feature Extraction Module

We follow the FPN [27] architecture to obtain multi-scale feature maps, which helps address the large scale variation commonly found in remote sensing imagery. FPN is a feature fusion structure that combines a top-down pathway with lateral connections, allowing high-level semantic information from the deeper backbone stages to be propagated to the higher-resolution feature maps. While the original FPN produces feature maps from P2 to P5, in our design we utilize only P3 to P5 for subsequent processing.

As shown in Figure 2, the Multi-scale Fine-grained Support Feature Extraction Module takes the support instance and extracts its multi-scale features through the backbone and FPN, resulting in feature maps at levels P2 to P5, of which P3 to P5 are used. To capture fine-grained details at different scales, we introduce level feature queries: three learnable query vectors assigned to each support class during training. These vectors correspond to the P3, P4, and P5 feature maps and are used to extract fine-grained semantic representations from each scale. During training, the query vectors are progressively refined to specialize in capturing the most informative features at their respective levels. Similar to the object queries used in DETR [28], we employ a cross-attention mechanism [20] to extract features from the P3, P4, and P5 feature maps. Each level feature query attends to its corresponding feature map, allowing the model to selectively focus on informative regions and obtain fine-grained support representations at multiple scales.

The fine-grained support feature extraction process is performed separately at multiple scales using a cross-attention mechanism between the level feature queries and the corresponding support feature maps. For a given level $l \in \{3, 4, 5\}$, let the support feature map be denoted as:

$X_{s}^{(l)} \in \mathbb{R}^{H_{l} \times W_{l} \times d}$

(1)

where $H_{l}$ and $W_{l}$ are the height and width of the feature map at level $l$, and $d$ is the channel dimension. We flatten the spatial dimensions, yielding:

$X_{s}^{(l)} \in \mathbb{R}^{N_{l} \times d}, \quad \text{where } N_{l} = H_{l} \times W_{l}$

(2)

Each level is assigned a learnable level feature query vector:

$q^{(l)} \in \mathbb{R}^{d'}$

(3)

To compute the attention between the query and the support feature map, we first project the support features to the same latent space as the query using a linear transformation:

$X_{s}' = X_{s}^{(l)} W \in \mathbb{R}^{N_{l} \times d'}$

(4)

where $W \in \mathbb{R}^{d \times d'}$ is a learnable weight matrix. We then compute the attention weights:

$a^{(l)} = \mathrm{softmax}\left(\frac{q^{(l)} \cdot (X_{s}')^{\top}}{\sqrt{d'}}\right) \in \mathbb{R}^{N_{l}}$

(5)

The softmax is applied along the spatial dimension $N_{l}$, and the resulting attention weights are used to aggregate the support features:

$p^{(l)} = \sum_{i = 1}^{N_{l}} a_{i}^{(l)} X_{s}^{(l)}[i] \in \mathbb{R}^{d}$

(6)

Here, $p^{(l)}$ is the fine-grained support feature vector at level $l$, which captures the most relevant information from the spatial regions of the support feature map in relation to the level feature query.

During training, the level feature queries $q^{(3)}, q^{(4)}, q^{(5)}$ are optimized to adaptively focus on discriminative regions across scales, enabling the extraction of rich semantic cues from support instances. These vectors serve as learnable class-conditioned extractors that help encode subtle visual patterns critical for few-shot recognition in remote sensing scenarios.
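The attention pooling of Eqs. (2) and (4)–(6) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `level_query_pool`, the toy feature-map sizes, the random initialization, and the use of a single projection matrix shared across levels are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def level_query_pool(x_s, q, w):
    """Pool one support feature map with one level feature query.

    x_s: (H, W, d) support feature map at a single FPN level.
    q:   (d_prime,) level feature query (learnable in the paper, random here).
    w:   (d, d_prime) projection matrix of Eq. (4).
    Returns the attention weights (N,) and the pooled vector p (d,).
    """
    h, w_sp, d = x_s.shape
    x_flat = x_s.reshape(h * w_sp, d)           # Eq. (2): N x d
    x_proj = x_flat @ w                         # Eq. (4): N x d'
    scores = x_proj @ q / np.sqrt(q.shape[0])   # scaled dot products, shape (N,)
    a = softmax(scores, axis=0)                 # Eq. (5): softmax over the N positions
    p = a @ x_flat                              # Eq. (6): weighted sum of the unprojected features
    return a, p

rng = np.random.default_rng(0)
d, d_prime = 8, 4
sizes = {3: (8, 8), 4: (4, 4), 5: (2, 2)}       # hypothetical sizes for P3, P4, P5
w_proj = rng.normal(size=(d, d_prime))          # assumption: one W shared by all levels
pooled = {}
for level, (h, w_sp) in sizes.items():
    x_s = rng.normal(size=(h, w_sp, d))
    q = rng.normal(size=d_prime)
    pooled[level] = level_query_pool(x_s, q, w_proj)
```

Note that the attention weights are computed in the projected space of dimension $d'$, while the aggregation in Eq. (6) is taken over the original $d$-dimensional features, so each pooled vector keeps the channel dimension of the FPN map.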

2.1.2. Feature Allocation Fusion Mechanism

To integrate support features into the query representation, we design FAFM based on cross-attention. The fusion is performed separately for each FPN level $l \in \{3, 4, 5\}$. Taking level $P_3$ as an example, as shown in Figure 3, each class in the support set is represented by a fine-grained support feature vector extracted at this level, denoted as:

$P^{(3)} = \mathrm{concat}(p_{13}, p_{23}, \dots, p_{c3}) \in \mathbb{R}^{c \times d}$

(7)

where $p_{i3}$ is the support feature vector of class $i$ at level 3, $c$ is the number of support classes, and $d$ is the channel dimension. Let the query feature map at level $P_3$ be flattened as:

$X_{q}^{(3)} \in \mathbb{R}^{HW \times d}$

(8)

To compute similarity between query features and class-level support features, we project both into a shared latent space using a linear layer $W' \in \mathbb{R}^{d \times d'}$:

$X_{q}' = X_{q}^{(3)} W' \in \mathbb{R}^{HW \times d'}, \quad P' = P^{(3)} W' \in \mathbb{R}^{c \times d'}$

(9)

Here, we use a shared projection matrix $W'$ for both query and support features in order to reduce the number of parameters and improve training efficiency, which is especially important under the low-data regime of few-shot learning. Then we compute the cross-attention weights:

$A' = \mathrm{softmax}\left(\frac{X_{q}' (P')^{\top}}{\sqrt{d'}}\right) \in \mathbb{R}^{HW \times c}$

(10)

The support features are weighted and fused into the query feature map:

$X_{q}'' = X_{q}^{(3)} + \alpha A' P^{(3)} \in \mathbb{R}^{HW \times d}$

(11)

where $\alpha$ is a hyperparameter that controls the strength of feature fusion. In our implementation, $\alpha$ is treated as a manually tunable scalar rather than a learnable parameter, allowing for flexible adjustment based on validation performance or prior knowledge.

This process injects class-aware semantic information into the query features at each scale, allowing the model to focus more effectively on support-relevant regions. This performs a class-conditioned projection of the query toward the subspace spanned by the support features at the same scale, suppressing background and off-class responses. It yields better-aligned, more discriminative maps (lower within-class variance, larger between-class margins), while the residual path preserves training stability.

By applying FAFM across levels $P_3$, $P_4$, and $P_5$, the resulting query feature maps become semantically aligned with the support set. This cross-image feature guidance enhances the performance of the RPN, leading to more accurate candidate regions that are conditioned on the target support classes.
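As a rough illustration of Eqs. (9)–(11) at a single level, the NumPy sketch below fuses class-level support vectors into a flattened query map. Several details are assumptions, not taken from the paper: the function name `fafm_fuse`, the value `alpha=0.5`, and the choice to normalize the softmax of Eq. (10) over the class axis.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fafm_fuse(x_q, p, w, alpha=0.5):
    """Fuse class-level support vectors into one query feature map.

    x_q:   (H, W, d) query feature map at one FPN level.
    p:     (c, d) one support vector per class, as in Eq. (7).
    w:     (d, d_prime) projection shared by query and support, Eq. (9).
    alpha: fusion strength of Eq. (11); 0.5 is an arbitrary illustrative value.
    """
    h, w_sp, d = x_q.shape
    x_flat = x_q.reshape(h * w_sp, d)                   # Eq. (8): HW x d
    x_proj = x_flat @ w                                 # Eq. (9): HW x d'
    p_proj = p @ w                                      # Eq. (9): c x d'
    scores = x_proj @ p_proj.T / np.sqrt(w.shape[1])    # HW x c
    attn = softmax(scores, axis=1)                      # Eq. (10); class-axis softmax is an assumption
    fused = x_flat + alpha * (attn @ p)                 # Eq. (11): residual fusion
    return attn, fused.reshape(h, w_sp, d)

rng = np.random.default_rng(1)
x_q = rng.normal(size=(4, 5, 8))
p = rng.normal(size=(3, 8))
w = rng.normal(size=(8, 4))
attn, fused = fafm_fuse(x_q, p, w, alpha=0.5)
_, unchanged = fafm_fuse(x_q, p, w, alpha=0.0)
```

With `alpha=0.0` the residual path returns the query features untouched, which makes the role of $\alpha$ as a fusion-strength knob easy to check.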

2.2. NLFM

To enhance the model’s ability to discriminate between complex object features in remote sensing images, we design a Non-Linear Fusion Module (NLFM) that integrates support information into the RoI features.

After proposal generation by the RPN, each query RoI is aligned using RoI Align, followed by spatial pooling to produce a fixed-size feature vector, denoted as $f_{roi} \in \mathbb{R}^{d}$, where $d$ is the channel dimension.

Meanwhile, for each class in the support set, we extract class-specific semantic representations by applying global average pooling to the P5-level feature maps. For a support image belonging to class $i$, the resulting support vector is denoted as $f_{\sup}^{(i)} \in \mathbb{R}^{d}$.

To better address the challenges of high intra-class variation and limited training data in few-shot detection, we extend traditional feature fusion strategies with the NLFM, as shown in Figure 4. Unlike prior methods that rely on simple element-wise multiplication to combine RoI and support features, our approach applies a richer set of transformations to enhance semantic alignment and category specificity.

For a given RoI feature $f_{roi} \in \mathbb{R}^{d}$ and a class-specific support vector $f_{\sup}^{(i)} \in \mathbb{R}^{d}$, we construct the fused representation $f_{fused}^{(i)} \in \mathbb{R}^{3d}$ as follows:

$f_{fused}^{(i)} = \mathrm{Concat}\left(\phi_{1}(f_{roi} \odot f_{\sup}^{(i)}),\ \phi_{2}(f_{roi} - f_{\sup}^{(i)}),\ \phi_{3}(f_{roi})\right)$

(12)

Here, $\odot$ denotes element-wise multiplication, and each $\phi_{j}(\cdot)$ represents a nonlinear transformation function defined as:

$\phi_{j}(x) = \mathrm{ReLU}(\mathrm{LN}(W_{j} x + b_{j}))$

(13)

where $W_{j} \in \mathbb{R}^{d \times d}$ and $b_{j} \in \mathbb{R}^{d}$ are learnable parameters for each transformation branch, LN denotes layer normalization, and ReLU [29] introduces nonlinearity.

This formulation enables the model to simultaneously learn class-aware interactions from the multiplicative term $f_{roi} \odot f_{\sup}^{(i)}$, emphasize discriminative differences through the subtraction term $f_{roi} - f_{\sup}^{(i)}$, and preserve spatial and contextual cues via the transformed original RoI feature. These three complementary signals contribute to a more robust and expressive fused representation under limited supervision.

The resulting vector $f_{fused}^{(i)}$ is then forwarded to the RCNN [23] classification and regression heads for category prediction and bounding box refinement.

Through this nonlinear and learnable design, NLFM enhances the model’s ability to extract discriminative features under limited supervision—especially when detecting visually similar objects or small targets in complex scenes.
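The three-branch fusion of Eqs. (12) and (13) can be sketched in a few lines of NumPy. This is an illustrative approximation only: the names `phi` and `nlfm` are hypothetical, the weights are random rather than learned, and the layer normalization here omits the learnable scale and shift that a typical framework implementation would include.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Layer normalization over the feature axis, without learnable affine
    # parameters (a simplifying assumption).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def phi(x, w, b):
    # Eq. (13): ReLU(LN(W x + b)) for a single d-dimensional vector.
    return np.maximum(layer_norm(w @ x + b), 0.0)

def nlfm(f_roi, f_sup, params):
    """Eq. (12): concatenate the three transformed branches.

    f_roi, f_sup: (d,) RoI feature and class-specific support vector.
    params: three (W_j, b_j) pairs, one per branch.
    """
    branches = (f_roi * f_sup,   # multiplicative interaction
                f_roi - f_sup,   # difference term
                f_roi)           # original RoI feature
    return np.concatenate([phi(x, w, b) for x, (w, b) in zip(branches, params)])

rng = np.random.default_rng(2)
d = 16
params = [(rng.normal(size=(d, d)) / np.sqrt(d), np.zeros(d)) for _ in range(3)]
f_roi = rng.normal(size=d)
f_sup = rng.normal(size=d)
f_fused = nlfm(f_roi, f_sup, params)   # shape (3 * d,)
```

Because each branch has its own $W_j$ and $b_j$, the classifier that consumes `f_fused` sees three separately transformed views of the RoI and support pair rather than a single product.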

2.3. Loss Function

The overall loss function of our model is inherited from Meta R-CNN [13] and consists of three components: the RPN loss, the RCNN detection loss, and a meta-learning-based loss that enhances the discriminability of support features.

The total loss is defined as:

$L_{total} = L_{rpn} + L_{rcnn} + \lambda L_{meta}$

(14)

where $L_{rpn}$ denotes the objectness classification and bounding box regression loss from the Region Proposal Network, $L_{rcnn}$ represents the standard classification and regression losses computed from the fused RoI features, and $L_{meta}$ is a meta-level loss designed to enhance the discriminability of support vectors by minimizing intra-class variance and maximizing inter-class separation. The coefficient $\lambda$ is an empirically chosen hyperparameter that controls the relative importance of the meta loss with respect to the standard detection losses.

The RPN loss [30], $L_{rpn}$, is composed of two parts: a binary cross-entropy classification loss [31] for objectness prediction and a Smooth $\ell_{1}$ loss [32] for bounding box regression. It is defined as:

$L_{rpn} = \frac{1}{N_{cls}} \sum_{i} \mathrm{BCE}(p_{i}, p_{i}^{*}) + \frac{1}{N_{reg}} \sum_{i} \mathbb{1}(p_{i}^{*} = 1) \cdot \mathrm{Smooth}_{\ell_{1}}(t_{i} - t_{i}^{*})$

(15)

where $p_{i} \in [0, 1]$ is the predicted objectness score for anchor $i$, and $p_{i}^{*} \in \{0, 1\}$ is the ground-truth label. BCE stands for the binary cross-entropy function, and $\mathrm{Smooth}_{\ell_{1}}$ denotes the smooth L1 loss. The predicted and ground-truth bounding box regression targets are denoted by $t_{i} \in \mathbb{R}^{4}$ and $t_{i}^{*} \in \mathbb{R}^{4}$, respectively. The indicator function $\mathbb{1}(p_{i}^{*} = 1)$ ensures that the regression loss is only applied to positive anchors. $N_{cls}$ and $N_{reg}$ are the total numbers of anchors used for normalization.
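For concreteness, Eq. (15) can be evaluated on a handful of anchors with the NumPy sketch below. It is a simplified illustration: the function names are hypothetical, the smooth L1 transition point `beta=1.0` and the summation over the four box coordinates are assumptions, and both normalizers default to the number of anchors passed in.

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    # Quadratic for |x| < beta, linear beyond; beta=1.0 is an assumed default.
    ax = np.abs(x)
    return np.where(ax < beta, 0.5 * ax ** 2 / beta, ax - 0.5 * beta)

def bce(p, y, eps=1e-12):
    # Element-wise binary cross-entropy; clipping avoids log(0).
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def rpn_loss(p, p_star, t, t_star, n_cls=None, n_reg=None):
    """Eq. (15) for a batch of anchors.

    p:      (n,) predicted objectness in [0, 1].
    p_star: (n,) ground-truth labels in {0, 1}.
    t, t_star: (n, 4) predicted and target box regression parameters.
    """
    n_cls = len(p) if n_cls is None else n_cls
    n_reg = len(p) if n_reg is None else n_reg
    cls_term = bce(p, p_star).sum() / n_cls
    # The indicator p_star == 1 masks out negative anchors; the per-anchor
    # regression loss is summed over the four coordinates (assumption).
    reg_term = (p_star[:, None] * smooth_l1(t - t_star)).sum() / n_reg
    return cls_term + reg_term

p = np.array([0.5, 0.5])
p_star = np.array([1.0, 0.0])
t_star = np.zeros((2, 4))
loss_perfect_boxes = rpn_loss(p, p_star, t_star.copy(), t_star)

t_neg_shifted = t_star.copy()
t_neg_shifted[1] += 5.0          # error on the negative anchor only
loss_neg_shifted = rpn_loss(p, p_star, t_neg_shifted, t_star)

t_pos_shifted = t_star.copy()
t_pos_shifted[0, 0] = 2.0        # error on one coordinate of the positive anchor
loss_pos_shifted = rpn_loss(p, p_star, t_pos_shifted, t_star)
```

The second call shows the effect of the indicator: a regression error on a negative anchor leaves the loss unchanged, while the same kind of error on a positive anchor increases it.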

The RCNN loss, $L_{rcnn}$, includes multi-class classification and bounding box regression on the fused RoI features. It is formulated as:

$L_{rcnn} = \frac{1}{N} \sum_{j} \mathrm{CE}(q_{j}, q_{j}^{*}) + \frac{1}{N_{pos}} \sum_{j} \mathbb{1}(q_{j}^{*} \geq 1) \cdot \mathrm{Smooth}_{\ell_{1}}(b_{j} - b_{j}^{*})$

(16)

where $q_{j}$ denotes the predicted class probabilities for the $j$-th RoI, $q_{j}^{*}$ is the ground-truth class label, $b_{j}$ and $b_{j}^{*}$ are the predicted and true bounding box parameters, respectively, CE stands for the cross-entropy loss, and $\mathrm{Smooth}_{\ell_{1}}$ denotes the smooth L1 loss. $N$ is the total number of RoIs in the mini-batch used for classification loss normalization, and $N_{pos}$ is the number of positive RoIs used for bounding box regression loss normalization.

The meta loss, $L_{meta}$, is introduced to explicitly enforce discriminative support feature representations. Each support feature vector $f_{\sup}^{(i)}$ is passed through a linear classifier $W_{meta}$ to produce logits:

$z^{(i)} = W_{meta} \cdot f_{\sup}^{(i)}$

(17)

which are then compared to the one-hot encoded ground-truth label $y^{(i)}$ using the cross-entropy loss:

$L_{meta} = \frac{1}{N} \sum_{i = 1}^{N} \mathrm{CE}(z^{(i)}, y^{(i)})$

(18)

Together, these losses jointly optimize the model, ensuring accurate object proposals, precise detection, and robust, discriminative support feature learning for few-shot detection scenarios.
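The meta loss of Eqs. (17) and (18) and its weighting in Eq. (14) can be sketched as follows. The function names, the random support vectors, and the value `lam=1.0` are illustrative assumptions; the paper only states that $\lambda$ is chosen empirically.

```python
import numpy as np

def cross_entropy(logits, labels):
    # Mean softmax cross-entropy; labels are integer class indices, which is
    # equivalent to comparing against one-hot targets.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def meta_loss(f_sup, w_meta, labels):
    """Eqs. (17)-(18).

    f_sup:  (n, d) support vectors, one row per support sample.
    w_meta: (c, d) linear classifier weights.
    labels: (n,) integer class index of each support vector.
    """
    logits = f_sup @ w_meta.T      # Eq. (17), applied row-wise
    return cross_entropy(logits, labels)

def total_loss(l_rpn, l_rcnn, l_meta, lam=1.0):
    # Eq. (14); lam=1.0 is an assumed placeholder, not a reported setting.
    return l_rpn + l_rcnn + lam * l_meta

rng = np.random.default_rng(3)
f_sup = rng.normal(size=(6, 8))
labels = np.array([0, 1, 2, 0, 1, 2])
# With an all-zero classifier every class is equally likely.
l_meta_uniform = meta_loss(f_sup, np.zeros((3, 8)), labels)
l_total = total_loss(1.0, 2.0, 3.0, lam=0.5)
```

An uninformative classifier gives a loss of $\log c$ for $c$ classes, which is a convenient sanity check before training.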

2.4. Evaluation Criteria

To comprehensively evaluate the performance of our proposed few-shot object detection model in remote sensing imagery, we adopt a set of standard evaluation metrics widely used in the object detection literature. These metrics assess detection accuracy, localization precision, and the robustness of the model under limited supervision.

(1): Intersection over Union (IoU) [33]: IoU is used to determine whether a predicted bounding box correctly matches a ground truth object.
(2): Average Precision (AP) [34]: Average Precision (AP) measures the area under the precision-recall curve for each class c, and is defined as:

${AP}_{c} = \int_{0}^{1} {Precision}_{c} (r) d r$

(19)

where ${Precision}_{c} (r)$ denotes the precision at recall level r for class c.
(3): Precision, Recall, and F1-Score: These metrics provide detailed insight into the detection performance.

$Precision = \frac{TP}{TP + FP}$

(20)

$Recall = \frac{TP}{TP + FN}$

(21)

$F1\text{-score} = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$

(22)
(4): Proposal Recall (Recall@N): To evaluate the quality of region proposals generated by the Region Proposal Network (RPN), we adopt the Recall@N metric, which quantifies the ability of the top-N proposals to cover ground-truth objects with sufficient overlap. It is defined as:

$Recall @ N = \frac{1}{| G |} \sum_{g \in G} 1 (max_{p \in P_{N}} IoU (g, p) \geq τ)$

(23)

where $G$ is the set of ground-truth bounding boxes with total number $| G |$ , $P_{N}$ denotes the top-N proposals generated by the RPN, $IoU (g, p)$ represents the Intersection-over-Union between ground-truth box g and proposal p, $τ$ is the IoU threshold (typically set to 0.5), and $1 (\cdot)$ is the indicator function that equals 1 if the condition holds, and 0 otherwise.
(5): Inference Efficiency: We also report inference time per image and model parameter size to assess practical deployment viability, especially for large-scale remote sensing tasks.
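The metrics in items (1)–(4) can be illustrated with a small, self-contained Python sketch. It is not the evaluation code used in the paper: boxes are assumed to be axis-aligned `[x1, y1, x2, y2]` lists, proposals are assumed to be pre-sorted by descending score, and `average_precision` uses a plain step-wise approximation of the integral in Eq. (19), whereas benchmark toolkits usually apply an interpolated variant.

```python
import numpy as np

def iou(a, b):
    # Intersection over Union of two boxes given as [x1, y1, x2, y2].
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_n(gt_boxes, proposals, n, tau=0.5):
    # Eq. (23): fraction of ground-truth boxes covered by any of the top-n proposals.
    if not gt_boxes:
        return 0.0
    top = proposals[:n]
    hits = sum(1 for g in gt_boxes if any(iou(g, p) >= tau for p in top))
    return hits / len(gt_boxes)

def precision_recall_f1(tp, fp, fn):
    # Eqs. (20)-(22), with zero returned when a denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def average_precision(scores, is_tp, n_gt):
    # Step-wise approximation of Eq. (19) for one class.
    order = np.argsort(-np.asarray(scores, dtype=float))
    hits = np.asarray(is_tp, dtype=float)[order]
    tp = np.cumsum(hits)
    fp = np.cumsum(1.0 - hits)
    precision = tp / (tp + fp)
    recall = tp / n_gt
    prev_recall = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - prev_recall) * precision))

gt = [[0, 0, 10, 10], [20, 20, 30, 30]]
props = [[0, 0, 10, 9], [50, 50, 60, 60], [20, 20, 30, 30]]
r2 = recall_at_n(gt, props, n=2)
r3 = recall_at_n(gt, props, n=3)
ap = average_precision([0.9, 0.8, 0.7], [1, 0, 1], n_gt=2)
```

In this toy case only one of the two ground-truth boxes is covered by the top two proposals, and the second is recovered once the third proposal is admitted, which is the behaviour Recall@N is meant to expose.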

3. Results

In this section, we present our experimental results. We begin by introducing the datasets used in our study, highlighting their main features and outlining the experimental settings. We then report the performance of our meta-learning-based few-shot object detection approach on these datasets, benchmarking it against state-of-the-art methods. Finally, to assess the individual contribution of each component of our model, we conduct ablation experiments that quantitatively measure their impact on overall detection performance.

3.1. Datasets

We conducted experiments on three remote sensing object detection datasets:

(1): NWPU VHR-10 [35]: The NWPU VHR-10 dataset is a widely used benchmark for object detection in very high-resolution remote sensing images. It contains 800 optical images acquired from various urban and rural areas, with spatial resolutions ranging from 0.5 to 2 m. The dataset covers 10 common object categories such as airplanes, ships, storage tanks, baseball diamonds, and vehicles. Each object is annotated with horizontal bounding boxes, providing accurate localization information for detection tasks. NWPU VHR-10 poses challenges including significant variations in object scale, orientation, and complex backgrounds, which reflect real-world scenarios in aerial imagery analysis. Due to its diversity and moderate size, NWPU VHR-10 serves as a valuable benchmark for evaluating and comparing object detection algorithms designed for remote sensing imagery.
(2): iSAID [36]: In remote sensing object detection, iSAID serves as a large, challenging benchmark with 2806 high-resolution images and 655,451 annotated objects spanning 15 common categories. It exhibits dense layouts, strong scale variation, arbitrary orientations, class imbalance, and cluttered backgrounds. Many targets are tiny and visually ambiguous, so effective detectors must leverage multi-scale features and context to localize them reliably. These characteristics make iSAID well suited for evaluating few-shot methods, stressing robustness to small objects, crowding, and appearance diversity without dataset-specific tuning.
(3): DIOR [37]: The DIOR dataset (object DetectIon in Optical Remote sensing images) is a large-scale, publicly available dataset specifically designed for object detection in remote sensing imagery. It contains over 23,000 images captured from diverse geographic regions, covering both urban and rural environments to ensure a wide range of scenarios and conditions. The dataset comprises 20 object categories frequently encountered in aerial and satellite images, including airplanes, ships, vehicles, and storage tanks. Each object instance is annotated with a horizontal bounding box. DIOR presents significant challenges such as substantial scale variations, dense and cluttered backgrounds, and high intra-class diversity, making it a comprehensive benchmark for evaluating the robustness and accuracy of object detection methods in remote sensing applications.

3.2. Experimental Setting

Following common conventions adopted by numerous state-of-the-art few-shot object detection methods, we divide the NWPU VHR-10, iSAID and DIOR datasets into base and novel categories to evaluate the generalization capability of our model. The base classes are used for meta-training, where the model learns general object detection knowledge, while the novel classes are reserved for meta-testing to assess few-shot adaptation. This split ensures that the model is tested on categories it has never seen during training, aligning with the standard protocol for few-shot object detection benchmarks. The category split is summarized in the tables below.

Table 1, Table 2 and Table 3 show the specific category splits. For a given split, the classes listed under Novel are held out during meta-training, while the remaining classes (base classes) are used to train the detector. Novel classes are used only in the meta-fine-tuning phase. Multiple splits are used to reduce category selection bias.

In our experiments, we adopt a ResNet-101 backbone with FPN structure for image feature extraction. The backbone is initialized with weights pre-trained on the ImageNet dataset, which provides strong low-level and high-level feature representations for downstream detection tasks. This configuration helps the model to effectively capture the scale variance and semantic diversity of remote sensing targets.

We implement our model using the MMFewShot [38] detection framework, which offers modular, reproducible, and scalable implementations for few-shot object detection. The training and inference processes are conducted on a single NVIDIA GeForce RTX 4090 GPU with 24 GB of memory.

To mitigate overfitting in the few-shot regime, we employ a comprehensive data augmentation strategy during both meta-training and meta-fine-tuning. This strategy includes random horizontal and vertical flipping and random rotation to simulate variations in object orientation and viewpoint. To increase robustness against changes in illumination and sensor characteristics, we apply mild random color jittering to brightness, contrast, and saturation. Furthermore, random scaling followed by cropping to the original input size is used to enhance the model’s invariance to object scale. All transformations are applied independently to each image in an episode, with geometric transformations applied consistently to both the image and its corresponding bounding box annotations for support samples. This pipeline significantly increases the effective diversity of the limited training data, helping the model learn more robust and generalizable features.
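The requirement that geometric transformations stay consistent between an image and its boxes can be illustrated with a minimal sketch. It assumes axis-aligned boxes given as (x1, y1, x2, y2) in pixel-edge coordinates and an image stored as a nested list of rows. The function names `hflip_with_boxes` and `vflip_with_boxes` are hypothetical and are not part of MMFewShot.

```python
def hflip_with_boxes(image, boxes):
    """Horizontally flip an image (list of rows) together with its boxes.

    Boxes are (x1, y1, x2, y2) in pixel-edge coordinates, so a box covering
    only the first column of a 4-pixel-wide image is (0, y1, 1, y2).
    """
    width = len(image[0])
    flipped_image = [list(reversed(row)) for row in image]
    # A point at x maps to width - x, so the left and right edges swap roles.
    flipped_boxes = [(width - x2, y1, width - x1, y2) for (x1, y1, x2, y2) in boxes]
    return flipped_image, flipped_boxes


def vflip_with_boxes(image, boxes):
    """Vertically flip an image (list of rows) together with its boxes."""
    height = len(image)
    flipped_image = [list(row) for row in reversed(image)]
    # A point at y maps to height - y, so the top and bottom edges swap roles.
    flipped_boxes = [(x1, height - y2, x2, height - y1) for (x1, y1, x2, y2) in boxes]
    return flipped_image, flipped_boxes


if __name__ == "__main__":
    img = [[1, 2, 3, 4],
           [5, 6, 7, 8]]
    print(hflip_with_boxes(img, [(0, 0, 1, 2)]))
```

Applying the same mapping to pixels and box edges keeps each annotation on the object it originally enclosed; rotation and scaling follow the same principle with their own coordinate maps.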

During the meta-training phase, we train the model for a total of 20,000 iterations using episodic training with a batch size of 15 tasks per iteration. We use the SGD optimizer, setting the initial learning rate to 0.0005, the momentum coefficient to 0.9, and a weight decay of 0.0001 to mitigate overfitting.
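To make the optimizer settings concrete, the following sketch applies one SGD update with momentum and L2 weight decay using the values quoted above. It follows the update convention of PyTorch's `torch.optim.SGD` without dampening or Nesterov momentum, which is an assumption about the underlying implementation. The function name `sgd_step` is hypothetical.

```python
def sgd_step(weights, grads, velocity, lr=0.0005, momentum=0.9, weight_decay=0.0001):
    """One SGD update with momentum and L2 weight decay on lists of floats."""
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + weight_decay * w   # weight decay folded into the gradient
        v = momentum * v + g       # momentum buffer accumulates past gradients
        new_velocity.append(v)
        new_weights.append(w - lr * v)
    return new_weights, new_velocity


if __name__ == "__main__":
    w, v = sgd_step([1.0], [2.0], [0.0])
    print(w, v)
```

With a zero initial buffer, the first step reduces to plain gradient descent on the decayed gradient; later steps move further along directions in which gradients persist.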

In the meta-testing phase, the model is adapted using 300 iterations with a reduced learning rate of 0.0001 to ensure stable learning under the few-shot setting. This two-stage learning framework enables the model to generalize well to novel object categories with limited training samples, while retaining discriminative capabilities on base classes.

3.3. Results on NWPU VHR-10 Dataset

In this subsection, we present the few-shot detection results on the NWPU VHR-10 [35] dataset under different shot settings (3-shot, 5-shot, 10-shot, and 20-shot) across two splits. We compare our proposed method with several state-of-the-art approaches including TFA [16], Meta R-CNN [13], P-CNN [15], FsDetView [14], G-FSDet [19], FCT [39] and FS-DETR [40]. The evaluation metric is nAP, the average precision on novel classes, which directly reflects performance on the categories seen only in the few-shot stage. As shown in Table 4, our method outperforms the baselines in most settings, particularly in low-shot regimes, demonstrating superior generalization to novel categories.

Across both splits, our method leads in most shot settings, and the trends align with how each approach fuses support information. TFA, which relies on minimal fine-tuning without explicit support conditioning, lags far behind in all regimes. Meta R-CNN and FsDetView inject support only at the RoI stage and predominantly through linear modulation or element-wise mixing; this late, linear fusion is brittle for look-alike categories in remote sensing. P-CNN improves proposals via a single prototype, but this under-represents the strong intra-class diversity typical of aerial targets. G-FSDet offers stronger proposal priors, yet still lacks the combination of early, class-conditioned, multi-scale guidance and expressive RoI fusion. In contrast, our pipeline first uses EFFM to steer the RPN toward class-relevant regions across scales and then applies NLFM to preserve complementary views before mixing. This two-stage alignment is consistent with the results: in the most challenging 3-shot setting we improve over the strongest baseline by about 9 points on Split 1 and 8 points on Split 2; at 10-shot we are on par with the best method on Split 1 but hold a double-digit margin on Split 2; and at 20-shot we retain a clear lead on both splits. These results indicate that early support-aware proposals plus non-linear RoI fusion reduce within-class variance and increase between-class separability, yielding consistent gains from low to high shot regimes.

The visual comparison in Figure 5 illustrates the detection results of our model alongside G-FSDet [19] and P-CNN [15]. G-FSDet [19] fails to detect the novel class tennis court, while P-CNN [15] misses several instances of the small object airplane. Both models exhibit missed detections for the small object class vehicle. In contrast, our model demonstrates superior detection performance, successfully identifying both base and novel classes with higher accuracy and completeness.

Figure 5 illustrates two recurring effects that are consistent with our design and with the quantitative results in Table 4. First, our detector tends to recover small and densely distributed targets (e.g., ships/vehicles) that competing methods miss; this matches EFFM’s role of injecting class-conditioned, multi-scale information into the proposal stage and explains the larger margins at 3-shot on both splits and the clear lead at 20-shot. Second, compared with competing methods, our predictions exhibit tighter localization and higher-confidence true positives under the same NMS settings. Boxes from our model align more closely with object extents, and correct detections appear with stronger confidence, whereas baselines more often yield looser boxes and borderline scores. This is consistent with NLFM’s multi-branch, non-linear RoI fusion which produces more discriminative features and better confidence calibration and complements EFFM’s proposal improvements, aligning with the nAP trends in Table 4.

Table 5 presents the per-class detection accuracy of our proposed model on the NWPU VHR-10 [35] dataset under 3-shot, 5-shot, 10-shot, and 20-shot settings, averaged over three independent runs with random sampling. The results show that detection accuracy generally improves as the number of shots increases, indicating the model’s ability to effectively learn from limited annotations. Notably, even in low-shot scenarios, the model achieves strong performance on classes such as airplane, ship, and harbor, which tend to have consistent shapes and distinct visual cues. In contrast, categories like vehicle and bridge, which exhibit higher intra-class variability, smaller object sizes, or more complex backgrounds, remain more challenging. However, as more training examples are introduced, the model is better able to capture these variations and enhance its detection performance across both base and novel classes, highlighting its robust generalization ability in few-shot detection tasks.

3.4. Results on iSAID Dataset

In this subsection, we report our model’s few-shot detection performance on the iSAID dataset across two evaluation splits. Compared with NWPU VHR-10, iSAID poses greater difficulties: scenes are more diverse and cluttered, and objects exhibit stronger scale variation and orientation changes. Despite this increased complexity, our method attains the best performance across all reported splits and shot settings on iSAID.

As shown in Table 6, novel AP rises with the number of shots for all methods, with the most pronounced gain typically from 5 to 10 shots and a smaller but consistent improvement from 10 to 20 shots. Our method attains the best result in every split configuration. In Split 1, the advantage over the strongest baseline (G-FSDet) remains stable at roughly six to seven points across 3, 5, 10, and 20 shots, indicating steady benefits as supervision increases. In Split 2, the margin widens as the shot count increases, indicating that additional exemplars help our method handle greater scene and appearance variability more effectively.

Figure 6 shows the visual comparison between our model and the baselines. Both G-FSDet and P-CNN struggle on small targets, and neither reliably detects the ship instances. G-FSDet further produces a false positive on the harbor class, while P-CNN entirely misses tennis court and harbor. For the objects that are detected, both baselines exhibit boxes that deviate noticeably from the ground truth. By contrast, our model recovers all objects present, places tighter boxes that closely match object extents, and assigns higher confidence to true positives, reflecting more reliable localization and scoring.

3.5. Results on DIOR Dataset

A performance comparison between our method and other advanced few-shot object detectors on the DIOR [37] dataset is presented in Table 7. Compared to the NWPU VHR-10 and iSAID datasets, the performance on DIOR is generally lower for all methods due to its larger category set, more complex scenes, and higher intra-class variance. Despite these challenges, our model consistently achieves the best results across all four data splits and shot settings, confirming its strong generalization ability.

On DIOR, the split-wise trends mirror how each method handles support information and scene clutter. On Split 1, where most approaches improve with more shots, our performance keeps pace with the strongest baseline at 3-shot and then pulls clear as shots increase, indicating that our RoI fusion continues to benefit from additional examples rather than plateauing. On Split 2, which mixes look-alike categories and textured backgrounds, our model leads all competitors, suggesting that early, class-conditioned, multi-scale guidance recovers small and dense targets while the non-linear RoI fusion sharpens localization and confidence in ambiguous scenes. On Split 3, the gap is the largest: as the number of shots increases, our performance keeps improving while the stronger baselines level off. This suggests that methods adding support only at the RoI stage with mostly linear fusion (Meta R-CNN, FsDetView) or using a single prototype (P-CNN) do not capture the variation within DIOR classes. On the more difficult Split 4, our method also stays ahead at all shot counts. By contrast, TFA, which fine-tunes without explicit support information, trails in every setting.

Compared with the results on the NWPU VHR-10 [35] dataset, our model demonstrates more stable improvements on DIOR [37] across all splits, particularly under lower-shot conditions. This suggests that the design of our meta-learning framework, including support-guided proposal enhancement and feature fusion, not only improves detection accuracy but also enhances robustness in highly diverse scenarios.

Overall, these results on DIOR [37] confirm that our method provides superior performance and generalizability across different remote sensing benchmarks, especially when data is scarce and category distributions are more complex.

The visualization results of our model compared with P-CNN [15] and G-FSDet [19] on the DIOR dataset under the 10-shot setting are shown in Figure 7. From the comparison, we observe that P-CNN [15] fails to detect the bridge category entirely, while both P-CNN [15] and G-FSDet [19] struggle with small objects such as ship, either missing them or localizing them imprecisely. Furthermore, G-FSDet [19] fails to detect ground track field, and although P-CNN [15] manages to produce detections for airport and ground track field, the results suffer from low accuracy and confidence.

These issues largely stem from the models’ limited ability to capture class-specific cues from only a few examples, especially when faced with visually complex backgrounds or subtle object appearances. In contrast, our method shows clear improvements: bridge is correctly detected despite its variable structure; ship and other small objects are localized more precisely; and ground track field and airport are recognized with stronger spatial accuracy. These advantages arise from how our model enhances the interaction between the support examples and query image by allowing the model to focus more on category-relevant regions and suppress irrelevant distractions. This targeted focus is especially beneficial for novel categories with ambiguous boundaries or small sizes, explaining our model’s consistently better visual performance.

Table 8 presents the accuracy of our proposed method for each category in the DIOR [37] dataset. The results are averaged over three independent runs with random sampling and cover both base and novel classes under 3-shot, 5-shot, 10-shot, and 20-shot settings. Among the base classes, categories like Airplane, Windmill, and Basketball court achieve consistently high accuracy across all shot settings, indicating the model’s ability to generalize well when sufficient visual consistency exists. In contrast, categories such as Bridge and Harbor show relatively lower performance, likely due to structural ambiguity and background clutter in remote sensing imagery.

For novel classes, performance is understandably lower due to limited training samples. Nonetheless, the accuracy of classes like Tennis court improves significantly with more shots, reaching 80.13% in the 10-shot setting. On the other hand, Dam and Vehicle remain challenging, especially under low-shot conditions, suggesting their visual variability and small object size pose difficulties for few-shot generalization. Overall, the table highlights the model’s strong performance on a wide range of categories and illustrates both its strengths and the remaining challenges in handling hard-to-detect novel objects.

4. Discussion

To comprehensively assess the effectiveness of our proposed few-shot object detection framework, this section presents an in-depth discussion of its key design choices, sensitivity to hyperparameters, and computational efficiency. We begin with ablation studies to isolate and evaluate the contributions of core modules, particularly EFFM and NLFM, in order to understand how each component influences the detection performance across different shot settings. We then analyze the impact of varying the meta loss weight $λ$ to determine the most effective configuration and to assess the model’s robustness to this hyperparameter. Finally, we compare our model’s parameter size and inference speed against other state-of-the-art few-shot detectors to illustrate its practical trade-off between accuracy and computational complexity. These analyses together provide a holistic understanding of the model’s strengths, limitations, and applicability in real-world scenarios.

4.1. Ablation Study

To further evaluate the effectiveness of EFFM, we conduct an ablation experiment focusing on the quality of region proposals generated by the RPN. Specifically, we compare the Recall@N performance of the baseline RPN and RPN augmented with EFFM under different few-shot settings (3, 5, 10, and 20 shots). The metric Recall@N measures the percentage of ground-truth objects that are covered by the top-N proposals (with IoU ≥ 0.5), providing insight into the proposal network’s ability to recall relevant regions for downstream detection.
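The Recall@N definition above can be sketched for a single image as follows. Boxes are assumed to be axis-aligned (x1, y1, x2, y2) tuples, and the function names `iou` and `recall_at_n` are illustrative rather than taken from the authors' code. A dataset-level figure would pool the covered and total ground-truth counts over all images.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at_n(gt_boxes, proposals, scores, n, iou_thr=0.5):
    """Fraction of ground-truth boxes matched by any of the top-n proposals."""
    if not gt_boxes:
        return 0.0
    # Rank proposals by objectness score and keep the n highest.
    order = sorted(range(len(proposals)), key=lambda i: scores[i], reverse=True)[:n]
    top = [proposals[i] for i in order]
    covered = sum(1 for gt in gt_boxes if any(iou(gt, p) >= iou_thr for p in top))
    return covered / len(gt_boxes)


if __name__ == "__main__":
    gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
    props = [(0, 0, 10, 9), (50, 50, 60, 60), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    for n in (1, 2, 3):
        print(n, recall_at_n(gts, props, scores, n))
```

Because the metric depends only on whether some proposal overlaps each object, it isolates proposal coverage from the downstream classification head.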

As shown in Table 9, introducing EFFM consistently improves RPN recall across all values of N and shot settings. For instance, in the 3-shot setting, Recall@100 improves from 42.1% to 55.7%, while Recall@1000 improves from 71.0% to 81.6%, indicating a significant enhancement in proposal coverage despite the scarcity of training data. This trend continues as the number of shots increases: under the 20-shot setting, the RPN with EFFM achieves 67.6% Recall@100 and 90.5% Recall@1000, outperforming the baseline RPN by large margins.

These improvements can be attributed to the ability of EFFM to incorporate support-aware contextual information into the proposal generation process. By fusing fine-grained support features into the query branch at an early stage, EFFM strengthens the RPN’s sensitivity to class-relevant regions, leading to higher-quality proposals even with limited annotations.

To visualize this trend more clearly, we plot the Recall@N curves under each shot setting in Figure 8. The curves consistently show that the RPN with EFFM achieves higher recall across all N values compared to the baseline. Notably, the performance gap is more pronounced in lower shot scenarios, which demonstrates that EFFM is particularly effective in boosting region proposal quality under extreme data scarcity—one of the core challenges in few-shot object detection.

These results confirm that the EFFM module significantly enhances the RPN’s ability to propose relevant candidate regions, laying a stronger foundation for downstream classification and regression stages.

To assess why the proposed NLFM is needed beyond simple RoI–support combinations, we ablate its components at 10-shot on NWPU VHR-10. The results are presented in Table 10. The baseline uses only the concatenation path. Adding an element-wise multiplication branch raises nAP from 75.28 to 77.75, indicating that explicit class-consistent matching helps. Introducing the subtraction branch yields a further increase to 78.31, showing that the similarity and difference paths are complementary. The largest gain appears when we apply the lightweight non-linear projection after concatenation of the three branches, reaching 81.64. This suggests that merely stacking linear operators underuses the complementary signals, and non-linearity is important to mix and reweight branch information, producing more discriminative RoI features under few-shot conditions.
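The branch structure examined in this ablation can be sketched on plain feature vectors. This is an illustration under stated assumptions, not the authors' implementation: the projection is modeled as a single linear layer followed by ReLU, the features are flat lists rather than RoI tensors, and the name `fuse_roi_with_support` is hypothetical.

```python
def fuse_roi_with_support(roi, support, weight, bias):
    """Fuse an RoI feature with a support feature via three branches.

    roi, support: lists of C floats.
    weight: list of output rows, each with 4*C floats; bias: one float per row.
    """
    concat_branch = roi + support                               # concatenation, 2C values
    product_branch = [r * s for r, s in zip(roi, support)]      # similarity cue, C values
    difference_branch = [r - s for r, s in zip(roi, support)]   # difference cue, C values
    stacked = concat_branch + product_branch + difference_branch
    # Assumed projection: linear map followed by ReLU to mix and reweight branches.
    projected = [sum(w * x for w, x in zip(row, stacked)) + b
                 for row, b in zip(weight, bias)]
    return [max(0.0, v) for v in projected]


if __name__ == "__main__":
    roi, support = [1.0, 2.0], [3.0, 1.0]
    weight = [[1.0] * 8, [-1.0] * 8]
    bias = [0.0, 0.0]
    print(fuse_roi_with_support(roi, support, weight, bias))
```

Dropping the ReLU would leave a purely linear map of the stacked branches, which corresponds to the weaker variants in the ablation that combine branches without a non-linear projection.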

To assess the individual and combined contributions of EFFM and NLFM, we conduct ablation studies under different few-shot settings on the NWPU VHR-10 dataset. The results are presented in Table 11.

The baseline model without EFFM or NLFM achieves relatively modest performance, with a 3-shot nAP of 54.32% and a 20-shot nAP of 76.50%. Introducing EFFM alone results in a substantial performance boost across all settings, with improvements of +4.61%, +4.76%, +5.42%, and +6.77% at 3, 5, 10, and 20 shots, respectively. This demonstrates the strong ability of EFFM to incorporate class-specific support cues into the proposal generation and feature extraction pipeline: the gain is already present in the lowest-shot regime and grows as more support examples become available.

When only NLFM is enabled, we observe notable but slightly smaller gains compared to EFFM: improvements of +3.33% in 3-shot and +5.91% in 20-shot. This suggests that NLFM enhances the model’s representation power through non-linear transformations, helping better distinguish subtle differences between object instances with limited examples.

When both EFFM and NLFM are enabled, the model achieves the highest nAP across all shot settings, reaching 61.46% (3-shot) and 87.64% (20-shot). This indicates that EFFM and NLFM are complementary: while EFFM improves support–query alignment at the feature fusion stage, NLFM boosts the discriminability of feature representations at a deeper semantic level. Their joint effect leads to consistently superior performance, verifying the effectiveness of the full architecture.

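Moreover, combining the two modules helps at every shot setting: relative to the baseline, the full model gains 7.14 points at 3-shot (54.32% to 61.46%) and 11.14 points at 20-shot (76.50% to 87.64%). The benefit is therefore substantial even under severe data scarcity, where effective support-based guidance and feature expressiveness are most needed, and it does not diminish as supervision grows. This reinforces our model’s design motivation, which is to maximize information utilization and enhance generalization in data-limited detection scenarios.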

We also visualize how EFFM reshapes the query features in Figure 9. Before EFFM, the responses are already object-dominated, but they spill into the surrounding background and activate nearby structures, blurring object boundaries. After applying EFFM’s class-conditioned cross-attention, these spurious background activations are substantially reduced while the peaks over true objects are preserved and sharpened.

4.2. Hyperparameters $α$ and $λ$

To further investigate the influence of key hyperparameters in our model, we conduct controlled ablation experiments on the weight coefficient $λ$ of the meta loss and the interpolation factor $α$ used in the support–query fusion process. Specifically, $λ$ balances the contribution of the meta-level supervision to the overall optimization, and thus plays a crucial role in enforcing discriminative support feature representations. On the other hand, $α$ controls the extent to which the fused query features retain information from the original query and the adapted support cues. Proper tuning of these hyperparameters is essential to achieving optimal few-shot detection performance. In the following experiments, we systematically vary $λ$ and $α$ to analyze their individual effects under different shot settings.

Table 12 presents the impact of varying the fusion hyperparameter $α$ on few-shot detection performance across different shot settings. The parameter $α$ controls the balance between the original query feature and the support-enhanced feature in the fusion stage. As observed from the results, a moderate value of $α = 0.6$ consistently yields the best performance under all shot conditions, achieving the highest nAP scores.

When $α$ is set too low, the fused features rely heavily on the original query features and lack sufficient support-driven adaptation, which limits performance, especially under low-shot conditions. Conversely, when $α$ approaches 1.0, the model over-relies on support information, which may introduce noise or reduce robustness due to the variability and sparsity of support examples. This results in a performance drop, particularly for 3-shot and 5-shot cases.

These results suggest that properly tuning the fusion balance is crucial: too little support information leads to under-utilization of guidance, while too much causes overfitting or loss of discriminative query structure. The optimal value ($α = 0.6$) effectively captures support-based guidance while preserving the structural integrity of the query features, thereby enhancing few-shot generalization.
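The behavior described for $α$ matches a convex interpolation between the original query feature and the support-enhanced feature. The sketch below assumes that form; the exact fusion used in the model is defined in the method section, and the name `interpolate_features` is hypothetical.

```python
def interpolate_features(query, support_enhanced, alpha=0.6):
    """Blend query and support-enhanced features element-wise.

    alpha = 0 returns the original query feature unchanged;
    alpha = 1 returns the support-enhanced feature alone.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * q + alpha * s for q, s in zip(query, support_enhanced)]


if __name__ == "__main__":
    query = [1.0, 0.0]
    enhanced = [0.0, 1.0]
    for a in (0.0, 0.6, 1.0):
        print(a, interpolate_features(query, enhanced, a))
```

Under this assumed form, the two failure modes in the text correspond to the two endpoints: small $α$ leaves the query almost unadapted, and $α$ near 1.0 discards most of the query's own structure.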

We also experimented with making $α$ a learnable parameter initialized at 0.6, allowing it to adapt during training. While this approach achieved comparable final performance, it introduced training instability in early epochs and increased convergence time by approximately 15%. Additionally, the learned $α$ values varied significantly across different category splits, reducing reproducibility. The fixed $α = 0.6$ provided more predictable behavior and faster convergence without sacrificing performance.

The ablation study in Table 13 investigates the influence of the meta loss weight $λ$ on few-shot detection performance. When $λ = 0$, i.e., the meta loss is not applied, the model exhibits significantly lower performance across all shot settings. This demonstrates the critical role of the meta-learning objective in enhancing the generalization ability of the model to novel classes.

As $λ$ increases from 0 to 0.05, the detection accuracy steadily improves, with the best results achieved at $λ = 0.05$. This suggests that a moderate contribution from the meta loss effectively guides the support feature representation to become more discriminative, thus benefiting downstream object detection.

However, further increasing $λ$ to 0.1 and 0.2 leads to a slight performance drop. This is likely due to the model overemphasizing the meta loss at the expense of the primary detection objectives, resulting in suboptimal training dynamics. These results indicate that careful tuning of $λ$ is essential, and that setting $λ = 0.05$ offers the best trade-off between classification and regression objectives and meta-level supervision.
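The role of $λ$ can be summarized by a generic weighted objective. This is a sketch consistent with the wording of this subsection rather than a restatement of the model's loss definition: the symbols $\mathcal{L}_{\text{cls}}$, $\mathcal{L}_{\text{reg}}$ and $\mathcal{L}_{\text{meta}}$ are assumed names, and any further terms (for example RPN losses) are omitted.

$$
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{reg}} + \lambda \, \mathcal{L}_{\text{meta}}
$$

With $λ = 0$ the meta term vanishes, and larger values of $λ$ shift the optimization toward meta-level supervision at the expense of the detection terms.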

To evaluate the sensitivity and robustness of our hyperparameter choices, we systematically varied $α$ and $λ$ around their selected values ($α = 0.6$, $λ = 0.05$). Performance remained stable within ±0.2 variations, with nAP fluctuations within ±1.5% across all shot settings on NWPU VHR-10, indicating low sensitivity to minor tuning. The same hyperparameters can be effectively transferred to iSAID and DIOR without adjustment, preserving competitive performance and demonstrating cross-dataset robustness.

4.3. Analysis of Failure Cases

Figure 10 presents three representative failure cases from our experiments on the NWPU VHR-10 dataset under the 10-shot setting. In Case 1, two small ship instances are missed in a dense harbor scene. The red circles highlight these undetected ships. Case 2 shows a missed small airplane that exhibits low contrast against the runway background. Case 3 demonstrates an interesting failure pattern where, among three bridge instances present in the image, only the curved bridge with atypical geometry is missed while the two normal-shaped bridges are correctly detected.

These failure patterns can be analyzed in relation to our model’s architectural design and the inherent challenges of few-shot remote sensing object detection. The missed small targets in Cases 1 and 2 primarily stem from limitations in our MFSFEM. Although MFSFEM extracts support features at multiple scales, extremely small objects occupying only a few pixels in the feature maps may not generate sufficiently strong activations for the level feature queries to capture. This is particularly challenging when such small objects appear in cluttered backgrounds, as the attention mechanism may prioritize more prominent visual features. Additionally, FAFM may allocate insufficient weight to these subtle patterns when integrating support information into query features, especially when the support examples do not adequately represent the full range of scale variations.

Case 3 reveals a different limitation related to our model’s ability to handle significant intra-class shape variations. While our NLFM effectively captures semantic similarity and difference between support and query features, it may struggle with extreme geometric transformations not well-represented in the few support examples. The curved bridge represents an atypical instance that differs substantially from the support examples used during meta training. Our model’s dependence on support prototypes for feature extraction and fusion means that objects deviating significantly from these prototypes may not be adequately recognized.

These failure cases highlight several directions for future improvement. First, enhancing the attention mechanisms in MFSFEM to better capture subtle features of extremely small objects could improve detection of targets like the missed ships and airplane. Second, incorporating more diverse support examples during training, possibly through data augmentation strategies that simulate various object scales and shapes, could increase the model’s robustness to intra-class variations. Third, exploring adaptive thresholding mechanisms in the RPN and classification heads could help recover borderline detections that currently fall below the confidence threshold.

4.4. Complexity and Inference Time

In few-shot object detection, especially in remote sensing applications, it is critical not only to achieve high detection accuracy but also to maintain computational efficiency for practical deployment. Therefore, we evaluate our model’s complexity and inference speed, comparing it with several state-of-the-art few-shot detectors to understand the trade-offs between accuracy and efficiency.

Table 14 reports the number of parameters, inference speed (frames per second, FPS), and inference time per image measured on an NVIDIA RTX 4090 GPU for our method and several competitive baselines.

From the results, our model maintains a moderate parameter size compared to other methods, slightly larger than P-CNN [15] and TFA [16] but significantly smaller than FsDetView [14]. The inclusion of EFFM and NLFM introduces additional computation; however, the inference speed remains competitive at 15.5 FPS, corresponding to an average inference time of 64.5 ms per image.
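The correspondence between throughput and per-image latency quoted above is a simple reciprocal, shown here as a short check. The helper name `latency_ms` is illustrative.

```python
def latency_ms(fps):
    """Average per-image latency in milliseconds for a given frames-per-second rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


if __name__ == "__main__":
    # 15.5 FPS corresponds to roughly 64.5 ms per image.
    print(round(latency_ms(15.5), 1))
```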

While G-FSDet [19] achieves the fastest inference speed at 21 FPS, it comes at the cost of a larger model size. In contrast, P-CNN [15] and TFA [16] have fewer parameters but suffer from notably slower inference speeds, limiting their practical applicability in scenarios requiring real-time or near-real-time detection.

We measure the average inference time per query image on the NWPU VHR-10 dataset using a single NVIDIA RTX 4090 GPU, varying the number of support shots K from 1 to 20. The inference time scales sub-linearly with the support set size, as shown in Table 15. This favorable efficiency stems from the synergistic effect of support feature caching and linear-complexity fusion operations. Once extracted, support features are cached and reused across multiple query images, while the EFFM and NLFM modules are designed with linear computational complexity relative to K. Consequently, the model maintains practical inference latency even with larger support sets, enabling the use of richer contextual information without prohibitive computational burden.
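The caching idea can be sketched as follows: support features for a class are extracted once and then reused for every subsequent query image. The class name `SupportFeatureCache`, its methods, and the toy extractor are hypothetical and stand in for the backbone used in the actual model.

```python
class SupportFeatureCache:
    """Extract support features once per class and reuse them across queries."""

    def __init__(self, extractor):
        self._extractor = extractor   # callable: support image -> feature vector
        self._cache = {}
        self.extractor_calls = 0      # counts how often the extractor actually runs

    def get(self, class_name, support_images):
        if class_name not in self._cache:
            features = []
            for image in support_images:
                features.append(self._extractor(image))
                self.extractor_calls += 1
            # Keep all K support features; the cost is paid once per class.
            self._cache[class_name] = features
        return self._cache[class_name]


if __name__ == "__main__":
    def toy_extractor(image):
        return [float(sum(image))]

    cache = SupportFeatureCache(toy_extractor)
    shots = [[1, 2], [3, 4]]
    for _ in range(3):                # three query images, one extraction pass
        feats = cache.get("ship", shots)
    print(feats, cache.extractor_calls)
```

With this pattern the extraction cost grows with K only on the first call for a class, so the per-query cost is dominated by the fusion operations rather than by repeated backbone passes over the support set.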

Our model strikes a good balance between accuracy and computational efficiency, delivering faster inference than many baselines while maintaining a manageable model size. This balance is crucial for deploying few-shot object detectors in remote sensing systems where both precision and timely response are essential.

In summary, despite the increased complexity introduced by our model’s novel modules, the achieved inference speed and parameter efficiency demonstrate its suitability for real-world applications with reasonable computational resources.

5. Conclusions

In this work, we have proposed a novel meta-learning-based few-shot object detection framework tailored for remote sensing imagery. By introducing EFFM and NLFM, our method effectively leverages support set information to enhance the representation and discrimination of query features. Moreover, the incorporation of a meta-level loss further enforces discriminative support features, improving the overall detection robustness under extremely limited annotated data scenarios.

Extensive experiments on benchmark remote sensing datasets, including NWPU VHR-10, iSAID and DIOR, demonstrate that our approach consistently outperforms state-of-the-art few-shot detection methods across various shot settings. Our ablation studies confirm the individual contributions of each proposed module, highlighting their complementary benefits. Additionally, we provide comprehensive analyses of model complexity and inference speed, verifying that our method achieves a favorable balance between accuracy and computational efficiency.

However, despite these promising results, our model still has some limitations. The Enhanced Feature Fusion Module introduces additional computational overhead due to its internal attention mechanisms, which can limit real-time deployment in resource-constrained environments. Furthermore, while the meta-level loss improves feature discriminability, tuning its weighting hyperparameter requires careful empirical validation, which may restrict the model’s adaptability across diverse datasets. Lastly, our framework currently focuses on single-image detection and does not address temporal or multi-view information that could further improve robustness in remote sensing applications.

Future work will focus on optimizing the computational efficiency of the feature fusion modules, exploring adaptive hyperparameter tuning strategies, and extending the framework to incorporate spatiotemporal cues and multi-sensor fusion. These directions aim to enhance both the practicality and performance of few-shot remote sensing object detection in real-world scenarios.

Author Contributions

Conceptualization, H.Q., W.Z., T.Z. and G.Z.; methodology, H.Q. and W.Z.; software, H.Q.; validation, H.Q., W.Z., T.Z. and G.Z.; formal analysis, H.Q. and W.Z.; data curation, H.Q.; writing—original draft preparation, H.Q.; writing—review and editing, H.Q., W.Z. and T.Z.; visualization, H.Q.; project administration, W.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available at https://drive.google.com/drive/folders/1UdlgHk49iu6WpcJ5467iT-UqNPpx__CC (accessed on 1 December 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Gui, S.; Song, S.; Qin, R.; Tang, Y. Remote Sensing Object Detection in the Deep Learning Era—A Review. Remote Sens. 2024, 16, 327.
2. Qin, R.; Liu, T. A Review of Landcover Classification with Very-High Resolution Remotely Sensed Optical Images—Analysis Unit, Model Scalability and Transferability. Remote Sens. 2022, 14, 646.
3. Cheng, G.; Xie, X.; Han, J.; Guo, L.; Xia, G. Remote Sensing Image Scene Classification Meets Deep Learning: Challenges, Methods, Benchmarks, and Opportunities. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 3735–3756.
4. O’Shea, K.; Nash, R. An Introduction to Convolutional Neural Networks. arXiv 2015, arXiv:1511.08458.
5. Li, X.; Deng, J.; Fang, Y. Few-Shot Object Detection on Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5601614.
6. Alajaji, D.; Alhichri, H.S.; Ammour, N.; Alajlan, N. Few-Shot Learning For Remote Sensing Scene Classification. In Proceedings of the 2020 Mediterranean and Middle-East Geoscience and Remote Sensing Symposium (M2GARSS), Tunis, Tunisia, 9–11 March 2020; pp. 81–84.
7. Karlinsky, L.; Shtok, J.; Harary, S.; Schwartz, E.; Aides, A.; Feris, R.; Giryes, R.; Bronstein, A.M. RepMet: Representative-Based Metric Learning for Classification and Few-Shot Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
8. Antonelli, S.; Avola, D.; Cinque, L.; Crisostomi, D.; Foresti, G.L.; Galasso, F.; Marini, M.R.; Mecca, A.; Pannone, D. Few-Shot Object Detection: A Survey. ACM Comput. Surv. 2022, 54, 242.
9. Wang, Y.; Li, J.; Guo, J.; Liu, R.; Cao, Q.; Li, D.; Wang, L. SFIDM: Few-Shot Object Detection in Remote Sensing Images with Spatial-Frequency Interaction and Distribution Matching. Remote Sens. 2025, 17, 972.
10. Wang, L.; Zhang, M.; Gao, X.; Shi, W. Advances and Challenges in Deep Learning-Based Change Detection for Remote Sensing Images: A Review through Various Learning Paradigms. Remote Sens. 2024, 16, 804.
11. Leng, J.; Chen, T.; Gao, X.; Yu, Y.; Wang, Y.; Gao, F.; Wang, Y. A Comparative Review of Recent Few-Shot Object Detection Algorithms. arXiv 2021, arXiv:2111.00201.
12. Kang, B.; Liu, Z.; Wang, X.; Yu, F.; Feng, J.; Darrell, T. Few-Shot Object Detection via Feature Reweighting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
13. Yan, X.; Chen, Z.; Xu, A.; Wang, X.; Liang, X.; Lin, L. Meta R-CNN: Towards General Solver for Instance-Level Low-Shot Learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
14. Xiao, Y.; Marlet, R. Few-Shot Object Detection and Viewpoint Estimation for Objects in the Wild. arXiv 2020, arXiv:2007.12107.
15. Cheng, G.; Yan, B.; Shi, P.; Li, K.; Yao, X.; Guo, L.; Han, J. Prototype-CNN for Few-Shot Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5604610.
16. Wang, X.; Huang, T.E.; Darrell, T.; Gonzalez, J.E.; Yu, F. Frustratingly simple few-shot object detection. In Proceedings of the 37th International Conference on Machine Learning (ICML’20), Vienna, Austria, 12–18 July 2020.
17. Chen, H.; Wang, Y.; Wang, G.; Qiao, Y. LSTD: A low-shot transfer detector for object detection. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence (AAAI’18/IAAI’18/EAAI’18), New Orleans, LA, USA, 2–7 February 2018; AAAI Press: Washington, DC, USA, 2018.
18. Zhao, Z.; Tang, P.; Zhao, L.; Zhang, Z. Few-Shot Object Detection of Remote Sensing Images via Two-Stage Fine-Tuning. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8021805.
19. Zhang, T.; Zhang, X.; Zhu, P.; Jia, X.; Tang, X.; Jiao, L. Generalized few-shot object detection in remote sensing images. ISPRS J. Photogramm. Remote Sens. 2023, 195, 353–364.
20. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), Red Hook, NY, USA, 4–9 December 2017; pp. 6000–6010.
21. Lin, T.Y.; RoyChowdhury, A.; Maji, S. Bilinear CNN Models for Fine-Grained Visual Recognition. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1449–1457.
22. Hospedales, T.; Antoniou, A.; Micaelli, P.; Storkey, A. Meta-Learning in Neural Networks: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 5149–5169.
23. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015.
24. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
25. Lin, M.; Chen, Q.; Yan, S. Network in Network. arXiv 2013, arXiv:1312.4400.
26. Fan, Q.; Zhuo, W.; Tang, C.K.; Tai, Y.W. Few-Shot Object Detection with Attention-RPN and Multi-Relation Detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020.
27. Lin, T.; Dollár, P.; Girshick, R.B.; He, K.; Hariharan, B.; Belongie, S.J. Feature Pyramid Networks for Object Detection. arXiv 2016, arXiv:1612.03144.
28. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. arXiv 2020, arXiv:2005.12872.
29. Rasamoelina, A.D.; Adjailia, F.; Sinčák, P. A Review of Activation Function for Artificial Neural Network. In Proceedings of the 2020 IEEE 18th World Symposium on Applied Machine Intelligence and Informatics (SAMI), Herlany, Slovakia, 23–25 January 2020; pp. 281–286.
30. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
31. Mao, A.; Mohri, M.; Zhong, Y. Cross-Entropy Loss Functions: Theoretical Analysis and Applications. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; Volume 202, pp. 23803–23828.
32. Wei, L.; Zheng, C.; Hu, Y. Oriented Object Detection in Aerial Images Based on the Scaled Smooth L1 Loss Function. Remote Sens. 2023, 15, 1350.
33. Zhou, D.; Fang, J.; Song, X.; Guan, C.; Yin, J.; Dai, Y.; Yang, R. IoU Loss for 2D/3D Object Detection. In Proceedings of the 2019 International Conference on 3D Vision (3DV), Québec City, QC, Canada, 16–19 September 2019; pp. 85–94.
34. Henderson, P.; Ferrari, V. End-to-end training of object class detectors for mean average precision. arXiv 2016, arXiv:1607.03476.
35. Cheng, G.; Han, J.; Lu, X. Remote Sensing Image Scene Classification: Benchmark and State of the Art. Proc. IEEE 2017, 105, 1865–1883.
36. Zamir, S.W.; Arora, A.; Gupta, A.; Khan, S.H.; Sun, G.; Khan, F.S.; Zhu, F.; Shao, L.; Xia, G.; Bai, X. iSAID: A Large-Scale Dataset for Instance Segmentation in Aerial Images. arXiv 2019, arXiv:1905.12886.
37. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object Detection in Optical Remote Sensing Images: A Survey and A New Benchmark. arXiv 2019, arXiv:1909.00133.
38. Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open MMLab Detection Toolbox and Benchmark. arXiv 2019, arXiv:1906.07155.
39. Han, G.; Ma, J.; Huang, S.; Chen, L.; Chang, S.F. Few-Shot Object Detection with Fully Cross-Transformer. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 5311–5320.
40. Bulat, A.; Guerrero, R.; Martinez, B.; Tzimiropoulos, G. FS-DETR: Few-Shot DEtection TRansformer with prompting and without re-training. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 11759–11768.

Figure 1. Overall architecture of the proposed few-shot detector. The backbone is ResNet-101 [24], with parallel branches for the query and support images. The Enhanced Feature Fusion Module (EFFM), comprising the Multi-scale Fine-grained Support Feature Extraction Module (MFSFEM) and the Feature Allocation Fusion Mechanism (FAFM), operates before the RPN to inject class-specific features from the support into the query features, biasing proposals toward class-relevant regions and improving recall. GAP [25] denotes global average pooling, which produces a support descriptor for the target class. RoI features from the RPN are then nonlinearly fused with the corresponding support descriptor by the Non-Linear Fusion Module (NLFM), and the fused features are fed to the classification and regression heads for final detection.

Figure 2. Architecture of the Multi-scale Fine-grained Support Feature Extraction Module. Support instance denotes the preprocessed support image. C1–C5 denote feature maps from different stages of the ResNet-101 backbone. M2–M5 are intermediate representations in the FPN [27], derived from C2–C5 by 1 × 1 convolution layers and a 2× pooling layer. P2–P5 are the final output feature maps of the FPN, derived from M2–M5 by 3 × 3 convolution layers. Att layer indicates the attention layers used for fine-grained support feature extraction. At each of the scales P3–P5, the feature map serves as both key and value, while the level feature query serves as the query in the attention mechanism.

Figure 3. An illustration of the Feature Allocation Fusion Mechanism at FPN level $P_3$. The support $P_3$ feature map is flattened to form the query matrix Q, while class-wise fine-grained support vectors $p_{13}, p_{23}, \dots, p_{c3}$ are concatenated to serve as both the key and value matrices K and V in a scaled dot-product attention. The attention output is weighted by a fusion coefficient $α$, which controls the strength of support-to-query guidance.
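The mechanism in Figure 3 can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: it assumes the attention output is added back to the input features as a residual scaled by α, and the names `softmax` and `allocate_fuse`, as well as the toy shapes, are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def allocate_fuse(q_feats, class_vecs, alpha):
    """Scaled dot-product attention with K = V = class_vecs.

    q_feats:    flattened feature map, one d-dim vector per spatial location
    class_vecs: one d-dim fine-grained support vector per class
    alpha:      fusion coefficient scaling the attention output
    Assumption: the result is q + alpha * attention(q, K, V) (residual form).
    """
    d = len(class_vecs[0])
    fused = []
    for q in q_feats:
        # Similarity of this location to every class vector, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in class_vecs]
        weights = softmax(scores)
        # Weighted sum of the class vectors (the values).
        attended = [sum(w * v[j] for w, v in zip(weights, class_vecs))
                    for j in range(d)]
        fused.append([qi + alpha * ai for qi, ai in zip(q, attended)])
    return fused

# Toy example: two spatial locations, two classes, d = 2.
q_feats = [[1.0, 0.0], [0.0, 1.0]]
class_vecs = [[2.0, 0.0], [0.0, 2.0]]
fused = allocate_fuse(q_feats, class_vecs, alpha=0.6)
no_guidance = allocate_fuse(q_feats, class_vecs, alpha=0.0)
```

In this assumed form, setting `alpha=0.0` returns the input features unchanged, while larger values pull each location toward the class vector it matches best.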

Figure 4. Illustration of the Nonlinear Fusion Module. The RoI feature vector is obtained from the query branch. The support feature is extracted from the P5 feature map of the support image. These two vectors are fused to form a richer representation. The fused vector is then forwarded to the classification and regression heads for final prediction.

Figure 5. Visual comparison of our model and the baselines on the NWPU VHR-10 [35] dataset under the 10-shot setting. The novel classes include airplane, tennis court, and baseball diamond. The first row shows the ground-truth annotations. The second, third, and fourth rows display detection results from G-FSOD, P-CNN, and our proposed method, respectively.

Figure 6. Visual comparison of our model and the baselines on the iSAID dataset under the 10-shot setting. The novel classes include airplane, tennis court, harbor, small vehicle and ship. The first row shows the ground-truth annotations. The second, third, and fourth rows display detection results from G-FSOD, P-CNN, and our proposed method, respectively.

Figure 7. Visual comparison of our model and the baselines on the DIOR dataset under the 10-shot setting. The first row shows the ground-truth annotations. The second, third, and fourth rows display detection results from G-FSOD, P-CNN, and our proposed method, respectively.

Figure 8. Recall@N comparison of RPN proposals with and without the EFFM module under different shot settings on the NWPU VHR-10 [35] dataset. The curves demonstrate that integrating EFFM significantly improves the RPN’s ability to generate proposals that cover ground-truth objects, especially under low-shot regimes.

Figure 9. Visualization of the effect of EFFM on query features on the NWPU VHR-10 dataset. Each row shows the input image, the query feature before EFFM, and the query feature after EFFM, where class-conditioned cross-attention integrates support information.

Figure 10. Visualization of representative failure cases in the detection results. Red circles highlight the false negatives in each scenario.

Table 1. Category splits of NWPU VHR-10 dataset under different settings.

Split	Novel			Base
1	Airplane	Baseball diamond	Tennis court	Rest
2	Basketball court	Ground track field	Vehicle	Rest

Table 2. Category splits of iSAID dataset under different settings.

Split	Novel					Base
1	Airplane	Tennis court	Harbor	Small vehicle	Ship	Rest
2	Ship	Airplane	Storage tank	Baseball diamond	Tennis court	Rest

Table 3. Category splits of DIOR dataset under different settings.

Split	Novel					Base
1	Baseball field	Basketball court	Bridge	Chimney	Ship	Rest
2	Airplane	Airport	Expressway toll station	Harbor	Ground track field	Rest
3	Dam	Golf course	Storage tank	Tennis court	Vehicle	Rest
4	Expressway service area	Overpass	Stadium	Train station	Windmill	Rest

Table 4. Few-shot detection results under different shot settings on the NWPU VHR-10 dataset. Reported values are nAP (average precision on novel classes).

	Methods	3-Shot	5-Shot	10-Shot	20-Shot
Split1	TFA	8.80	9.49	9.26	10.83
	MetaRCNN	39.09	45.46	53.95	61.35
	PCNN	41.80	49.17	63.29	66.83
	FsDetView	24.56	29.55	31.77	32.73
	G-FSDet	49.05	56.10	71.82	75.41
	FCT	51.62	60.07	70.43	76.93
	FS-DETR	53.24	62.47	71.55	83.91
	MSFFDet	57.96	66.00	71.02	85.08
Split2	TFA	11.14	12.46	11.35	11.56
	MetaRCNN	39.32	46.10	55.90	58.37
	PCNN	39.01	40.31	45.09	46.28
	FsDetView	50.09	58.75	67.00	75.86
	G-FSDet	53.77	62.78	70.58	82.31
	FCT	57.16	62.91	73.66	83.42
	FS-DETR	58.03	63.75	76.05	83.91
	MSFFDet	61.46	68.86	81.64	87.64

The bold values indicate the best performance among models for each shot setting.

Table 5. Per-class accuracy on the NWPU VHR-10 [35] dataset, averaged over three independent experiments with random sampling. The Base column reports the mAP on base classes before novel classes are introduced. The n-shot values for base classes show the model’s performance on base classes after meta fine-tuning.

	Class/Shot	Base	3-Shot	5-Shot	10-Shot	20-Shot
Base	Airplane	89.7	90.3	95.5	97.6	98.3
	Baseball diamond	90.1	85.1	86.2	89.0	92.1
	Bridge	36.7	33.4	45.3	52.8	58.2
	Harbor	83.2	75.6	80.5	85.9	89.4
	Ship	77.5	80.5	89.8	92.3	95.5
	Storage tank	74.3	76.2	82.4	87.0	91.0
	Tennis court	73.1	70.1	77.0	84.2	87.6
Novel	Basketball court	-	55.3	66.4	78.5	83.9
	Ground track field	-	60.7	72.3	84.7	88.5
	Vehicle	-	32.4	48.1	62.4	67.9

Table 6. Few-shot detection results under different shot settings on iSAID dataset. Reported values are nAP.

	Methods	3-Shot	5-Shot	10-Shot	20-Shot
Split1	TFA	8.77	9.62	9.11	10.95
	MetaRCNN	22.42	24.12	26.28	27.02
	PCNN	30.33	31.88	32.05	32.21
	FsDetView	26.90	29.02	32.04	32.10
	G-FSDet	35.77	36.64	37.10	38.06
	FCT	39.85	40.12	40.89	41.57
	FS-DETR	41.35	41.68	42.21	42.97
	MSFFDet	42.12	42.74	43.95	44.39
Split2	TFA	11.05	12.63	12.20	13.69
	MetaRCNN	27.70	28.62	29.21	30.01
	PCNN	34.67	34.92	35.81	37.59
	FsDetView	30.66	30.12	31.58	32.41
	G-FSDet	39.41	39.04	40.91	42.05
	FCT	41.75	42.18	44.63	46.28
	FS-DETR	43.92	44.85	47.91	49.67
	MSFFDet	45.83	46.57	49.22	50.93

The bold values indicate the best performance among models for each shot setting.

Table 7. The comparison results of our proposed method with state-of-the-art few-shot object detectors on the DIOR [37] dataset. We set K = 3, 5, 10, 20 in our experiments. All results represent the average of three independent runs with random sampling.

	Methods	3-Shot	5-Shot	10-Shot	20-Shot
Split1	TFA	11.35	11.57	15.37	17.96
	MetaRCNN	15.62	18.91	22.04	24.21
	PCNN	18.00	22.80	27.60	29.60
	FsDetView	13.19	14.29	18.02	18.01
	SAGSTFS	29.3	31.60	31.60	40.20
	G-FSDet	27.57	30.52	37.64	39.83
	FCT	28.42	31.25	38.45	40.38
	FS-DETR	28.95	31.62	38.78	40.65
	MSFFDet	29.03	31.87	38.93	40.72
Split2	TFA	5.77	8.19	8.71	12.18
	MetaRCNN	10.50	11.03	13.44	18.11
	PCNN	14.50	14.90	18.90	22.80
	FsDetView	10.83	9.63	13.57	14.76
	SAGSTFS	12.60	15.50	15.50	23.80
	G-FSDet	14.13	15.84	20.70	22.69
	FCT	16.45	18.76	24.63	26.88
	FS-DETR	17.82	20.35	26.94	29.12
	MSFFDet	19.02	21.74	28.91	30.79
Split3	TFA	8.36	10.13	10.75	17.99
	MetaRCNN	12.86	14.34	17.23	23.10
	PCNN	16.50	18.80	23.30	28.80
	FsDetView	7.49	12.61	11.49	17.02
	SAGSTFS	20.90	24.80	24.80	36.10
	G-FSDet	16.03	23.25	26.24	32.05
	FCT	22.45	27.84	34.62	37.28
	FS-DETR	26.18	30.75	37.45	39.68
	MSFFDet	30.11	33.84	40.27	41.93
Split4	TFA	10.42	14.29	14.35	12.01
	MetaRCNN	13.54	15.81	16.67	18.12
	PCNN	15.20	17.50	18.90	25.70
	FsDetView	14.28	15.95	15.37	16.96
	SAGSTFS	17.50	19.70	19.70	27.70
	G-FSDet	16.74	21.03	25.84	31.78
	FCT	18.52	22.64	28.76	32.38
	FS-DETR	20.36	23.78	30.12	32.51
	MSFFDet	22.11	24.63	31.17	32.58

The bold values indicate the best performance among models for each shot setting.

Table 8. Per-class accuracy on the DIOR dataset, averaged over three independent experiments with random sampling. The Base column reports the mAP on all classes before novel classes are introduced. The n-shot values for base classes show the model’s performance on base classes after meta fine-tuning.

	Class/Shot	Base	3-Shot	5-Shot	10-Shot	20-Shot
Base	Airplane	91.21	89.30	80.90	88.21	87.63
	Airport	80.13	73.83	76.09	79.46	85.62
	Expressway toll station	79.42	80.60	74.13	81.27	75.70
	Harbor	61.22	54.81	50.86	58.63	52.18
	Ground track field	81.25	80.99	78.41	75.89	77.60
	Expressway service area	90.04	88.92	91.02	86.14	82.91
	Overpass	57.38	60.42	55.10	56.27	62.93
	Stadium	72.39	68.24	74.52	72.98	70.31
	Train station	62.84	67.91	62.18	58.33	61.24
	Windmill	90.28	91.21	89.51	90.23	92.11
	Baseball field	83.57	79.17	78.25	82.51	81.93
	Basketball court	93.51	92.34	91.24	85.03	92.48
	Bridge	47.35	35.00	43.51	41.82	49.28
	Chimney	82.97	79.96	84.74	85.92	81.13
	Ship	69.42	71.83	70.18	63.80	60.91
Novel	Dam	-	11.27	9.84	22.84	30.65
	Golf course	-	27.52	39.77	51.31	59.13
	Storage tank	-	40.71	47.91	43.82	51.27
	Tennis court	-	74.89	72.05	80.13	78.10
	Vehicle	-	17.74	20.94	29.64	37.00

Table 9. RPN Recall@N (%) comparison on NWPU VHR-10 (Split 1) under different shot settings.

Shot Setting	Method	Recall@100	Recall@300	Recall@1000
3-Shot	Baseline RPN	42.1	58.3	71.0
3-Shot	RPN + EFFM	55.7	70.4	81.6
5-Shot	Baseline RPN	47.6	63.8	75.2
5-Shot	RPN + EFFM	59.2	74.9	85.3
10-Shot	Baseline RPN	51.3	67.5	78.6
10-Shot	RPN + EFFM	63.5	77.8	88.2
20-Shot	Baseline RPN	55.2	71.1	82.3
20-Shot	RPN + EFFM	67.6	81.6	90.5

RPN recall is measured by the percentage of ground-truth boxes covered by the top-N proposals (IoU ≥ 0.5).
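The Recall@N definition in the note above can be made concrete with a short sketch. The box format `(x1, y1, x2, y2)`, the `(score, box)` proposal tuples, and the function names are assumptions for illustration; the paper does not specify its evaluation code.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def recall_at_n(gt_boxes, proposals, n, iou_thr=0.5):
    """Percentage of ground-truth boxes covered by any of the top-n proposals.

    proposals: list of (score, box); they are ranked here by descending score.
    """
    if not gt_boxes:
        return 0.0
    ranked = sorted(proposals, key=lambda p: p[0], reverse=True)
    top = [box for _, box in ranked[:n]]
    covered = sum(1 for gt in gt_boxes
                  if any(iou(gt, p) >= iou_thr for p in top))
    return 100.0 * covered / len(gt_boxes)

# Toy example: two ground-truth boxes and three scored proposals.
gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
props = [(0.9, (0, 0, 10, 9)), (0.8, (50, 50, 60, 60)), (0.7, (21, 21, 30, 30))]
r1 = recall_at_n(gt, props, n=1)  # only the first box is covered
r3 = recall_at_n(gt, props, n=3)  # both boxes are covered
```

In the toy example, the second ground-truth box is only matched by the third-ranked proposal, so recall rises from 50% at N = 1 to 100% at N = 3. This is the same effect that larger N has in Table 9.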

Table 10. RoI-level fusion ablation at 10-shot on the NWPU VHR-10 dataset. Multiplication and Subtraction denote elementwise multiplication and elementwise subtraction, respectively. Nonlinear denotes the nonlinear transformation in NLFM. Reported values are nAP.

Concatenation	Multiplication	Subtraction	Nonlinear	10-Shot
✓				75.28
✓	✓			77.75
✓	✓	✓		78.31
✓	✓	✓	✓	81.64

The checkmark indicates the incorporation of the corresponding module.
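The four ingredients ablated in Table 10 can be combined as in the sketch below. This is an illustrative assumption rather than the NLFM as implemented: it assumes the concatenation is of the RoI vector and the support descriptor, that the product and difference terms are stacked onto it, and that the nonlinear step is a single linear layer followed by ReLU. The weights are hypothetical.

```python
def relu(x):
    """Elementwise ReLU."""
    return [max(0.0, v) for v in x]

def linear(x, weight, bias):
    """y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def fuse_roi_support(roi, support, weight, bias):
    """Combine an RoI vector with a support descriptor of the same length.

    Assumed layout: [roi, support] (concatenation), roi * support
    (elementwise multiplication), roi - support (elementwise subtraction),
    all stacked and passed through linear + ReLU (the nonlinear step).
    """
    product = [r * s for r, s in zip(roi, support)]
    difference = [r - s for r, s in zip(roi, support)]
    stacked = roi + support + product + difference
    return relu(linear(stacked, weight, bias))

roi = [1.0, 2.0]
support = [0.5, 2.0]
# Hypothetical 2 x 8 weight matrix: the first output sums the product terms,
# the second output sums the difference terms.
weight = [
    [0, 0, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1],
]
bias = [0.0, -1.0]
fused = fuse_roi_support(roi, support, weight, bias)
```

The product term responds to dimensions where both vectors are active (similarity), while the difference term responds to where they disagree, which is one way to read the "semantic similarity and discriminative differences" described in the abstract.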

Table 11. Ablation study of EFFM and NLFM modules under different shot settings on NWPU VHR-10.

EFFM	NLFM	3-Shot	5-Shot	10-Shot	20-Shot
		54.32	61.05	71.03	76.50
✓		58.93	65.81	76.45	83.27
	✓	57.65	64.72	75.90	82.41
✓	✓	61.46	68.86	81.64	87.64

The checkmark indicates the incorporation of the corresponding module.
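The contribution of each module is easier to read as a gain over the baseline row. The short script below only rearranges the numbers already reported in Table 11; the dictionary keys and the helper name are illustrative choices.

```python
# nAP values copied from Table 11 (NWPU VHR-10).
shots = ["3-shot", "5-shot", "10-shot", "20-shot"]
results = {
    "baseline": [54.32, 61.05, 71.03, 76.50],
    "EFFM": [58.93, 65.81, 76.45, 83.27],
    "NLFM": [57.65, 64.72, 75.90, 82.41],
    "EFFM+NLFM": [61.46, 68.86, 81.64, 87.64],
}

def gains_over_baseline(name):
    """nAP improvement of a configuration over the baseline, per shot setting."""
    return [round(v - b, 2)
            for v, b in zip(results[name], results["baseline"])]

gains = {name: gains_over_baseline(name)
         for name in results if name != "baseline"}
for name, g in gains.items():
    print(name, dict(zip(shots, g)))
```

Each module improves nAP on its own in every column, and the combined configuration is the best in every column, with gains between 7.14 (3-shot) and 11.14 (20-shot) points.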

Table 12. Effect of the fusion hyperparameter $α$ on few-shot detection performance, with experiments conducted on the NWPU VHR-10 dataset.

$α$	3-Shot	5-Shot	10-Shot	20-Shot
0.0	52.3	60.5	67.1	75.4
0.2	56.7	64.2	70.5	81.0
0.4	58.1	66.8	72.9	84.7
0.6	61.3	69.5	76.2	87.9
0.8	60.2	67.4	74.1	85.0
1.0	58.5	65.9	72.3	83.2

Table 13. Effect of varying meta loss weight $λ$ on detection performance. Experiments are conducted on the NWPU VHR-10 dataset.

$λ$	3-Shot	5-Shot	10-Shot	20-Shot
0	54.92	60.48	72.36	78.10
0.01	58.33	64.25	76.11	82.47
0.05	61.46	68.86	81.64	87.64
0.1	59.72	66.03	80.12	86.35
0.2	56.80	63.25	78.70	84.01
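
The role of λ can be illustrated with a small sketch. It assumes the meta loss enters the objective as an additive term scaled by λ, which is a common form but is an assumption here; the loss values in the example are hypothetical. The sweep values are copied from Table 13.

```python
def total_loss(detection_loss, meta_loss, lam):
    """Assumed additive form: detection terms plus the lambda-weighted meta loss."""
    return detection_loss + lam * meta_loss

# nAP values copied from Table 13 (NWPU VHR-10), keyed by lambda.
# Columns: 3-shot, 5-shot, 10-shot, 20-shot.
sweep = {
    0.0: [54.92, 60.48, 72.36, 78.10],
    0.01: [58.33, 64.25, 76.11, 82.47],
    0.05: [61.46, 68.86, 81.64, 87.64],
    0.1: [59.72, 66.03, 80.12, 86.35],
    0.2: [56.80, 63.25, 78.70, 84.01],
}

# Pick the lambda with the highest mean nAP across the four shot settings.
best_lam = max(sweep, key=lambda lam: sum(sweep[lam]) / len(sweep[lam]))

# Hypothetical loss values, only to show how lambda scales the meta term.
example = total_loss(detection_loss=1.20, meta_loss=0.80, lam=best_lam)
```

The sweep peaks at λ = 0.05 in every column and degrades on both sides, which is consistent with the need for empirical tuning noted in the conclusions.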

Table 14. Comparison of model complexity and inference speed on RTX 4090 GPU.

Method	Params (M)	FPS	Inference Time (ms)
P-CNN	52.37	7.00	142.9
TFA	56.53	10.00	100.0
FsDetView	79.63	13.00	76.9
G-FSDet	66.39	21.00	47.6
Meta R-CNN	60.00	12.00	83.3
MSFFDet	63.50	15.50	64.5
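
The FPS and inference-time columns of Table 14 are related by time (ms) = 1000 / FPS, assuming both describe per-image throughput. The check below recomputes the latency column from the FPS column; the variable and function names are illustrative.

```python
# (FPS, reported inference time in ms) copied from Table 14.
table14 = {
    "P-CNN": (7.00, 142.9),
    "TFA": (10.00, 100.0),
    "FsDetView": (13.00, 76.9),
    "G-FSDet": (21.00, 47.6),
    "Meta R-CNN": (12.00, 83.3),
    "MSFFDet": (15.50, 64.5),
}

def latency_ms(fps):
    """Per-image latency in milliseconds for a throughput in frames per second."""
    return 1000.0 / fps

# Recompute the latency column to one decimal place and compare with the table.
derived = {name: round(latency_ms(fps), 1) for name, (fps, _) in table14.items()}
consistent = all(derived[name] == ms for name, (_, ms) in table14.items())
```

All six rows agree with the reported values to one decimal place, so the two columns carry the same information in different units.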

Table 15. Inference time of our model with varying support set size (K) on NWPU VHR-10 dataset.

Support Set Size (K-Shot)	1	3	5	10	20
Inference Time (ms)	62.1	63.5	64.8	68.5	77.2
Relative Slowdown	1.00×	1.02×	1.04×	1.10×	1.24×
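
The relative slowdown row of Table 15 is the inference time at each K divided by the time at K = 1. The sketch below reproduces that row and also derives an average cost per additional support shot, a figure that is not reported in the table and is computed here only for illustration.

```python
# Inference times in ms copied from Table 15, keyed by support set size K.
times_ms = {1: 62.1, 3: 63.5, 5: 64.8, 10: 68.5, 20: 77.2}

def relative_slowdown(times, reference_k=1):
    """Latency at each K divided by the latency at the reference K."""
    base = times[reference_k]
    return {k: round(t / base, 2) for k, t in times.items()}

slowdown = relative_slowdown(times_ms)

# Average extra milliseconds per additional support shot between K = 1 and K = 20.
ms_per_shot = (times_ms[20] - times_ms[1]) / (20 - 1)
```

Going from 1 to 20 shots adds about 15 ms in total, or roughly 0.8 ms per additional shot on average.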


Share and Cite

MDPI and ACS Style

Qi, H.; Zhao, W.; Zhang, T.; Zhou, G. MSFFDet: A Meta-Learning-Based Support-Guided Feature Fusion Detector for Few-Shot Remote Sensing Detection. Appl. Sci. 2026, 16, 917. https://doi.org/10.3390/app16020917

