Complementary Local–Global Optimization for Few-Shot Object Detection in Remote Sensing

Zhang, Yutong; Lyu, Xin; Li, Xin; Zhou, Siqi; Fang, Yiwei; Ding, Chenlong; Gao, Shengkai; Chen, Jiale

doi:10.3390/rs17132136

Open AccessArticle

Complementary Local–Global Optimization for Few-Shot Object Detection in Remote Sensing

by

Yutong Zhang

¹,

Xin Lyu

^1,2,*

,

Xin Li

¹

,

Siqi Zhou

¹,

Yiwei Fang

¹,

Chenlong Ding

¹,

Shengkai Gao

¹ and

Jiale Chen

¹

College of Computer Science and Software Engineering, Hohai University, Nanjing 210024, China

²

Key Laboratory of Water Big Data Technology of Ministry of Water Resources, Hohai University, Nanjing 211106, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(13), 2136; https://doi.org/10.3390/rs17132136

Submission received: 30 April 2025 / Revised: 5 June 2025 / Accepted: 17 June 2025 / Published: 21 June 2025

(This article belongs to the Special Issue Efficient Object Detection Based on Remote Sensing Images)

Download

Browse Figures

Versions Notes

Abstract

Few-shot object detection (FSOD) in remote sensing remains challenging due to the scarcity of annotated samples and the complex background environments in aerial images. Existing methods often struggle to capture fine-grained local features or suffer from bias during global adaptation to novel categories, leading to misclassification as background. To address these issues, we propose a framework that simultaneously enhances local feature learning and global feature adaptation. Specifically, we design an Extensible Local Feature Aggregator Module (ELFAM) that reconstructs object structures via multi-scale recursive attention aggregation. We further introduce a Self-Guided Novel Adaptation (SGNA) module that employs a teacher-student collaborative strategy to generate high-quality pseudo-labels, thereby refining the semantic feature distribution of novel categories. In addition, a Teacher-Guided Dual-Branch Head (TG-DH) is developed to supervise both classification and regression using pseudo-labels generated by the teacher model to further stabilize and enhance the semantic features of novel classes. Extensive experiments on DlOR and iSAlD datasets demonstrate that our method achieves superior performance compared to existing state-of-the-art FSOD approaches and simultaneously validate the effectiveness of all proposed components.

Keywords:

few-shot learning; object detection; pseudo-label; semi-supervised learning

1. Introduction

Object detection plays a crucial role in remote sensing applications [1,2,3,4], including land-use monitoring, maritime vessel identification, forest fire early warning, and disaster assessment [5,6,7]. However, conventional object detection methods typically rely on large-scale annotated datasets, which are costly to obtain for novel or rare objects in remote sensing imagery [8,9]. Moreover, certain categories may have extremely limited sample availability, making it difficult for models to learn and generalize effectively. Consequently, there has been a growing interest in few-shot object detection (FSOD) within the remote sensing community in recent years [10,11].

FSOD aims to enhance object detection performance while reducing dependence on large-scale annotations [12,13]. The key idea is to integrate object detection [14,15] with few-shot learning [16,17,18,19,20], enabling models to rapidly adapt to new categories using only a few labeled examples. Most FSOD approaches adopt a two-stage learning paradigm: first, pretraining on base classes to acquire generalized features, followed by fine-tuning on novel classes with limited data to improve recognition capability [21,22]. Based on different learning strategies, FSOD can be broadly classified into meta-learning [23,24,25] and fine-tuning approaches [26,27,28]. Meta-learning enhances generalization through task-level optimization, whereas fine-tuning adjusts only specific parameters of the detector while keeping the pretrained backbone fixed to refine detection performance. FSOD not only improves detection accuracy under data-scarce conditions but also adapts effectively to multi-sensor data, complex backgrounds, and multi-scale objects, making it a key research direction in remote sensing object detection.

Despite advancements, existing FSOD methods still face significant challenges in adapting to remote sensing images [25,28]. The primary issue lies in their inability to simultaneously learn both fine-grained local features and global structural representations. Due to the limited training samples in FSOD, novel category instances often cover only partial object regions, preventing the model from reconstructing a complete object shape during inference [24,25,26]. For instance, as illustrated in Figure 1, training samples of a novel category in remote sensing imagery may only include the tail of an aircraft or part of a ship, while at the testing stage, the model is required to recognize the entire vessel or complete building. However, without complete object representations in training, the model tends to focus only on the seen regions, neglecting unseen parts, which leads to incomplete object detection.

Moreover, current FSOD methods suffer from unstable feature adaptation in the global feature space, causing novel categories to be misclassified as background. Given the scarcity of FSOD training samples, the model struggles to acquire sufficient feature support for novel categories, leading to a distribution mismatch between base and novel categories in the global feature space. This discrepancy makes it challenging to differentiate novel objects from the background, ultimately reducing detection performance. Additionally, as shown in the bottom part of Figure 1, FSOD training data often contain missing annotations or noisy bounding boxes. These issues distort the semantic supervision signal and lead to feature drift in the global feature space, where novel class instances shift closer to the background or base classes. This misalignment increases the difficulty of distinguishing novel objects, ultimately degrading detection performance.

To overcome these challenges, we propose a novel FSOD framework that jointly enhances local feature learning and global feature adaptation to address both local feature incompleteness and global adaptation instability in novel category detection. By simultaneously focusing on local feature enhancement and global feature optimization, our framework enables a more comprehensive understanding of novel categories under limited data conditions. Specifically, our framework incorporates two key components: Extensible Local Feature Aggregator Module (ELFAM) for optimized local feature learning and Self-Guided Novel Adaptation (SGNA) for improved global feature adaptation. First, ELFAM enhances local feature learning through a local feature aggregation mechanism, dynamically modeling relationships among different local regions. This allows the model to capture more latent details even with limited samples. Additionally, ELFAM leverages co-existing relationships within base classes to automatically complete missing local information during training, thereby improving the structural integrity of novel objects and ensuring a more complete object representation during inference. Next, we introduce SGNA, which enhances the global feature adaptation capability of novel categories. SGNA utilizes a teacher-student collaborative optimization strategy, where pseudo novel proposals generated by the teacher model improve GT quality and refine the feature distribution of novel categories, ensuring more stable adaptation in the global feature space. Furthermore, we incorporate the Teacher-Guided Dual-Branch Head (TG-DH), which leverages pseudo novel labels provided by the teacher model to optimize both object classification and bounding box regression losses, thereby enhancing detection accuracy and generalization ability for novel categories. By jointly optimizing local and global feature learning, our method enables the model to capture fine-grained details of novel categories while ensuring stable adaptation in the global feature space. As a result, our framework significantly improves few-shot object detection performance in remote sensing tasks. We conducted extensive experiments on multiple publicly available remote sensing datasets. Extensive evaluations on public remote sensing benchmarks confirm the effectiveness of our approach under various few-shot settings, particularly in detecting novel categories with limited supervision. These results highlight the core innovations behind our framework, which are summarized as follows:

We propose a novel framework that simultaneously addresses structural incompleteness and feature misalignment in few-shot object detection for remote sensing imagery by enhancing local structure modeling and stabilizing global semantic adaptation.
The framework incorporates three innovative modules designed to enhance feature learning synergy: (1) an Extensible Local Feature Aggregator Module, which facilitates fine-grained structural representation through dynamic multi-scale feature aggregation; (2) a Self-Guided Novel Adaptation module, which enhances global feature alignment via teacher-student collaborative optimization; and (3) a Teacher-Guided Dual-Branch Head, which decouples base and novel class adaptation using stable pseudo-supervision to improve generalization.
We conduct extensive experiments on DIOR and iSAID, evaluating our proposed method under various few-shot settings. The results demonstrate that our method outperforms state-of-the-art approaches on evaluation metrics and achieves notable performance improvements.

The rest of this paper is organized as follows: Section 2 reviews the related work, with a focus on few-shot object detection (FSOD) and its application to remote sensing. Section 3 introduces our proposed framework, elaborating on the ELFAM and SGNA modules for enhancing local and global feature representations. Section 4 presents the experimental results, including ablation studies and comparisons on the DIOR and iSAID datasets. Finally, Section 5 concludes the paper by summarizing the key contributions and discussing future directions.

2. Related Works

2.1. Object Detection: From CNNs to Transformers

As a core task in computer vision, object detection has evolved from traditional handcrafted feature extraction to deep learning-driven approaches. Current mainstream methods primarily include CNN-based detectors and Transformer-based detectors, each optimized to improve detection accuracy, speed, and adaptability according to different requirements. CNN-based object detectors [29] are typically categorized into two-stage and single-stage paradigms. In the two-stage pipeline, typified by Faster R-CNN [30], a Region Proposal Network (RPN) first identifies candidate regions, which are then refined through subsequent classification and bounding box regression. These approaches achieve high detection accuracy but incur high computational costs. In contrast, single-stage methods, such as YOLO [6,31,32] and SSD [33], directly regress object locations and classify them simultaneously, improving detection efficiency. However, these methods have limitations in small-object detection and dense scenes. Moreover, their generalization ability is weak under limited data conditions, making it difficult to handle novel categories and complex environments effectively. The introduction of Transformers has reshaped the framework of object detection. Detection Transformer (DETR) [34] models object relationships through a self-attention mechanism, eliminating the need for handcrafted RPN and NMS, thereby achieving end-to-end detection. Subsequent methods have further improved its efficiency: deformable DETR reduces computational cost and accelerates convergence by introducing a deformable attention mechanism. DINO further optimizes query strategies and training methods, enhancing detection performance [35]. Meanwhile, detection methods incorporating textual information, such as GLIP [36] and GroundingDINO [37], leverage vision-language alignment strategies to improve open-set detection, enabling models to recognize previously unseen categories. However, despite progress in reducing handcrafted components and enhancing adaptability, existing methods still face significant challenges when encountering data distribution shifts or insufficient annotations. Addressing how to improve the generalization ability of detectors under limited labeled data, reduce reliance on large-scale datasets, and enhance adaptability to new categories and complex scenes remains an urgent issue.

2.2. Few-Shot Object Detection

Despite improvements in detection accuracy and generalization, most object detection methods remain heavily dependent on large-scale annotated datasets, making it difficult to maintain performance in low-data scenarios. Few-shot object detection (FSOD) seeks to address this challenge by enabling models to recognize novel categories with only a few labeled examples. To achieve this, researchers have explored approaches based on fine-tuning, meta-learning, and metric learning, each offering different strategies to improve model adaptability. Fine-tuning-based methods first train detectors on a large set of base classes before fine-tuning them on a small number of novel class samples. While early approaches such as TFA [26] show that modifying only the classification head of Faster R-CNN can yield strong performance, later research, including FSCE [27] and SRR-FSD [38], integrates contrastive learning and semantic modeling to reduce category confusion. However, direct fine-tuning often leads to catastrophic forgetting, where knowledge of base categories deteriorates. To mitigate this, methods like Retentive R-CNN [39] introduce consistency constraints to preserve previously learned representations. Despite these improvements, fine-tuning remains dependent on pre-trained features and requires significant training time for adaptation. To bypass the limitations of fine-tuning, meta-learning methods aim to train models capable of rapid adaptation to new categories. Meta R-CNN [40], for example, incorporates a meta-learning mechanism at the ROI feature level to enhance generalization, while approaches like APANet [41] leverage cross-attention modules to improve classification stability. Some studies further draw insights from few-shot semantic segmentation [42], optimizing feature fusion strategies to refine detection accuracy. Although meta-learning reduces reliance on extensive re-training, it remains susceptible to feature confusion, particularly in scenarios with visually similar categories. Metric learning offers an alternative by constructing feature spaces where similar categories are clustered together while different categories remain well-separated. Methods such as RepMet [43], which employs Gaussian mixture models for feature clustering, and attention-based RPNs, which refine support sample utilization, improve detection robustness without requiring additional fine-tuning. However, the performance of these approaches is still influenced by the quality of support samples and the expressiveness of learned features. While FSOD has made progress in reducing dependence on large-scale annotations, challenges remain in terms of generalization, training efficiency, and inter-class distinction. Future research may focus on integrating self-supervised learning and cross-modal information fusion to further enhance model adaptability [44], reduce reliance on labeled data, and improve detection performance in complex real-world environments.

2.3. Few-Shot Object Detection for Remote Sensing Images

Compared with natural images, remote sensing data exhibit distinct characteristics, including substantial cross-sensor and cross-platform modality diversity, significant variations in object scales, highly complex backgrounds, and high annotation costs. These factors collectively impose severe challenges for few-shot object detection (FSOD) methods, which often rely on pre-trained models that struggle to generalize under such conditions. Fine-tuning-based FSOD approaches frequently encounter domain shifts and suffer from catastrophic forgetting, where performance improvements on novel classes inadvertently compromise detection accuracy on base classes [45,46]. To address these issues, prior studies have explored various strategies, such as regularization techniques, constraint-based learning frameworks, and optimization strategies like meta-learning and metric learning. Despite these efforts, challenges such as class imbalance, multi-modal inconsistency, and resolution variability inherent in remote sensing datasets continue to hinder detection performance. Recent advances incorporating large-scale vision-language models and Transformer-based architectures have shown potential in enhancing feature extraction and improving adaptability to complex remote sensing scenarios [47]. Nonetheless, problems related to noisy data modalities, object scale extremes, and cross-domain knowledge transfer persist. Further research is expected to deepen the integration of multi-modal fusion, self-supervised learning, and cross-domain adaptation, thereby reducing dependence on extensive annotations and enhancing the robustness and generalization capability of FSOD models in diverse remote sensing environments.

3. Method

In this section, we first provide an overview of the proposed framework in Section 3.1. Then, we introduce the formulation of FSOD in Section 3.2. Next, we present our proposed Extensible Local Feature Aggregator Module (ELFAM) to enhance local feature learning in Section 3.3. Subsequently, we describe our Self-Guided Novel Adaptation (SGNA) for optimizing global feature adaptation in Section 3.4. Finally, we analyze the overall model training and optimization process in Section 3.5.

3.1. Overview

The overall architecture of our few-shot object detection framework is illustrated in Figure 2. The framework follows a two-stage design, consisting of a base training stage for learning transferable representations from base categories and a fine-tuning stage for adapting the model to novel categories with limited labeled samples. In both stages, the Extensible Local Feature Aggregator Module (ELFAM) is employed to enhance the completeness of local feature representations by recursively aggregating multi-scale contextual information. During fine-tuning, a Self-Guided Novel Adaptation (SGNA) RPN is introduced to generate high-quality pseudo proposals via a teacher-student paradigm, thereby mitigating feature drift in the global semantic space. In parallel, a Teacher-Guided Dual-Branch Detection Head (TG-Dual BBox Head) supervises classification and regression by fusing pseudo and ground truth labels, enhancing detection stability and generalization on novel classes.

3.2. Problem Formulation

FSOD aims to reduce the dependency of conventional detectors on large-scale annotations by facilitating the recognition of novel categories from only a handful of labeled examples. Specifically, given a dataset

D = {(x, y) | x \in X, y \in Y}

, where x represents the input image and

y = {(c_{i}, b_{i})}_{i = 1}^{K}

contains the class label c and corresponding bounding box b, the FSOD task divides the dataset into base classes

(C_{b})

and novel classes

(C_{n})

, which correspond to the base dataset

(D_{b})

and the novel dataset

(D_{n})

, respectively. These two sets are non-overlapping, meaning

C_{b} \cup C_{n} = C

, and novel class objects remain unseen during the base training stage.

In FSOD tasks, base classes typically have abundant labeled data, whereas novel classes contain only a few annotated samples, often ranging from 1 to 10 instances per class. Due to this extreme data imbalance, models are prone to overfitting when learning novel classes, leading to reduced generalization capability. This tendency primarily arises because the limited number of annotated instances in novel classes constrains the model’s ability to capture intra-class variations, which often leads the detector to memorize the few available samples instead of learning generalizable features. The problem is further compounded by the strong inductive bias inherited from abundant base class training, which may dominate the adaptation process during fine-tuning. Moreover, novel class samples may only cover partial regions of an object, resulting in incomplete object shape information. This makes it challenging for the model to accurately recognize the entire structure of novel objects during inference. To address these challenges, most FSOD methods follow a two-stage fine-tuning paradigm [26,27,28,39], where a general object detection model is first pretrained on the base dataset and subsequently fine-tuned on the novel dataset with limited samples. The training pipeline can be formulated as follows:

\begin{matrix} M init \overset{D_{b}}{\to} M base \overset{D_{n}}{\to} M_{fsod}, \end{matrix}

(1)

where

M init

represents the initialized object detection model,

M base

is the detector trained on base classes, and

M_{fsod}

is the final FSOD model fine-tuned on novel categories. During the base training stage, the model learns generalizable object features from

D_{b}

and acquires strong detection capabilities. In the fine-tuning stage, the model learns novel categories from

D_{n}

, but due to the extremely limited data, it must leverage the knowledge from base categories to enhance the accuracy and generalization ability for novel class detection.

3.3. Extensible Local Feature Aggregator Module

To enhance local feature representation in few-shot object detection (FSOD), we propose the Extensible Local Feature Aggregator Module (ELFAM), which dynamically aggregates local feature information and progressively expands to construct a complete object representation. ELFAM operates on multi-scale feature maps and optimizes target features through an adaptive local feature aggregation mechanism.

(a) Multi-Scale Feature Expansion via Attention Aggregation

Given an input image I, the feature extractor generates a set of multi-scale feature maps

F = {F_{l}}_{l = 1}^{L}

, where each

F_{l} \in R^{H_{l} \times W_{l} \times C}

represents the feature map at level l. As shown in Figure 3, ELFAM progressively expands the attention region, starting from a small set of keypoints and gradually expanding to cover the entire target region.

Unlike conventional attention mechanisms that operate within a single feature scale, ELFAM performs recursive and progressive attention expansion across multiple scales, guided by local context cues and cross-scale feature dependencies. This design enables the model to iteratively grow attention regions from semantically informative keypoints and adaptively integrate multi-scale information, thereby constructing a complete and structurally coherent object representation under few-shot conditions. Specifically, at the lowest scale

F_{1}

, an initial set of keypoints

S_{0}

is selected, and ELFAM recursively expands the attention field from these local regions in various directions. The attention computation at each level depends on the previously captured regions:

\begin{matrix} S_{t} = S_{t - 1} \cup A (S_{t - 1}, F_{t}), \forall t \in {1, \dots, L}, \end{matrix}

(2)

where

A (S_{t - 1}, F_{t})

represents a function that expands the attention region at scale t based on the previous attention field

S_{t - 1}

. At each layer, the expansion direction is adaptive, dynamically adjusting based on object geometry, local feature distribution, and historical attention weights. This ensures that the model captures both the coarse outline at lower resolutions and finer details at higher resolutions.

To further describe the multi-scale adaptive attention expansion, we define the attention expansion operation

E A (x_{q}, l)

. Unlike conventional detectors that rely on fixed or heuristic keypoint proposals, ELFAM treats the initial local keypoints as learnable query positions on the multi-scale feature maps. These keypoints are jointly optimized during training through standard detection losses, including classification and regression supervision, enabling the network to discover semantically meaningful anchors for attention propagation. This learnable design allows the model to automatically refine keypoint locations, thereby mitigating the risk of error accumulation caused by suboptimal initialization. We enhance the attention expansion process by incorporating learnable spatial offsets

Δ p_{q}^{n}

[34,48], allowing each attention direction to flexibly shift the query position in the feature space. This enables the model to attend to semantically meaningful locations beyond the rigid neighborhood and improves its ability to recover complete object structures from partial observations. The updated attention operation is formulated as:

\begin{matrix} E A (x_{q}, l) & = \sum_{n = 1}^{N_{l}} \frac{\exp (((x_{q} + Δ p_{q}^{n}) W_{k}^{n}) {((x_{q} + Δ p_{q}^{n}) W_{k}^{n})}^{T})}{\sum_{m = 1}^{N_{l}} \exp (((x_{q} + Δ p_{q}^{m}) W_{k}^{m}) {((x_{q} + Δ p_{q}^{m}) W_{k}^{m})}^{T})} (x_{q} + Δ p_{q}^{n}) W_{v}^{n} \\ + λ \sum_{j \in N (l)} α_{l j} E A (x_{q}, j) \end{matrix}

(3)

where

x_{q}

represents the query feature point at level l, serving as the initial reference for attention expansion.

Δ p_{q}^{n}

denotes the learnable spatial offset at direction n, allowing the attention to dynamically adjust its sampling position in the feature space.

W_{k}^{n}

and

W_{v}^{n}

denote the key and value transformations in the attention layer, which are applied after spatial shifting.

N_{l}

defines the number of directions used at each scale, ensuring flexible local expansion.

N (l)

refers to neighboring scales providing additional contextual support, and

α_{l j}

is a learnable cross-scale attention weight controlling the contribution from adjacent levels. Finally,

λ

is a balancing factor regulating the transition between local expansion and global feature integration.

(b) Local Feature Aggregation Mechanism

After expanding the attention region across multiple scales, the local feature aggregation mechanism ensures effective fusion of key features, leading to a complete object representation. The aggregated feature at each level is computed as:

\begin{matrix} F_{t}^{'} = ω_{t} F_{t} + \sum_{j \in N (t)} α_{t j} F_{j}, \end{matrix}

(4)

where

ω_{t}

is a learnable weight for scale t,

N (t)

denotes neighboring scales, and

α_{t j}

is an adaptive weight controlling the contribution of scale j to scale t. This aggregation method ensures that local information across scales is effectively combined, enhancing the completeness of novel object representations.

The final object representation is obtained by aggregating features across all scales:

\begin{matrix} F^{*} = \sum_{t = 1}^{L} β_{t} F_{t}^{'}, \end{matrix}

(5)

where

β_{t}

is a learnable coefficient that adjusts the contribution of each scale. This multi-scale fusion approach effectively captures fine-grained information while preserving the complete object structure, enabling the model to balance local detail expression and global target consistency in few-shot object detection tasks. This enhancement significantly improves detection performance, particularly for remote sensing images with varying scales and orientations.

3.4. Self-Guided Novel Adaptation RPN

To address the challenge of insufficient global feature representation for novel classes in few-shot object detection (FSOD), we propose a Self-Guided Novel Adaptation (SGNA) mechanism. SGNA employs a teacher–student collaborative framework with pseudo-label supervision to dynamically enhance the global distribution and stability of novel class features. As shown in Figure 4, the SGNA module consists of three proposal generation branches: a fixed Base RPN, a trainable Student RPN, and a Teacher RPN updated via Exponential Moving Average (EMA). The Base RPN preserves knowledge learned from base classes, the Student RPN is optimized under label supervision, and the Teacher RPN provides stable pseudo-labels to assist learning.

During each training iteration, the Teacher RPN receives feature maps and outputs a set of region proposals

P_{t e a} = {p_{i}}_{i = 1}^{N}

, where each proposal

p_{i}

is associated with a confidence score

s_{i} \in [0, 1]

. High-quality pseudo novel proposals are selected by thresholding the scores using a predefined threshold

τ

, forming the pseudo-label set:

\begin{matrix} P_{p s e u d o} = {p_{i} ∣ s_{i} \geq τ, p_{i} \in P_{t e a}} \end{matrix}

(6)

These pseudo proposals are then combined with the original ground truth annotations of novel classes

G_{n o v e l}

, resulting in an enhanced supervision label set:

\begin{matrix} \hat{G} = G_{n o v e l} \cup P_{p s e u d o} \end{matrix}

(7)

The fused label set

\hat{G}

is used to supervise the Student RPN. Meanwhile, the Base RPN continues to be supervised using the original ground truth annotations from base classes

G_{b a s e}

, in order to maintain detection capability on previously learned categories. The total detection loss is computed on both student and base predictions, including classification and bounding box regression losses:

\begin{matrix} L_{S G N A} & = L_{c l s} (R_{s t u}, \hat{G}) + L_{r e g} (R_{s t u}, \hat{G}) \\ + L_{c l s} (R_{b a s e}, G_{b a s e}) + L_{r e g} (R_{b a s e}, G_{b a s e}) \end{matrix}

(8)

where

L c l s

is the cross-entropy loss and

L r e g

is the smooth L1 loss. This composite loss formulation ensures that the student branch benefits from enhanced pseudo-labeled supervision while the base branch retains stable learning of prior categories, achieving a balanced adaptation in the global feature space. This pseudo-label generation, label fusion, and supervision pipeline are explicitly visualized as three orange paths in our framework diagram, which denote the backward propagation of gradients from pseudo-labels to student parameters. This mechanism enables the model to leverage both annotated and pseudo-labeled data to improve its understanding of novel object structures and enhance global feature consistency.

To ensure the stability of pseudo-label generation, the parameters of the Teacher RPN, denoted as

θ_{t e a}

, are updated via an Exponential Moving Average of the Student RPN’s parameters

θ_{s t u}

:

\begin{matrix} θ_{t e a} \leftarrow μ \cdot θ_{t e a} + (1 - μ) \cdot θ_{s t u} \end{matrix}

(9)

where

μ \in [0, 1]

is the smoothing coefficient. Finally, the region proposals from the Student RPN and the Base RPN are merged to form the final proposal set for downstream classification and regression. By integrating stable structure, supervision enhancement, and self-guided optimization, SGNA effectively improves the global representation of novel classes and significantly boosts detection accuracy and generalization in FSOD for remote sensing tasks.

The training process of SGNA can be regarded as a self-guided pseudo-label optimization pipeline, where the teacher branch identifies potential novel object regions through high-confidence predictions to enrich and complete the supervisory signals. Meanwhile, the student branch is updated under this enhanced supervision and, in turn, refines the teacher parameters via EMA updates, enabling the model to progressively extract meaningful structural features from noisy pseudo-labels. This mechanism forms a closed-loop process that integrates pseudo-label generation, enhanced label construction, and supervised backpropagation, allowing the model to gradually narrow the distribution gap between base and novel categories and improve the separability and stability of novel features in the global semantic space. As illustrated in Figure 4, the three orange arrows depict the supervisory flow from pseudo-labels to enhanced supervision and finally to model optimization.

By introducing the SGNA module, the model maintains structural integrity and distributional stability of novel category features under extremely low annotation conditions. This effectively mitigates adaptation drift caused by data scarcity and provides more discriminative and generalizable feature representations for novel object detection.

3.5. Teacher-Guided Dual-Branch Head

While SGNA enhances the quality of region proposals for novel categories, the detection head still struggles to model complete object structures under extremely limited supervision. To further refine novel object representations and enhance structural consistency, we propose the Teacher-Guided Dual-Branch Head (TG-DH), which leverages stable predictions from a teacher branch to construct high-quality supervision for the student detection head.

As illustrated in Figure 5, TG-DH takes the Regions of Interest (RoIs) generated by SGNA RPN and RoI Align, and feeds them into two parallel detection heads with identical architectures: a Student head and a Teacher head. The Teacher head produces refined pseudo-labels from stabilized predictions, which are then fused with original ground truth annotations to form an enhanced label set

\hat{G} = G n o v e l \cup P p s e u d o

. This enriched supervision set provides more complete and discriminative training signals for the Student head.

The Teacher head is updated via an Exponential Moving Average (EMA) of the Student head to maintain temporal stability:

\begin{matrix} θ_{t e a}^{h e a d} \leftarrow μ \cdot θ_{t e a}^{h e a d} + (1 - μ) \cdot θ_{s t u}^{h e a d} \end{matrix}

(10)

where

μ \in [0, 1)

is the EMA momentum. The final loss is computed on the Student head using the fused label set

\hat{G}

, combining classification and regression losses:

\begin{matrix} L_{T G - D H} = L_{c l s}^{s t u} (\hat{G}) + L_{r e g}^{s t u} (\hat{G}) \end{matrix}

(11)

By continuously updating the Teacher head with stabilized knowledge and using its predictions to refine supervision, TG-DH progressively enhances the structural completeness and localization accuracy of novel objects. Together with SGNA, it forms a unified pipeline that transitions from proposal-level adaptation to RoI-level fine-grained representation learning. When used in conjunction with SGNA, TG-DH further refines RoI-level representations based on improved region proposals, forming a progressive optimization path from coarse adaptation to fine-grained detection. This joint design provides a complete solution that improves both structural fidelity and localization robustness for novel categories under low-data regimes.

4. Experiments

This section reports detection results and evaluation insights obtained on three benchmark remote sensing datasets under different few-shot settings. We analyze the detection performance, compare against state-of-the-art methods, and provide ablation studies to validate the effectiveness of each component in our framework.

4.1. Datasets and Evaluation Protocol

We conduct evaluations on two widely used remote sensing datasets DIOR [49] and iSAID [50] to assess the effectiveness of our method under diverse few-shot detection scenarios. These datasets differ significantly in terms of spatial resolution, object density, scene complexity, and category granularity, thus providing comprehensive coverage of real-world remote sensing environments.

DIOR contains 23,463 optical remote sensing images with 20 annotated object categories, including both human-made and natural targets (e.g., airplanes, ships, bridges, and storage tanks). The images have varying spatial resolutions ranging from 0.5 m to 30 m and exhibit complex backgrounds. We adopt two standard few-shot settings: the first uses 5 manually selected novel categories (e.g., airplane, baseball field) with the remaining as base classes [51], while the second introduces four random base-novel splits to evaluate generalization robustness across different category combinations [52].

iSAID consists of 2806 large-scale satellite images with 655,451 densely annotated object instances across 15 categories. The dataset includes fine-grained objects such as small vehicles, ships, and storage tanks in densely populated urban and port areas. The original images are divided into

800 \times 800

tiles with 25% overlap, balancing memory efficiency and spatial continuity. We follow the standard FSOD setting by adopting three base-novel class splits and evaluating under 10-shot, 50-shot, and 100-shot configurations [46].

Both datasets are evaluated under a two-phase FSOD setting, where the model is first trained on base classes and then adapted to novel categories with limited samples. Performance is evaluated using Average Precision (AP), computed as the area under the precision–recall (PR) curve. Precision and recall are calculated across confidence thresholds by matching predicted boxes to ground truth boxes with an Intersection over Union (IoU) threshold of 0.5.

4.2. Experiment Setting and Implementation Details

Our framework is built upon the Faster R-CNN [5] detector with a ResNet-101 [53] backbone pretrained on ImageNet. To extract hierarchical and multi-resolution features, we incorporate a Feature Pyramid Network (FPN) [54], enabling effective object representation across different spatial scales. The proposed Extensible Local Feature Aggregator Module is integrated on top of the multi-scale feature maps and comprises four stacked Transformer encoder layers. Each encoder layer utilizes a 256-dimensional embedding and performs attention aggregation from recursively expanded keypoint regions. These keypoints are implemented as learnable query positions, which are adaptively optimized during training through standard detection loss supervision.

For region proposal generation during the fine-tuning stage, the Self-Guided Novel Adaptation RPN adopts a dual-branch structure consisting of two identical convolutional subnetworks. These branches serve as the student and teacher, respectively, and are trained under a knowledge distillation scheme. The teacher branch is updated using an exponential moving average (EMA) of the student weights and provides pseudo-labels for the novel class proposals based on high-confidence predictions. Additionally, the Teacher-Guided Dual-Branch Head also follows a student–teacher configuration. Both branches independently process the ROI-aligned features and consist of a fully connected classification head and a bounding box regression head. Pseudo-labels from the teacher branch are fused with available ground truth labels to supervise the student branch, enhancing novel class adaptation while preserving base class knowledge. Optimization is performed using the AdamW optimizer, with an initial learning rate of

1 \times 10^{- 4}

, a weight decay coefficient of 0.01, and a step-wise decay schedule that reduces the learning rate by a factor of 0.1 at predefined iteration milestones.

Training follows a two-stage FSOD pipeline consisting of base training and few-shot fine-tuning. For the base stage, models are trained on DIOR and iSAID for 10 k, 40 k, and 80 k iterations, respectively, with learning rate decays scheduled at 24 k and 32 k iterations on DIOR, and at 40 k and 60 k on iSAID. The fine-tuning phase is conducted for 10 k iterations across both datasets. All input images are uniformly resized to

608 \times 608

pixels, and augmented through multi-scale resizing (scale factors in [0.5, 2.0]), random horizontal flipping, and fixed-angle rotations (90°, 180°, 270°) to improve generalization. Batch sizes are configured as 16 during base training and 8 during fine-tuning. To stabilize pseudo-label generation in SGNA, we apply an Exponential Moving Average (EMA) update [55] with momentum coefficient

α = 0.999

to maintain a temporally smoothed teacher model. During pseudo-label filtering, only candidate boxes with confidence scores above a threshold of

τ = 0.8

are retained to supervise the student branch.

Our implementation is built upon PyTorch version 1.13.0, and leverages EarthNets [56] and MMDetection toolkits [57]. Additional details and code will be made publicly available for reproducibility.

4.3. Quantitative Results

We evaluate the performance of our method on the DIOR and iSAID datasets, as summarized in Table 1, Table 2 and Table 3. The results show that our method shows superior performance across different few-shot settings, particularly for novel classes.

As shown in Table 1, our method achieves superior performance on the DIOR dataset, surpassing all baselines under the 3-, 5-, 10-, and 20-shot settings. For example, at 20-shot, it reaches 65.2% AP on novel classes, which is 3.9 points higher than the best-performing baseline ST-FSOD. Table 2 further demonstrates consistent superiority across different base-novel class splits. Under the challenging 3-shot and 5-shot settings, our method achieves significant gains, obtaining 32.1 AP on split3 with 3-shot and 34.6 AP with 5-shot.

On the iSAID dataset (Table 3), our approach delivers clear improvements under all shots. Compared to ST-FSOD, it achieves average gains of 5.5 AP under 10-shot, 4.6 AP under 50-shot, and 1.9 AP under 100-shot, reflecting strong generalization capability in dense and complex aerial scenes. While our method consistently outperforms existing baselines on novel classes across all few-shot settings, we also observe a slight decrease in base class performance compared to ST-FSOD under certain configurations. This phenomenon reflects a common challenge in few-shot transfer learning: adapting to novel categories with limited supervision can perturb the learned feature space, leading to marginal degradation of base class representations. In our case, the use of pseudo-labels for novel classes may introduce subtle shifts in feature distributions that interfere with base class stability. Despite this, our method maintains a favorable balance between generalization and retention, ultimately achieving superior overall detection accuracy.

These results validate the superiority of our method compared to existing few-shot detection approaches, particularly under scenarios with incomplete annotations and limited supervision. The improvements across different shot settings can be attributed to the joint enhancement of local structural modeling and global semantic adaptation in our framework. Specifically, the incorporation of multi-scale recursive attention aggregation enables more complete object reconstruction from sparse samples, while the pseudo-label-driven feature alignment strategy mitigates feature drift and improves novel-category discrimination. This comprehensive design allows our method to achieve better generalization and higher detection accuracy across diverse novel classes in remote sensing imagery.

4.4. Qualitative Results

We present qualitative comparisons of detection results on the DIOR and iSAID datasets, as illustrated in Figure 6 and Figure 7. These visualizations provide clear evidence of the effectiveness of the proposed method:

Our method improves structural completeness by recursively expanding attention from local keypoints across multiple feature scales, allowing the model to reconstruct full object shapes even when training data provides only partial visual cues. In the third column of the novel class in Figure 6, the airplane is often misclassified as background by standard few-shot detectors due to limited supervision, whereas our method successfully recovers its complete structure with precise localization. Compared to traditional detectors, which often produce fragmented or partial boxes, our ELFAM module progressively expands attention from local keypoints across scales, enabling complete and coherent object representations.
It enhances object recall in challenging contexts such as dense urban areas and mountainous regions by recovering small or ambiguous targets that are frequently missed by conventional detectors. The fourth and fifth columns of the novel class of Figure 6, show dense urban and mountainous scenes where conventional detectors miss numerous small or background-blended targets. Our method recovers more valid objects such as buildings and industrial facilities, indicating strong robustness to visual sparsity and background interference.
Our method maintains stable detection performance for base classes even after fine-tuning novel categories. As shown in Figure 6 and Figure 7, it still achieves high-quality detection for base class targets. This confirms that our candidate separation strategy effectively mitigates negative transfer during fine-tuning, enabling the model to adapt to novel categories without compromising base class performance.

4.5. Ablation Studies

An ablation study is performed on the first split [51] of the DIOR dataset to assess the individual contribution of each component within the proposed framework to overall detection performance. The quantitative results are summarized in Table 4, while the qualitative comparisons are illustrated in Figure 8 and Figure 9. Based on these observations, we summarize the following conclusions:

When using only the basic fine-tuning strategy, the model suffers from insufficient supervision for novel classes, leading to low recall and frequent missed detections. As observed in Figure 8 (second row), novel class instances like windmill and trainstation are often missed or partially detected. Meanwhile, base class performance remains relatively stable but still fluctuates due to interference from unstable novel class adaptation.
With the addition of the ELFAM and SGNA modules, the model achieves significant improvements in structural completeness and semantic consistency. ELFAM enhances the ability to reconstruct full object shapes by aggregating local features across scales, which is especially beneficial for partially annotated or small targets. SGNA introduces high-confidence pseudo-labels and teacher–student optimization, helping to suppress background confusion and improve feature alignment. As shown in the middle rows of Figure 8, both recall and localization quality improve notably.
As shown in the fifth row of Table 4, applying the TG-DH module individually yields notable improvements in detection performance on novel classes, attributed to the enhanced supervision provided by the teacher-guided pseudo-labels. However, this improvement is accompanied by a considerable decline in base class accuracy, suggesting that directly reinforcing novel class feature representations may disrupt the original feature distributions learned for base classes, leading to negative transfer. In contrast, the sixth row of Table 4 indicates that incorporating SGNA alongside TG-DH substantially alleviates this issue by promoting more stable global feature adaptation across both base and novel categories. These results demonstrate the complementary effects of SGNA and TG-DH in promoting global semantic consistency and improving feature adaptation under few-shot object detection settings.
When all modules are integrated, the model produces the most accurate and complete detections across novel categories, while maintaining stable base class performance. As demonstrated in the final row of Figure 8, predictions exhibit clearer boundaries, fewer false positives, and stronger generalization across diverse object types, validating the effectiveness and complementarity of the proposed components.
To provide further insights into the detection behavior, Figure 9 presents the class-wise precision–recall curves on DIOR and iSAID datasets. On DIOR, high-AP categories such as airplane and tenniscourt exhibit sharp and stable precision–recall profiles, indicating strong discriminative feature learning. In contrast, more challenging categories like trainstation and windmill show lower precision and earlier recall degradation, reflecting the inherent difficulty in detecting small or densely distributed objects. On iSAID, despite overall lower APs caused by increased scene complexity, our method maintains stable detection performance across novel classes, demonstrating robustness under few-shot conditions.
To further validate the effectiveness of our framework in addressing practical challenges in few-shot object detection, we provide additional qualitative comparisons against a representative baseline method in Figure 10. From left to right, the examples span different novel categories in the DIOR dataset. In the first three columns, the baseline produces bounding boxes that are inaccurately positioned or excessively large, reflecting typical feature misalignment. In contrast, our method yields tighter and more precisely localized detections by leveraging the SGNA and TG-DH modules, which promote refined cross-scale alignment and stable semantic adaptation. In the last two columns, the baseline fails to detect several valid targets, highlighting the issue of structurally incomplete predictions caused by sparse or partial supervision. Our approach effectively reconstructs complete object structures by recursively expanding from semantically informative keypoints via the ELFAM module. These results further demonstrate our framework’s capability in mitigating both structural incompleteness and feature misalignment under few-shot settings.

4.6. Hyperparameters Sensitivity

A sensitivity analysis is carried out to investigate the effect of two key hyperparameters in the proposed framework: the confidence threshold

τ

used for pseudo-label filtering in the SGNA module, and the momentum coefficient

α

employed in the EMA [55] update of the teacher model. The results on split1 of the DIOR benchmark are summarized in Table 5 and Table 6, respectively. The results show stable performance across settings.

τ = 0.9

offers the best trade-off between novel and base classes, while

α = 0.99

achieves strong overall performance across most settings, offering a balanced trade-off between novel and base classes. Meanwhile,

α = 0.999

tends to favor base class stability, albeit with slightly reduced novel class detection accuracy in some cases. Therefore, we can conclude that our method is robust to hyperparameter variation and does not require extensive tuning to achieve strong and consistent performance.

4.7. Computational Cost Analysis

To analyze the computational efficiency of each proposed component, we examine both training and inference behaviors. Specifically, we measure the average time consumed per iteration during the fine-tuning phase, as well as the inference throughput, quantified by the number of images processed per second. These measurements help quantify the computational cost associated with different module configurations. The complete results are summarized in Table 7.

As shown, ELFAM introduces only marginal overhead, increasing training time slightly from 0.125 to 0.207 seconds per image while maintaining a high inference speed of 15.6 FPS. This is expected, as ELFAM mainly involves localized attention computations over multi-scale features, which are computationally lightweight and efficiently parallelizable.
In contrast, SGNA introduces a more noticeable increase in training time (up to 0.287 seconds per image), due to the inclusion of a teacher–student collaborative path. However, since the number of fine-tuning iterations in FSOD is relatively small compared to the base training phase, the overhead remains acceptable. The decrease in inference speed (down to 8.2 FPS) is also moderate and primarily caused by the additional pseudo-label generation and filtering procedures. These operations can be further optimized with CUDA-based acceleration.
When combining ELFAM and SGNA, the training time rises to 0.378 seconds per image and the inference speed drops to 7.5 FPS. Adding the TG-DH detection head contributes slightly more overhead (0.39 training time, 7.6 FPS), but this is offset by the overall performance gains observed in detection quality.

Overall, the added cost introduced by our modules remains within a practical range and can be effectively traded off for the observed improvements in detection completeness and robustness.

5. Conclusions

Few-shot object detection in remote sensing imagery remains challenging due to incomplete object structures and limited semantic adaptation under sparse supervision. In this work, we present a framework that jointly addresses local structure modeling and global feature alignment to improve novel category detection. By leveraging multi-scale recursive attention aggregation and a teacher–student collaborative adaptation strategy, our method reconstructs complete object representations while enhancing semantic consistency across categories. To further refine supervision at the detection head, we introduce a Teacher-Guided Dual-Branch design that stabilizes classification and localization learning. Extensive experiments on DIOR and iSAID benchmarks demonstrate that our approach achieves superior performance over existing methods across various few-shot settings. Future work will explore enhancing the adaptability to scale variations among objects, incorporating prototype-based differentiation mechanisms to better distinguish fine-grained categories, and refining hyperparameter selection strategies to further stabilize training under extremely low-shot conditions.

Author Contributions

Conceptualization, Y.Z. and X.L. (Xin Lyu); methodology, Y.Z.; software, Y.Z. and S.Z.; validation, Y.Z.; formal analysis, Y.Z. and J.C.; investigation, Y.Z., X.L. (Xin Lyu) and Y.F.; resources, Y.Z. and X.L. (Xin Lyu); data curation, Y.Z. and S.G.; writing—original draft preparation, Y.Z.; writing—review and editing, Y.Z., X.L. (Xin Lyu) and C.D.; visualization, Y.Z. and S.Z.; supervision, X.L. (Xin Lyu) and X.L. (Xin Li); funding acquisition, X.L. (Xin Lyu). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (Grant Nos. 2024YFC3210801, 2023YFC3209301 and 2023YFC3209201), the National Natural Science Foundation of China (Grant No. 62401196), and the Natural Science Foundation of Jiangsu Province (Grant No. BK20241508).

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Zhang, J.; Lei, J.; Xie, W.; Fang, Z.; Li, Y.; Du, Q. SuperYOLO: Super resolution assisted object detection in multimodal remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5605415. [Google Scholar] [CrossRef]
Li, J.; Zheng, K.; Gao, L.; Han, Z.; Li, Z.; Chanussot, J. Enhanced Deep Image Prior for Unsupervised Hyperspectral Image Super-Resolution. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5504218. [Google Scholar] [CrossRef]
Dai, L.; Liu, H.; Tang, H.; Wu, Z.; Song, P. Ao2-detr: Arbitrary-oriented object detection transformer. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 2342–2356. [Google Scholar] [CrossRef]
Zhang, X.; Feng, Y.; Zhang, S.; Wang, N.; Mei, S.; He, M. Semi-supervised person detection in aerial images with instance segmentation and maximum mean discrepancy distance. Remote Sens. 2023, 15, 2928. [Google Scholar] [CrossRef]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 27–30 June 2016. [Google Scholar] [CrossRef]
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the Computer Vision–ECCV: 16th European Conference, Glasgow, UK, 23–28 August 2020. [Google Scholar] [CrossRef]
Deng, J.; Dong, W.; Socher, R.; Li, L.; Li, K.; Li, F. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009. [Google Scholar] [CrossRef]
Lin, T.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollar, P.; Zitnick, C. Microsoft coco: Common objects in context. In Proceedings of the 13th European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014. [Google Scholar] [CrossRef]
Wang, Y.; Yao, Q.; Kwok, J.; Ni, L. Generalizing from a few examples: A survey on few-shot learning. ACM Comput. Surv. 2021, 53, 63. [Google Scholar] [CrossRef]
Li, X.; Xu, F.; Liu, F.; Tong, Y.; Lyu, X.; Zhou, J. Semantic Segmentation of Remote Sensing Images by Interactive Representation Refinement and Geometric Prior-Guided Inference. IEEE Trans. Geosci. Remote Sens. 2023, 62, 5400318. [Google Scholar] [CrossRef]
Lang, C.; Cheng, G.; Tu, B.; Han, J. Learning what not to segment: A new perspective on few-shot segmentation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar] [CrossRef]
Xiong, Z.; Li, H.; Zhu, X. Doubly deformable aggregation of covariance matrices for few-shot segmentation. In Proceedings of the 17th European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022. [Google Scholar] [CrossRef]
Deng, S.; Li, S.; Xie, K.; Song, W.; Liao, X.; Hao, A.; Qin, H. A global-local self-adaptive network for drone-view object detection. IEEE Trans. Image Process. 2021, 30, 1556–1569. [Google Scholar] [CrossRef]
Xu, G.; Song, T.; Sun, X.; Gao, C. TransMIN: Transformer-guided multi-interaction network for remote sensing object detection. IEEE Goesci. Remote Sens. Lett. 2023, 20, 6000505. [Google Scholar] [CrossRef]
Zou, Y.; Zhang, S.; Chen, K.; Tian, Y.; Wang, Y.; Moura, J. Compositional few-shot recognition with primitive discovery and enhancing. In Proceedings of the 28th ACM International Conference on Multimedia (MM), Virtual, 12–16 October 2020. [Google Scholar] [CrossRef]
Zou, Y.; Zhang, S.; Li, Y.; Li, R. Margin-based few-shot class incremental learning with class-level overfitting mitigation. In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
Sun, Q.; Liu, Y.; Chua, T.; Schiele, B. Meta-transfer learning for few-shot learning. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019. [Google Scholar] [CrossRef]
Xi, B.; Li, J.; Li, Y.; Song, R.; Hong, D.; Chanussot, J. Few-shot learning with class-covariance metric for hyperspectral image classification. IEEE Trans. Image Process. 2022, 31, 5079–5092. [Google Scholar] [CrossRef]
Zhao, Y.; Cheung, N. FS-BAN: Born-again networks for domain generalization few-shot classification. IEEE Trans. Image Process. 2023, 32, 2252–2266. [Google Scholar] [CrossRef] [PubMed]
Li, J.; Zheng, K.; Gao, L.; Ni, L.; Huang, M.; Chanussot, J. Model-Informed Multistage Unsupervised Network for Hyperspectral Image Super-Resolution. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5516117. [Google Scholar] [CrossRef]
Li, X.; Xu, F.; Liu, F.; Lyu, X.; Tong, Y.; Xu, Z. A Synergistical Attention Model for Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5400916. [Google Scholar] [CrossRef]
Kang, B.; Liu, Z.; Wang, X.; Yu, F.; Feng, J.; Darrell, T. Few-shot object detection via feature reweighting. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar] [CrossRef]
Han, G.; Ma, J.; Huang, S.; Chen, L.; Chang, S. Few-shot object detection with fully cross-transformer. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar] [CrossRef]
Han, J.; Ren, Y.; Ding, J.; Yan, K.; Xia, G. Few-shot object detection via variational feature aggregation. In Proceedings of the 37th AAAI Conference on Artificial Intelligence (AAAI), Washington, DC, USA, 7–14 February 2023. [Google Scholar]
Wang, X.; Huang, T.; Darrell, T.; Gonzalez, J.; Yu, F. Frustratingly simple few-shot object detection. arXiv 2020, arXiv:2003.06957. [Google Scholar]
Sun, B.; Li, B.; Cai, S.; Yuan, Y.; Zhang, C. Fsce: Few-shot object detection via contrastive proposal encoding. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar] [CrossRef]
Qiao, L.; Zhao, Y.; Li, Z.; Qiu, X.; Wu, J.; Zhang, C. DeFRCN: Decoupled faster R-CNN for few-shot object detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar] [CrossRef]
Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014. [Google Scholar] [CrossRef]
Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 11–18 December 2015. [Google Scholar] [CrossRef]
Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 8–16 October 2016. [Google Scholar] [CrossRef]
Zhu, X.; Su, W.; Lu, L.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv 2021, arXiv:2010.04159. [Google Scholar]
Zhang, H.; Li, F.; Liu, S. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
Li, L.; Zhang, P.; Zhang, H.; Yang, J. Grounded Language-Image Pre-training. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar] [CrossRef]
Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In Proceedings of the 18th European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024. [Google Scholar] [CrossRef]
Zhu, C.; Chen, F.; Ahmed, U.; Shen, Z.; Savvides, M. Semantic relation reasoning for shot-stable few-shot object detection. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar] [CrossRef]
Guirguis, K.; Hendawy, A.; Eskandar, G.; Abdelsamad, M. Cfa: Constraint-based finetuning approach for generalized few-shot object detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar] [CrossRef]
Yan, X.; Chen, Z.; Xu, A.; Wang, X.; Liang, X.; Lin, L. Meta r-cnn: Towards general solver for instance-level low-shot learning. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar] [CrossRef]
Wang, C.; Yan, T.; Huang, W.; Chen, X.; Xu, K. APANet: Asymmetrical Parallax Attention Network For Efficient Stereo Image Deraining. IEEE Trans. Comput. Imaging 2025, 11, 101–115. [Google Scholar] [CrossRef]
Li, X.; Xu, F.; Yu, A.; Lyu, X.; Gao, H.; Zhou, J. A Frequency Decoupling Network for Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5607921. [Google Scholar] [CrossRef]
Karlinsky, L.; Shtok, J.; Harary, S.; Schwartz, E.; Aides, A. Repmet: Representative-based metric learning for classification and few-shot object detection. In Proceedings of the 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019. [Google Scholar] [CrossRef]
Li, J.; Zheng, K.; Li, Z.; Gao, L.; Jia, X. X-shaped interactive autoencoders with cross-modality mutual learning for unsupervised hyperspectral image super-resolution. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5518317. [Google Scholar] [CrossRef]
Huang, X.; He, B.; Tong, M.; Wang, D.; He, C. Few-shot object detection on remote sensing images via shared attention module and balanced fine-tuning strategy. Remote Sens. 2021, 13, 3816. [Google Scholar] [CrossRef]
Wolf, S.; Meier, J.; Sommer, L.; Beyerer, J. Double head predictor based few-shot object detection for aerial imagery. In Proceedings of the 2021 IEEE International Conference on Computer Vision Workshops, Montreal, BC, Canada, 11–17 October 2021. [Google Scholar] [CrossRef]
Li, Z.; Lie, R.; Sun, L.; Zheng, Y. Multi-Feature Cross Attention-Induced Transformer Network for Hyperspectral and LiDAR Data Classification. Remote Sens. 2024, 16, 2775. [Google Scholar] [CrossRef]
Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H. Deformable convolutional networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar] [CrossRef]
Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
Zamir, S.; Arora, A.; Gupta, A.; Khan, S.; Sun, G.; Khan, F.S.; Zhu, F.; Shao, L.; Xia, G.; Bai, X. iSAID: A large-scale dataset for instance segmentation in aerial images. arXiv 2019, arXiv:1905.12886. [Google Scholar]
Li, X.; Deng, J.; Fang, Y. Few-shot object detection on remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5601614. [Google Scholar] [CrossRef]
Cheng, G.; Yan, B.; Shi, P.; Li, K.; Yao, X.; Guo, L.; Han, J. Prototype-CNN for few-shot object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5604610. [Google Scholar] [CrossRef]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 27–30 June 2016. [Google Scholar] [CrossRef]
Lin, T.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
Araslanov, N.; Roth, S. Self-supervised augmentation consistency for adapting semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2017. [Google Scholar] [CrossRef]
Xiong, Z.; Zhang, F.; Wang, Y.; Shi, Y.; Zhu, X. EarthNets: Em powering AI in Earth Observation. arXiv 2022, arXiv:2210.04936. [Google Scholar]
Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open mmlab detection toolbox and benchmark. arXiv 2019, arXiv:1906.07155. [Google Scholar]
Zhang, Z.; Hao, J.; Pan, C.; Ji, G. Oriented feature augmentation for few-shot object detection in remote sensing images. In Proceedings of the 2021 IEEE International Conference on Computer Science, Electronic Information Engineering and Intelligent Control Technology (CEI), Fuzhou, China, 24–26 September 2021. [Google Scholar] [CrossRef]
Zhao, Z.; Tang, P.; Zhao, L.; Zhang, Z. Few-shot object detection of remote sensing images via two-stage fine-tuning. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8021805. [Google Scholar] [CrossRef]
Wang, Y.; Xu, C.; Liu, C.; Li, Z. Context information refinement for few-shot object detection in remote sensing images. IEEE Geosci. Remote Sens. Lett. 2022, 14, 3255. [Google Scholar] [CrossRef]
Li, R.; Zeng, Y.; Wu, J.; Wang, Y.; Zhang, X. Few-shot object detection of remote sensing image via calibration. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6513105. [Google Scholar] [CrossRef]
Zhang, Y.; Zhang, B.; Wang, B. Few-shot object detection with self-adaptive global similarity and two-way foreground stimulator in remote sensing images. IEEE Geosci. Remote Sens. Lett. 2022, 15, 7263–7276. [Google Scholar] [CrossRef]
Zhang, S.; Song, F.; Liu, X.; Hao, X.; Liu, Y.; Lei, T.; Jiang, P. Text semantic fusion relation graph reasoning for few-shot object detection on remote sensing images. Remote Sens. 2023, 15, 1187. [Google Scholar] [CrossRef]
Zhang, F.; Shi, Y.; Xiong, Z.; Zhu, X. Few-shot Object Detection in Remote Sensing: Lifting the Curse of Incompletely Annotated Novel Objects. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5603514. [Google Scholar] [CrossRef]
Zhang, T.; Zhang, X.; Zhu, P.; Jia, X.; Tang, X.; Jiao, L. Generalized few-shot object detection in remote sensing images. ISPRS J. Photogramm. Remote Sens. 2023, 195, 353–364. [Google Scholar] [CrossRef]
Xiao, Y.; Lepetit, V.; Marlet, R. Few-shot object detection and viewpoint estimation for objects in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 3090–3106. [Google Scholar] [CrossRef]

Figure 1. Illustration of feature drift caused by annotation deficiencies in few-shot object detection for remote sensing. In few-shot training scenarios, partial annotations and missing or noisy bounding boxes are common, especially in remote sensing images. These issues prevent the model from capturing complete object semantics and introduce inaccurate supervision. As a result, the learned representations of novel categories deviate from their true distribution and drift toward the background or base classes in the global feature space, leading to misclassification and degraded detection performance.

Figure 2. Overview of the proposed few-shot object detection framework. The framework consists of two stages. In the base training stage, the model learns generalizable representations from base classes, with ELFAM enhancing local structure features through multi-scale aggregation. In the fine-tuning stage, SGNA RPN generates pseudo novel proposals via a teacher-student scheme to improve feature stability, while TG-Dual BBox Head fuses pseudo and ground truth labels for better novel class adaptation.

Figure 3. Illustration of the Extensible Local Feature Aggregator Module (ELFAM). Starting from keypoint seeds on each scale, the module performs recursive local expansion across multi-scale feature maps, guided by adaptive attention. Features from different scales are aggregated to form a complete object representation with enhanced local structure.

Figure 4. Structure of the Self-Guided Novel Adaptation (SGNA) module. A teacher–student RPN framework is used, where the teacher generates high-confidence pseudo proposals, which are fused with novel ground truths to enhance supervision. The teacher is updated via EMA, enabling stable guidance and effective feature adaptation for novel classes.

Figure 5. Architecture of the Teacher-Guided Dual-Branch BBox Head (TG-DH). The Student head is trained using fused labels comprising ground truths and teacher-generated pseudo-labels, while the Teacher head is updated via EMA and provides stable soft predictions. This collaborative dual-head design enhances classification and regression under limited supervision.

Figure 6. Visualization of few-shot detection results on the DIOR dataset under the 20-shot setting, following the base–novel class split1 defined in [51].

Figure 7. Visualization of few-shot detection results on iSAID dataset under the 100-shot setting, following the base–novel class split1 defined in [46].

Figure 8. Few-shot object detection results on the DIOR dataset under different ablation configurations, with the number of novel shots set to

K = 20

. The base–novel class split follows the split1 protocol defined in [51].

Figure 8. Few-shot object detection results on the DIOR dataset under different ablation configurations, with the number of novel shots set to

K = 20

. The base–novel class split follows the split1 protocol defined in [51].

Figure 9. Class-wise precision–recall curves on DIOR (left) and iSAID (right) datasets. Each curve is computed by matching predicted boxes with ground truth using an IoU threshold of 0.5, followed by confidence-based ranking to derive precision and recall at different thresholds.

Figure 10. Few-shot object detection results on the DIOR dataset under different detection methods, with the number of novel shots set to

K = 20

. The base–novel class split follows the split1 protocol defined in [51]. From top to bottom: ground truth, OFA [58] predictions, and results of our model.

Figure 10. Few-shot object detection results on the DIOR dataset under different detection methods, with the number of novel shots set to

K = 20

. The base–novel class split follows the split1 protocol defined in [51]. From top to bottom: ground truth, OFA [58] predictions, and results of our model.

Table 1. Average Precision (AP, %) at an IoU threshold of 0.5 on DIOR dataset, following the base–novel split protocol defined in [51].

Method	Novel Classes				Base Classes
Method	3-Shot	5-Shot	10-Shot	20-Shot	3-Shot	5-Shot	10-Shot	20-Shot
OFA [58]	32.8	37.9	40.7	-	-	-	-	-
FSODM [51]	-	25	32	36	-	-	-	-
SAM & BFS [45]	-	38.3	47.3	50.9	-	-	-	-
PAMS-Det [59]	28	33	38	-	-	-	-	-
CIR-FSD [60]	-	33	38	43	-	-	-	-
TFACSC [61]	38	42	47	-	-	-	-	-
SAGS & TFS [62]	-	34	37	42	-	-	-	-
TSF-RGR [63]	-	42	49	54	-	-	-	-
ST-FSOD [64]	43.5	48.3	55.8	61.3	75.0	75.6	74.4	73.2
Ours	50.9	52.2	61.9	65.2	76.1	76.0	76.6	73.4

Table 2. Average Precision (AP, %) at an IoU threshold of 0.5 on DIOR dataset, following the base–novel split protocol defined in [52].

Shots	Methods	Novel Classes				Base Classes
Shots	Methods	split1	split2	split3	split4	split1	split2	split3	split4
3	P-CNN [52]	18.0	14.5	16.5	15.2	47.0	48.9	49.5	49.8
	SAGS & TFS [62]	29.3	12.6	20.9	17.5	-	-	-	-
	G-FSOD [65]	27.6	14.1	16.0	16.7	68.9	69.2	71.1	69.0
	ST-FSOD [64]	41.9	17.7	20.9	20.4	73.5	72.5	75.2	73.3
	Ours	34.8	21.9	32.1	28.6	74.5	74.2	76.8	75.8
5	P-CNN [52]	22.8	14.9	18.8	17.5	48.4	49.1	49.9	49.9
	SAGS & TFS [62]	31.6	15.5	24.8	19.7	-	-	-	-
	G-FSOD [65]	30.5	15.8	23.3	21.0	69.5	69.3	70.2	68.0
	ST-FSOD [64]	45.7	20.7	26.0	25.2	73.3	72.7	75.6	73.5
	Ours	42.8	25.9	34.6	35.6	74.5	75.0	76.1	75.7
10	P-CNN [52]	27.6	18.9	23.3	18.9	50.9	52.5	52.1	51.7
	SAGS & TFS [62]	31.6	15.5	24.8	19.7	-	-	-	-
	G-FSOD [65]	37.5	20.7	26.2	25.8	69.0	68.7	71.1	68.6
	ST-FSOD [64]	50.0	27.3	31.3	33.4	72.6	72.3	75.7	73.9
	Ours	52.0	39.6	38.1	46.7	74.2	74.7	75.6	74.1
20	P-CNN [52]	29.6	22.8	28.8	25.7	52.2	51.6	53.1	52.3
	SAGS & TFS [62]	40.2	23.8	36.1	27.7	-	-	-	-
	G-FSOD [65]	39.8	22.7	32.1	31.8	69.8	68.2	71.3	67.7
	ST-FSOD [64]	53.7	33.4	34.6	38.2	73.3	73.3	75.5	73.8
	Ours	47.6	44.5	46.4	50.2	73.6	74.5	76.9	75.5

Table 3. Average Precision (AP, %) at an IoU threshold of 0.5 on iSAID dataset, evaluated under the three base–novel splits defined in [46].

Shots	Methods	Novel Classes			Base Classes
Shots	Methods	split1	split2	split3	split1	split2	split3
10	FSDetView [66]	1.3	8.7	4.6	33.8	29.8	32.9
	TFA [26]	3.3	9.0	3.8	58.6	56.5	59.0
	DH-FSDet [46]	5.2	14.5	9.7	65.0	64.5	67.8
	ST-FSOD [64]	10.2	17.7	14.0	63.7	62.4	66.1
	Ours	12.9	27.9	17.7	64.5	63.4	66.8
50	FSDetView [66]	7.2	26.8	17.1	35.3	30.0	34.6
	TFA [26]	4.7	12.1	5.6	60.7	58.5	60.9
	DH-FSDet [46]	12.8	28.9	19.6	65.1	64.7	68.0
	ST-FSOD [64]	24.8	39.3	31.1	62.8	62.6	65.8
	Ours	34.2	41.9	33.0	64.1	63.0	67.2
100	FSDetView [66]	10.2	32.8	24.1	36.4	30.4	34.5
	TFA [26]	5.0	14.4	5.4	61.4	59.2	61.6
	DH-FSDet [46]	16.7	36.0	23.1	65.2	64.8	68.1
	ST-FSOD [64]	34.3	45.0	33.0	63.3	62.9	65.6
	Ours	38.4	45.3	34.4	65.0	62.0	67.4

Table 4. Ablation results on the first split of the DIOR dataset, as defined in [51]. All experiments are conducted with a fixed random seed to ensure comparability. The column “w/FT” indicates whether the fine-tuning strategy introduced in [26] is applied. Reported results include detection performance on both novel and base classes across different shot configurations, illustrating the contribution of each component to novel class adaptation and base class stability.

Components				Novel Classes				Base Classes
w/FT	ELFAM	SGNA	TG-DH	3	5	10	20	3	5	10	20
✓				20	26.6	33.2	40.9	74.4	65.8	66.8	66.9
✓	✓			25.3	32.2	38.6	48.5	74.8	75.4	75.2	74.3
✓		✓		28.2	33.4	44.2	49.6	74.2	74.8	75.1	74.7
✓			✓	37	32.5	37.1	42.3	69	55.4	56	57
✓	✓	✓		21.9	33.9	41.1	52.5	75.1	74.9	75	74.7
✓	✓		✓	46.3	46.1	50.9	56	69	67.8	68.3	68.7
✓	✓	✓	✓	50.9	52.2	61.9	65.2	76.1	76	76.6	73.4

Table 5. Sensitivity analysis of the EMA momentum coefficient

τ

in the teacher update, conducted under

α = 0.99

on split1 of DIOR benchmark.

Table 5. Sensitivity analysis of the EMA momentum coefficient

τ

in the teacher update, conducted under

α = 0.99

on split1 of DIOR benchmark.

$τ$	Novel Classes				Base Classes
$τ$	3	5	10	20	3	5	10	20
0.9	48.9	52.6	58.3	60.6	76.2	71.8	73.8	73.8
0.95	50.9	52.2	61.9	65.2	76.1	76.6	76.6	73.4
0.99	48.3	55.4	62.2	65.9	76.2	75.8	75.7	75.7

Table 6. Sensitivity analysis of the EMA momentum coefficient

α

in the teacher update, conducted under

τ = 0.9

on split1 of DIOR benchmark.

Table 6. Sensitivity analysis of the EMA momentum coefficient

α

in the teacher update, conducted under

τ = 0.9

on split1 of DIOR benchmark.

$α$	Novel Classes				Base Classes
$α$	3	5	10	20	3	5	10	20
0.9	50.7	54.7	62.1	64.8	75.3	76.4	75.5	76.6
0.99	50.9	52.2	61.9	65.2	76.1	76.6	76.6	73.4
0.999	50.3	51.8	61.1	68.2	76.3	76.3	75.4	75.4

Table 7. Computational cost of the proposed method on DIOR dataset, including per-iteration training duration and inference speed measured in frames per second.

ELFAM	SGNA	TG-DH	Training Times (sec/img)	Inference FPS (img/sec)
			0.125	18.7
✓			0.287	8.2
	✓		0.207	15.6
✓	✓		0.378	7.5
✓	✓	✓	0.39	7.6

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhang, Y.; Lyu, X.; Li, X.; Zhou, S.; Fang, Y.; Ding, C.; Gao, S.; Chen, J. Complementary Local–Global Optimization for Few-Shot Object Detection in Remote Sensing. Remote Sens. 2025, 17, 2136. https://doi.org/10.3390/rs17132136

AMA Style

Zhang Y, Lyu X, Li X, Zhou S, Fang Y, Ding C, Gao S, Chen J. Complementary Local–Global Optimization for Few-Shot Object Detection in Remote Sensing. Remote Sensing. 2025; 17(13):2136. https://doi.org/10.3390/rs17132136

Chicago/Turabian Style

Zhang, Yutong, Xin Lyu, Xin Li, Siqi Zhou, Yiwei Fang, Chenlong Ding, Shengkai Gao, and Jiale Chen. 2025. "Complementary Local–Global Optimization for Few-Shot Object Detection in Remote Sensing" Remote Sensing 17, no. 13: 2136. https://doi.org/10.3390/rs17132136

APA Style

Zhang, Y., Lyu, X., Li, X., Zhou, S., Fang, Y., Ding, C., Gao, S., & Chen, J. (2025). Complementary Local–Global Optimization for Few-Shot Object Detection in Remote Sensing. Remote Sensing, 17(13), 2136. https://doi.org/10.3390/rs17132136

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Complementary Local–Global Optimization for Few-Shot Object Detection in Remote Sensing

Abstract

1. Introduction

2. Related Works

2.1. Object Detection: From CNNs to Transformers

2.2. Few-Shot Object Detection

2.3. Few-Shot Object Detection for Remote Sensing Images

3. Method

3.1. Overview

3.2. Problem Formulation

3.3. Extensible Local Feature Aggregator Module

3.4. Self-Guided Novel Adaptation RPN

3.5. Teacher-Guided Dual-Branch Head

4. Experiments

4.1. Datasets and Evaluation Protocol

4.2. Experiment Setting and Implementation Details

4.3. Quantitative Results

4.4. Qualitative Results

4.5. Ablation Studies

4.6. Hyperparameters Sensitivity

4.7. Computational Cost Analysis

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI