Article

Low-Shot Weakly Supervised Object Detection for Remote Sensing Images via Part Domination-Based Active Learning and Enhanced Fine-Tuning

1 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
2 Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, Chinese Academy of Sciences, Beijing 100190, China
3 Key Laboratory of Target Cognition and Application Technology (TCAT), Chinese Academy of Sciences, Beijing 100190, China
4 School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(7), 1155; https://doi.org/10.3390/rs17071155
Submission received: 17 January 2025 / Revised: 20 March 2025 / Accepted: 20 March 2025 / Published: 25 March 2025

Abstract:
In low-shot weakly supervised object detection (LS-WSOD), a small number of strong (instance-level) labels are introduced to a weakly (image-level) annotated dataset, thus balancing annotation costs and model performance. To address issues in LS-WSOD in remote sensing images (RSIs) such as part domination, context confusion, class imbalance, and noise, we propose a novel active learning strategy and an enhanced fine-tuning mechanism. Specifically, we designed a part domination-based adaptive active learning (PDAAL) strategy to discover the most informative and challenging samples for instance-level annotation. PDAAL also applies an adaptive threshold to balance sampling frequencies for long-tailed class distributions. For enhanced fine-tuning, we first developed a parameter-efficient attention for context (PAC) module that learns spatial attention relationships, mitigating context confusion and accelerating the convergence of fine-tuning. Furthermore, we present an adaptive category resampling for tuning (ACRT) mechanism for resampling strong annotation data. ACRT helps refine the model at different active learning stages, especially for under-performing classes, and reduces the impact of noisy predictions. Experimental results on the NWPU VHR-10.v2 and DIOR datasets show that our method outperforms state-of-the-art LS-WSOD baselines by 4.5% and 3.1% in mAP, respectively, demonstrating that our framework offers an efficient solution for LS-WSOD in RSIs.

1. Introduction

In recent years, weakly supervised object detection (WSOD) and low-shot WSOD (LS-WSOD) have gained significant traction in both traditional computer vision and remote sensing image (RSI) applications. WSOD utilizes image-level labels that only contain the classification information of the samples, while LS-WSOD introduces a small number of instance-level labels into WSOD. WSOD methods significantly reduce the demand for detailed manual annotations, enabling large-scale exploitation of partially labeled data, while LS-WSOD further enhances detection performance with a small number of bounding box annotations. Most research on WSOD has primarily focused on integrating multiple-instance learning (MIL) techniques into deep neural network architectures, tackling persistent challenges such as part domination, missed detections, and context confusion [1,2,3,4,5,6]. As these MIL-based WSOD approaches have matured, researchers have recognized their potential in the RSI domain [7,8,9,10,11], where expansive scenes, complex geospatial features, and unbalanced categories create additional detection hurdles.
Nevertheless, WSOD frameworks remain much less accurate than fully supervised frameworks, highlighting the need for more robust strategies. To narrow this gap, recent studies have introduced the concept of low-shot weakly supervised learning [12,13,14,15], which involves supplementing weak labels with a limited amount of strong supervision (instance-level labels). These LS-WSOD methods can considerably improve detection fidelity. By injecting a small set of fully annotated examples, LS-WSOD offers a more efficient means of boosting performance without incurring the high costs associated with labeling every instance. The experiments in these studies show that using a very small number of strong labels (10 shots at most) can significantly improve the performance of WSOD. This demonstrates the importance of low-shot learning in the field of weakly supervised object detection, as well as the strong potential for utilizing the information in weakly supervised labels.
Despite substantial advancements, existing WSOD and LS-WSOD approaches still need to address the following issues:
(1) Part domination. In WSOD, because only image-level labels are provided, the training process lacks explicit information about object location and size. Consequently, the network often prioritizes the most discriminative local features of the object in the image-level classification task, as shown in Figure 1. This overreliance on local features frequently leads the model to focus on a small portion of the object. As a result, WSOD methods tend to produce bounding boxes that are too small or excessively localized, failing to encompass the entire object region. Affected by part domination, detection accuracy suffers significantly. For example, in conventional vision tasks, WSOD models may detect only a cat’s head instead of the entire cat; in remote sensing images, they may detect only the opening of a chimney rather than the full chimney structure.
(2) Context confusion. As only image-level labels are available, WSOD models in training often overemphasize background and contextual features that frequently co-occur with the object, leading to “context confusion”. Specifically, the model may regard frequently appearing background elements as part of the object itself, which easily results in false positives or missed detections, as shown in Figure 1. This reliance on contextual information not only decreases detection accuracy but also limits the applicability of weakly supervised methods in complex environments. In remote sensing images, where many object categories exhibit strong geographical correlations (e.g., bridges with rivers, dams with reservoirs, and harbors with ships), context confusion becomes even more pronounced.
(3) Class imbalance. A long-tailed distribution is commonly observed in most object detection datasets and practical applications, indicating that samples of different classes are unevenly distributed within the dataset. In LS-WSOD, class imbalance leads to fewer strongly labeled samples being selected for minority classes, thereby causing a significant drop in performance for under-performing categories. This issue is particularly prominent in remote sensing tasks, where certain ground objects (e.g., dams and golf fields) are far rarer than others (e.g., vehicles and ships), further exacerbating the class imbalance problem.
(4) Excessive noise. Current state-of-the-art LS-WSOD approaches still rely on WSOD models or their outputs for fine-tuning or refinement. However, as WSOD methods generally yield relatively low performance, their models contain a considerable number of errors. Consequently, these LS-WSOD methods are still not capable of effectively harnessing strongly labeled data to overcome the large volume of noise and achieve efficient training. As a result, excessive noise prolongs the training process and adversely affects the model’s performance.
Facing the issues above, potential research directions for LS-WSOD include the following three alternatives. First, active learning strategies can be used, where we strategically choose the most informative samples for strong labeling instead of randomly selecting samples. Specifically, for the LS-WSOD task, active learning can select samples severely affected by part domination. At the same time, the long-tailed class distribution should be taken into consideration. In this way, active learning could alleviate part domination and class imbalance and improve how the model handles the geospatial complexities of remote sensing images. Second, effective attention mechanisms can be incorporated to enable the model to more clearly distinguish between foreground and background features, thus reducing context confusion and improving the effectiveness of the fine-tuning process. Finally, the model’s detection performance can be actively evaluated during multi-stage fine-tuning. By discovering under-performing classes and adaptively modifying training strategies, the model can better overcome the noise introduced by WSOD methods.
To address the aforementioned issues, we propose a part domination-based active learning and enhanced fine-tuning (PDEF) framework. PDEF includes three novel modules: part domination-based adaptive active learning (PDAAL) is an active learning strategy that identifies the part domination pattern in model outputs to select hard samples for strong label annotation. Meanwhile, it employs an adaptive confidence threshold as a preliminary filter to balance sampling frequencies across classes with imbalanced, long-tailed distributions. Finally, it adopts a least-confidence method to further pinpoint the most challenging samples for the model. Parameter-efficient attention for context (PAC) introduces a parameter-efficient attention mechanism that can be effectively trained during fine-tuning. Unlike the convolutional block attention module (CBAM) and crossed post-decoder refinement (CPDR) [16,17], PAC utilizes one-dimensional (1D) horizontal and vertical convolutions for feature extraction instead of 2D convolutions, improving parameter efficiency while achieving more precise feature representation. Moreover, 1D convolutions can obtain larger receptive fields with a lower parameter cost, which helps capture long attention relationships within images, enhancing the model’s focus on context. As a result, it alleviates context confusion and accelerates the convergence of the training loss. Adaptive category resampling for tuning (ACRT) is employed in multi-stage active learning, where categories with suboptimal performance are resampled along with the corresponding samples in each stage. This process, supported by strong label annotations, enables the model to correct errors from WSOD through fine-tuning, thereby mitigating issues introduced by extensive noise. In summary, the key contributions of this study are as follows:
1. We propose PDAAL to replace the traditional random strong label sampling in LS-WSOD, specifically addressing the part domination issue commonly encountered in WSOD. Through the discovery of the part domination pattern, adaptive parameters, and confidence evaluation, PDAAL identifies difficult samples, reduces sampling imbalance, and improves the quality of the selected strong annotation samples.
2. We develop the PAC module to mitigate context confusion in WSOD models. Through the parameter-efficient design of 1D depthwise convolutions, PAC can be effectively trained, even in short-step fine-tuning.
3. We adopt a multi-stage active learning approach and introduce ACRT. In each stage, we dynamically evaluate performance for each category and resample classes with poor model performance using a targeted approach. This effectively corrects errors in the original WSOD model without significantly increasing training costs.
4. To the best of our knowledge, this is the first work to apply active learning to remote sensing LS-WSOD. We further integrate PAC and ACRT into our PDEF framework, achieving state-of-the-art results on the NWPU VHR-10.v2 [18] and DIOR [19] datasets. Extensive experiments verify the effectiveness of each proposed module.
In summary, our objective in this study is to propose a novel low-shot weakly supervised object detection framework for remote sensing images that is specifically designed to overcome part domination, context confusion, class imbalance, and noise. To achieve this, we introduce a PDAAL strategy, a PAC module, and an ACRT mechanism, thus enhancing detection performance under limited-annotation conditions.
The remainder of this paper is structured as follows: Section 2 summarizes related research on WSOD, LS-WSOD, and active learning methods in object detection. Section 3 describes our proposed framework; the overall structure and loss functions are introduced first, followed by each proposed component—PDAAL, PAC, and ACRT—individually. Section 4 details the experimental setup and analyzes the results of an ablation study, comparative experiments, qualitative analysis, and hyperparameter experiments. Section 5 concludes the paper and outlines future research directions.

2. Related Work

2.1. Weakly Supervised Object Detection

Weakly supervised object detection aims to train detectors using only image-level annotations, thereby significantly reducing the reliance on detailed bounding box labels. In the field of deep learning, Bilen and Vedaldi [1] introduced the weakly supervised deep detection network (WSDDN), which first scores multiple proposals based on MIL, establishing a revolutionary method for WSOD. Building on this idea and aiming to alleviate part domination, Tang et al. [2] proposed the online instance classifier refinement (OICR) framework, which iteratively refines instance classifiers to improve detection performance. In addition, Tang et al. [3] further enhanced localization accuracy by clustering similar proposals via proposal cluster learning (PCL).
The aforementioned methods remain commonly used baselines for WSOD. Subsequent research focused on addressing WSOD challenges, including but not limited to part domination, context confusion, and proposal quality. In a previous study [20], a comprehensive attention self-distillation (CASD) method that distills attention maps was developed to boost feature quality. Ren et al. [21] integrated instance-aware, context-focused, and memory-efficient strategies to better differentiate between object and background regions and address part domination and context confusion. In [22], negative deterministic information was employed to correct model predictions in the presence of noisy pseudo-labels. In [23], a segmentation–detection collaborative network was proposed, where segmentation maps are utilized as spatial priors for object proposals, helping to differentiate objects from the background. Cao et al. [24] proposed a category transfer framework that leverages both visually discriminative and semantically correlated category information from fully supervised datasets to enhance the object-classification ability. In [25], a cascade attentive dropout strategy was proposed alongside an enhanced global context module to guide the model toward a more comprehensive understanding of global context. In [26], the authors developed an interactive end-to-end WSOD framework. By incorporating two types of context information—instance-wise and semantic-wise correlation—this study helps to differentiate objects from the background, thereby reducing context confusion. In [6], WeakSAM was proposed, which integrates the segment anything model (SAM) with adaptive pseudo-ground-truth generation and region-of-interest drop regularization to address incompleteness and noise in WSOD.
WSOD in remote sensing addresses the unique challenges of detecting objects in high-resolution satellite and aerial imagery, where objects often vary in scale and are densely packed. In [27], dynamic pseudo-label generation was introduced, in which pseudo-labels are iteratively refined to improve detection accuracy. Yao et al. [28] proposed automatic weakly supervised object detection via dynamic curriculum learning, in which the learning curriculum is dynamically adjusted to improve detection. The angle-free weakly supervised rotating object detection framework (AFWS), proposed in [29], addresses the challenge of detecting rotated objects in remote sensing imagery. Qian et al. [30] introduced complete and invariant instance classifier refinement, utilizing multi-instance learning and soft-label strategies to suppress classification errors in WSOD for remote sensing images. Qian et al. [11] introduced SAM-induced pseudo fully supervised learning, in which the segment anything model (SAM) is leveraged to generate pseudo-labels for WSOD in remote sensing. In [31], progressive image-level and instance-level feature refinement was proposed, refining features at both image and instance levels to improve WSOD in remote sensing.
In the abovementioned WSOD studies, significant efforts have been made to address part domination and context confusion. However, these issues remain inadequately resolved due to the limitations of weak labels lacking precise object localization information. Therefore, introducing a small number of strong labels becomes a feasible approach to enhancing WSOD performance.

2.2. Low-Shot Weakly Supervised Object Detection

Low-shot weakly supervised object detection aims to address the challenge of introducing a small number of strong labels to boost detection performance in the WSOD task. It is worth noting that LS-WSOD differs from semi-supervised (SSOD) and hybrid-supervised object detection. SSOD utilizes both unlabeled and fully labeled data and generally requires more of the latter compared to LS-WSOD. Hybrid supervision aims to develop models that can incorporate various forms of annotations, allowing different types of labels to be used in training. By contrast, LS-WSOD focuses on significantly improving model performance using only a small number of strong labels (e.g., 10 shots).
In a relatively early study [32], a framework combining weakly and semi-supervised learning with an expectation–maximization algorithm was proposed to iteratively refine object detection models using limited annotations. Wang et al. [12] introduced a curriculum learning approach for Faster R-CNN (faster region-based convolutional neural network), enabling effective training with weak and semi-supervised data by progressively increasing the complexity of the learning tasks. Pan et al. [13] proposed low-shot box correction, which refines bounding box predictions in weakly supervised settings by leveraging low-shot learning techniques to improve localization accuracy. Biffi et al. [14] presented a method for learning from mixed supervision, combining low-shot and many-shot annotations to enhance object detection performance in data-scarce scenarios. Recent advancements include [15], which introduced the instance transformation network (ITNet). Designed specifically for weakly supervised object detection in remote sensing images, the ITNet addresses the challenges of limited annotations and complex object distributions in RSIs.
By introducing a small number of strong labels, LS-WSOD has significantly improved detection performance and alleviated issues such as part domination and context confusion. However, most previous studies have relied on random sampling for strong label annotation, which fails to select the most informative samples that the model needs the most. Additionally, due to the imbalanced, long-tailed distribution of different sample types within datasets, under-performing categories are not fairly sampled. Addressing these challenges remains a critical issue for LS-WSOD and is the focus of the method proposed in this paper.

2.3. Active Learning for Object Detection

Active learning (AL) for object detection and weakly supervised object detection aims to reduce annotation costs by strategically selecting the most informative samples for strong labeling to train detectors. These approaches address the high annotation costs and the lack of precise localization labels in large-scale datasets. In the context of AL for object detection, a probabilistic modeling framework was proposed in [33] to select uncertain samples for annotation, improving the efficiency of deep object detection models. Polk et al. [34] introduced the active diffusion and vertex-component-analysis-assisted image segmentation (ADVIS) algorithm. This work selects high-purity, high-density pixels that are distant in the diffusion space for annotation and then propagates these labels across the image. The proposed active learning approach significantly enhances material discrimination in hyperspectral images compared to fully unsupervised clustering methods. This efficient representative sample selection strategy in the segmentation field is worth adopting in the object detection domain. Yu et al. [35] introduced a consistency-based AL method that leverages model predictions across different augmentations to identify informative samples for labeling. Liang et al. [36] presented MUS-CDB, a mixed uncertainty sampling approach with class distribution balancing, specifically designed for active annotation in aerial object detection. Yang et al. [37] proposed plug and play active learning (PPAL), a two-stage method combining difficulty-calibrated uncertainty sampling and category-conditioned matching similarity. PPAL efficiently selects samples for labeling without modifying detector architectures. Ghita et al. [38] introduced ActiveAnno3D, an AL framework for multi-modal 3D object detection, which selects highly informative samples for labeling to minimize annotation costs while maintaining detection performance. In the context of AL for WSOD, Vo et al. [39] explored active learning strategies for weakly supervised object detection and proposed methods to select the most informative weak annotations to improve detection performance. Wang et al. [40] proposed active learning for weakly supervised object detection (ALWOD), a framework combining AL with weakly and semi-supervised object detection, utilizing a student–teacher disagreement mechanism to select informative samples for annotation. These works collectively demonstrate the potential of AL for WSOD to bridge the gap between fully and weakly supervised object detection, achieving significant improvements in data efficiency and model performance.
In LS-WSOD, the random selection of samples for strong annotations causes uncertainty in sample quality, and active learning can address this issue. However, few active learning strategies have been proposed for weakly supervised object detection, and current strategies largely overlook the effectiveness of multi-stage fine-tuning and the issue of noise. The method proposed in this paper provides a part domination-based adaptive active learning strategy and an enhanced fine-tuning mechanism as solutions to these problems.

3. Methods

3.1. Overall Framework and Loss Functions

First, we introduce the overall framework. The proposed PDEF is trained on a dataset $I$ that consists of two subsets: the weakly supervised subset $I_\alpha$ and the strongly supervised subset $I_\beta$. As shown in Figure 2, we first pretrain a base model $M_0$ only on the weakly supervised dataset $I^0$, with the strongly supervised subset $I_\beta^0$ initially empty. Subsequently, through $N$-stage fine-tuning on the low-shot weakly supervised dataset $I^k$, the model is iteratively updated from $M_0$ to $M_k$. In each stage, PDAAL actively selects and annotates the most informative samples from $I_\alpha^{k-1}$, while ACRT handles data augmentation and oversampling to mitigate the scarcity of labeled samples of under-performing classes. Additionally, our proposed PAC is integrated to exploit contextual cues, further enhancing low-shot detection performance. By integrating new strong annotations and refining the model, PDEF effectively leverages limited strong labels to achieve robust object detection performance in RSIs.
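For concreteness, the staged procedure can be summarized in code. The following is a minimal Python sketch of the PDEF loop; all function names (`pretrain_wsod`, `pdaal_select`, `annotate`, `acrt_resample`, `fine_tune`) are illustrative placeholders rather than the actual implementation:

```python
# Minimal sketch of the PDEF multi-stage loop; every helper named here is a
# hypothetical placeholder standing in for the corresponding component.
def train_pdef(weak_set, num_stages, budget):
    model = pretrain_wsod(weak_set)            # M_0: weak (image-level) labels only
    strong_set = []                            # I_beta^0 starts empty
    for k in range(1, num_stages + 1):
        # PDAAL: pick the B most informative samples for strong annotation.
        picked = pdaal_select(model, weak_set, budget)
        weak_set = [s for s in weak_set if s not in picked]
        strong_set += annotate(picked)         # add instance-level labels
        # ACRT: oversample under-performing categories before fine-tuning.
        tuning_set = acrt_resample(model, strong_set)
        model = fine_tune(model, tuning_set)   # M_k
    return model
```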
Second, we introduce loss functions under different types of supervision. As mentioned above, PDEF involves two training phases: pretraining using only weak supervision labels and fine-tuning under low-shot strong supervision. Below, we will explain the loss calculations in these two scenarios.
Pretraining Loss Calculation: In the pretraining phase, the loss is computed using weak labels (image-level labels) without bounding box annotations. This is based on the MIL [21] paradigm, combined with pseudo-boxes. The overall loss consists of a combination of the MIL loss and refinement loss.
1. MIL Loss: The overall loss for MIL is

$$L_{\text{MIL}}(I, R, q) = -\sum_{c \in C} q_c \log \sum_{r \in R} p_{r,c}, \qquad (1)$$

where $I$ is the input image, $R$ is the set of region proposals, $q$ represents the image-level class labels, $C$ is the set of object classes, and $p_{r,c}$ is the classification score for region $r$ belonging to class $c$.
2. Refinement Loss: Refinement is performed iteratively using pseudo-boxes generated from the MIL output. The loss at refinement step $m$ is

$$L_w^{(m)}(I, R, D^{(m-1)}) = L_{\text{cls}}^{(m)}(I, R, D^{(m-1)}) + L_{\text{reg}}^{(m)}(I, R, D^{(m-1)}), \qquad (2)$$

where $D^{(m-1)}$ represents the pseudo-boxes generated in the previous refinement step, $L_{\text{cls}}$ is the classification loss, and $L_{\text{reg}}$ is the bounding box regression loss.
3. Total Weak Supervision Loss: The overall loss for weak supervision is

$$L_w = L_{\text{MIL}} + \sum_{m=1}^{M} L_w^{(m)}, \qquad (3)$$

where $M$ is the total number of refinement steps. We adopt $M = 3$, in alignment with most WSOD method settings.
Fine-tuning Loss Calculation: In the fine-tuning phase with low-shot strong supervision (bounding box annotations available), the loss is calculated using ground-truth bounding boxes instead of pseudo-boxes.
1. Refinement Loss with Strong Annotations: Using ground-truth bounding boxes $G$, the loss for refinement step $m$ is

$$L_s^{(m)}(I, R, G) = L_{\text{cls}}^{(m)}(I, R, G) + L_{\text{reg}}^{(m)}(I, R, G). \qquad (4)$$
2. Total Strong Supervision Loss: The combined loss for strong supervision becomes

$$L_s = L_{\text{MIL}} + \sum_{m=1}^{M} L_s^{(m)}. \qquad (5)$$
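As a concrete illustration, Equation (1) maps onto a few lines of PyTorch. The sketch below is a minimal rendering of the MIL term only, assuming `region_scores` holds the proposal-class probabilities $p_{r,c}$ and `image_labels` the binary vector $q$:

```python
import torch

def mil_loss(region_scores: torch.Tensor, image_labels: torch.Tensor) -> torch.Tensor:
    """Minimal sketch of the MIL loss in Equation (1).

    region_scores: (R, C) proposal-class probabilities p_{r,c}
    image_labels:  (C,)  binary image-level labels q_c
    """
    # Aggregate proposal scores into an image-level score per class.
    image_scores = region_scores.sum(dim=0).clamp(1e-6, 1.0)  # (C,)
    # Negative log-likelihood over the classes present in the image.
    return -(image_labels * image_scores.log()).sum()

# Example: 5 proposals, 3 classes, classes 0 and 2 present in the image.
scores = torch.softmax(torch.randn(5, 3), dim=1) / 5  # keeps class sums in (0, 1]
print(mil_loss(scores, torch.tensor([1.0, 0.0, 1.0])))
```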

3.2. Part Domination-Based Adaptive Active Learning (PDAAL)

The purpose of PDAAL is to actively and adaptively select the most informative samples, which helps to alleviate part domination and class imbalance. PDAAL identifies hard samples in the training set $I_\alpha^{k-1}$ after completing the previous training stage using the model $M_{k-1}$. Specifically, these hard samples exhibit a part domination pattern in their predictions. PDAAL selects these hard samples in a class-balanced manner.
PDAAL is described below and summarized in Algorithm 1. Given a dataset $I_\alpha^{k-1} = (I_{\alpha,1}^{k-1}, I_{\alpha,2}^{k-1}, \ldots, I_{\alpha,n}^{k-1})$, the corresponding detection results are represented as $D^{k-1} = (D_1^{k-1}, D_2^{k-1}, \ldots, D_n^{k-1})$. For each sample $I_{\alpha,i}^{k-1}$, we analyze $D_i^{k-1}$ to determine whether it contains a part domination pattern.
Algorithm 1 Part domination-based adaptive active learning (PDAAL)

Require: Dataset $I_\alpha^{k-1} = (I_{\alpha,1}^{k-1}, \ldots, I_{\alpha,n}^{k-1})$ with corresponding detection results $D^{k-1} = (D_1^{k-1}, \ldots, D_n^{k-1})$, model $M_{k-1}$, adaptive thresholds $t_{\text{ada},c}$, selection budget $B$.
1: Initialize $P \leftarrow \emptyset$
2: for each sample $I_{\alpha,i}^{k-1}$ in $I_\alpha^{k-1}$ with detections $D_i^{k-1}$ in $D^{k-1}$ do
3:  for each pair of detection boxes $(d_p, d_q)$ in $D_i^{k-1}$ such that $\text{area}(d_p) < \text{area}(d_q)$ do
4:   if $d_p$ and $d_q$ satisfy the following conditions then
5:    ▹ Condition 1: $d_p$ and $d_q$ belong to the same category $c$.
6:    ▹ Condition 2: $\text{score}(d_p), \text{score}(d_q) \geq t_{\text{ada},c}$.
7:    ▹ Condition 3: $\text{area}(d_q)/\text{area}(d_p) \geq \mu$.
8:    ▹ Condition 4: $\text{intersection}(d_p, d_q)/\text{area}(d_p) \geq \delta$.
9:    Add $I_{\alpha,i}^{k-1}$ to $P$.
10:   end if
11:  end for
12: end for
13: Using the least-confidence method [41], select the top-$B$ samples from $P$ as $P_B$.
14: Remove the selected samples $P_B$ from $I_\alpha^{k-1}$: $I_\alpha^k \leftarrow I_\alpha^{k-1} \setminus P_B$.
15: Annotate and add the selected samples to $I_\beta^{k-1}$: $I_\beta^k \leftarrow I_\beta^{k-1} \cup P_B$.
16: return $I_\alpha^k$, $I_\beta^k$
Let $D_i^{k-1}$ contain $n_{\text{det},i}$ detection results, where the $j$-th detection box is denoted by $d_{ij}$. If two detection boxes satisfy the requirements for the part domination pattern, they are defined as $d_p$ and $d_q$, where the area of $d_p$ is smaller than that of $d_q$. These two boxes are considered to form a part domination pattern if the following conditions are met:
1. $d_p$ and $d_q$ belong to the same category $c$;
2. $\text{score}(d_p), \text{score}(d_q) \geq t_{\text{ada},c}$;
3. $\text{area}(d_q)/\text{area}(d_p) \geq \mu$, where $\mu = 3$;
4. $\text{intersection}(d_p, d_q)/\text{area}(d_p) \geq \delta$, where $\delta = 0.8$.
To achieve class-balanced sampling and prevent the excessive selection of samples from high-proportion categories, we introduce $t_{\text{ada},c}$ to adaptively filter the detection results. The computation of $t_{\text{ada},c}$ is shown in Equation (6). It can be observed that the higher the proportion of a category in the dataset, the larger its $t_{\text{ada},c}$, resulting in fewer samples eligible for part domination pattern computation.

$$t_{\text{ada},c} = 0.5 + \lambda \cdot \frac{1 - e^{-r_c}}{1 - e^{-1}}, \qquad (6)$$

where $r_c = n_c / N_{\max}$ is the ratio of the sample count $n_c$ of category $c$ to the maximum sample count $N_{\max}$ across all categories, and $\lambda = 0.3$, selected based on a hyperparameter experiment. The normalized exponential form is introduced so that classes with high proportions quickly receive higher confidence penalties.
Subsequently, samples that satisfy the part domination pattern are grouped into a set $P$. Using the least-confidence method [41], the top-$B$ samples are selected from $P$ for strong label annotation. These annotated samples are added to the set $I_\beta^{k-1}$ to form $I_\beta^k$.
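To make the pattern test concrete, the sketch below implements the pairwise check and the adaptive threshold of Equation (6) for axis-aligned boxes, with $\mu = 3$, $\delta = 0.8$, and $\lambda = 0.3$ as in the text; the box representation and data layout are assumptions made for illustration:

```python
import math
from itertools import combinations

MU, DELTA, LAMBDA = 3.0, 0.8, 0.3

def t_ada(n_c: int, n_max: int) -> float:
    """Adaptive per-class confidence threshold of Equation (6)."""
    r_c = n_c / n_max
    return 0.5 + LAMBDA * (1 - math.exp(-r_c)) / (1 - math.exp(-1))

def area(b):
    # b = (x1, y1, x2, y2)
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def intersection(b1, b2):
    w = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    h = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    return w * h

def has_part_domination(dets, class_counts, n_max) -> bool:
    """dets: list of (box, score, cls) detections for one sample."""
    for (b_p, s_p, c_p), (b_q, s_q, c_q) in combinations(dets, 2):
        if area(b_p) > area(b_q):  # ensure d_p is the smaller box
            (b_p, s_p, c_p), (b_q, s_q, c_q) = (b_q, s_q, c_q), (b_p, s_p, c_p)
        if (c_p == c_q                                                # condition 1
                and min(s_p, s_q) >= t_ada(class_counts[c_p], n_max)  # condition 2
                and area(b_q) / max(area(b_p), 1e-9) >= MU            # condition 3
                and intersection(b_p, b_q) / max(area(b_p), 1e-9) >= DELTA):  # condition 4
            return True
    return False
```

Samples flagged by this test form the candidate pool $P$, from which the top-$B$ least-confident samples are then drawn.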

3.3. Parameter-Efficient Attention for Context (PAC)

The PAC module captures spatial correlations by introducing context refinement along both the horizontal and vertical dimensions, which is inspired by [42,43]. In contrast to previous attention-based approaches that focus primarily on channel-wise gating or standard spatial transformations, PAC integrates context via average pooling and local cues through 1D depthwise convolutions [44]. This design provides a lightweight yet effective mechanism that can be seamlessly inserted into a convolutional backbone with negligible computational overhead. Moreover, PAC notably enhances the proposed PDEF framework by selectively strengthening or suppressing salient spatial features, thereby improving its performance under diverse visual conditions.
As shown in Figure 2, given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, the PAC module first employs a $7 \times 7$ average pooling layer:

$$X_{\text{avg}} = \text{AvgPool}_{7 \times 7}(X),$$

where the kernel size, stride, and padding are configured to maintain the original spatial resolution. Next, a $1 \times 1$ convolution refines the pooled features:

$$X_{1 \times 1} = \text{Conv}_{1 \times 1}(X_{\text{avg}}).$$

To extract horizontal and vertical contextual cues, $X_{1 \times 1}$ is passed through two 1D depthwise convolutions:

$$X_h = \text{HConv}(X_{1 \times 1}), \qquad X_v = \text{VConv}(X_h),$$

where HConv adopts a kernel size of $(1, K_h)$ and VConv adopts $(K_v, 1)$, with $K_h = K_v = 13$. The depthwise convolutions (DWConvs) have fewer parameters than standard convolution, and 1D strip convolution further reduces the parameter count compared to 2D convolution. This makes the module parameter-efficient, computationally efficient, and easy to train during fine-tuning. Additionally, average pooling combined with 1D strip DWConv enables the model to capture richer contextual information, alleviating context confusion and accelerating the correction of cognitive biases in the model. Both DWConvs use a group setting equal to the number of channels, preserving the channel dimension. Finally, another $1 \times 1$ convolution is applied, followed by a sigmoid function:

$$A = \sigma\left(\text{Conv}_{1 \times 1}(X_v)\right),$$

producing the attention factor $A$. In practice, $A$ can be multiplied element-wise with the original feature map $X$ (or a subsequent feature tensor) to emphasize or attenuate spatial regions in a context-aware manner. By combining horizontal and vertical cues in a channel-preserving manner, PAC provides an efficient means to improve feature representation quality in PDEF, which alleviates context confusion.
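The computation above translates directly into a small PyTorch module. The following is a minimal sketch under the stated settings ($7 \times 7$ average pooling with stride 1, $K_h = K_v = 13$ depthwise strip convolutions); it illustrates the structure rather than reproducing the exact released implementation:

```python
import torch
import torch.nn as nn

class PAC(nn.Module):
    """Minimal sketch of parameter-efficient attention for context (PAC)."""

    def __init__(self, channels: int, k: int = 13):
        super().__init__()
        # 7x7 average pooling with stride 1 / padding 3 keeps H x W unchanged.
        self.pool = nn.AvgPool2d(kernel_size=7, stride=1, padding=3)
        self.conv_in = nn.Conv2d(channels, channels, kernel_size=1)
        # 1D depthwise strip convolutions (groups = channels preserves channels).
        self.hconv = nn.Conv2d(channels, channels, kernel_size=(1, k),
                               padding=(0, k // 2), groups=channels)
        self.vconv = nn.Conv2d(channels, channels, kernel_size=(k, 1),
                               padding=(k // 2, 0), groups=channels)
        self.conv_out = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.conv_in(self.pool(x))
        a = self.vconv(self.hconv(a))            # horizontal, then vertical cues
        attn = torch.sigmoid(self.conv_out(a))   # attention factor A
        return x * attn                          # context-aware reweighting

# Example: reweight a VGG16-style feature map without changing its shape.
y = PAC(512)(torch.randn(1, 512, 64, 64))  # -> (1, 512, 64, 64)
```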

3.4. Adaptive Category Resampling for Tuning (ACRT)

The goal of ACRT is to enhance the error-correction capability of the original WSOD model during the fine-tuning phase using strong labels, without significantly increasing training costs. Due to the absence of instance-level labels in WSOD tasks, the initial model $M_0$ may develop substantial cognitive biases. Such biases are particularly prominent in RSIs due to several inherent challenges: the wide range of object scales, the complexity of background information, and the varied orientations of objects caused by different viewing angles. On the one hand, brief fine-tuning may not effectively correct such excessive noise, especially for under-performing categories and scenarios. On the other hand, prolonged fine-tuning may lead to the overfitting of the model to classes that already perform well. This limitation can hinder the fine-tuning process, preventing the fully supervised samples from providing maximum benefit during training. ACRT employs adaptive class resampling to address these issues.
Before commencing fine-tuning at stage k, ACRT evaluates the performance of M k 1 on the training set I β k 1 . It selects the bottom 10% of categories for which the model’s average precision (AP) is more than 50% lower than its best performance. For these categories, ACRT performs data augmentation-based resampling of the strong label data in I β k . The augmentation techniques include brightness stretching, rotation, blurring, color transformation, and random cropping; three of these methods are randomly selected in each resampling. If no categories meet the criteria, ACRT refrains from additional resampling to avoid unnecessary training costs.
Let

$$\text{AP}^{k-1}(c) = \text{AP of model } M_{k-1} \text{ on category } c \text{ using } I_\beta^{k-1}, \quad c \in \{1, 2, \ldots, N\},$$

$$\text{AP}_{\text{best}}^{k-1} = \max_{1 \leq c \leq N} \text{AP}^{k-1}(c).$$

We introduce a set of categories for which the model performs poorly:

$$\Omega^{k-1} = \left\{ c \mid \text{AP}^{k-1}(c) < 0.5 \times \text{AP}_{\text{best}}^{k-1} \right\}.$$
Among the categories in the bottom 10%, we include only those that satisfy the above criterion to form the final $\Omega^{k-1}$. For simplicity, we assume that $\Omega^{k-1}$ already represents the exact set of categories to be enhanced.
If $\Omega^{k-1} \neq \emptyset$, we perform data augmentation on the images of those categories in the training set $I_\beta^k$. Let the set of possible augmentations be

$$\mathcal{T} = \{B, R, L, C, S\},$$

where $B$ denotes brightness stretching; $R$, rotation; $L$, blurring; $C$, color transformation; and $S$, random cropping.
For each image $I$, three different augmentation operations are randomly selected from $\mathcal{T}$; this selection operation is denoted by $\text{RandomSubset}(\mathcal{T}, 3)$. The data augmentation function is defined as follows:

$$\text{Aug}(I) = \{ f_{t_1}(I), f_{t_2}(I), f_{t_3}(I) \}, \quad \{t_1, t_2, t_3\} = \text{RandomSubset}(\mathcal{T}, 3),$$

where $f_t$ represents the operator that applies augmentation $t$ to the image.
Hence, for each category $c \in \Omega^{k-1}$, every strong-labeled pair $(I, G) \in I_\beta^k|_c$ yields the augmented pairs $(\text{Aug}(I), G)$.
All newly augmented samples are combined with the original training data to form

$$\hat{I}_\beta^k = I_\beta^k \cup \bigcup_{c \in \Omega^{k-1}} \left\{ (\text{Aug}(I), G) \mid (I, G) \in I_\beta^k|_c \right\}.$$
If $\Omega^{k-1} = \emptyset$, no additional resampling is performed. Finally, the updated training set $\hat{I}_\beta^k$ is used to perform fine-tuning at stage $k$:

$$M_k = \text{FineTune}\left(M_{k-1}, \hat{I}_\beta^k\right).$$
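A minimal sketch of this selection-and-resampling step is given below. The per-class AP evaluation and the augmentation operator `apply_aug` are assumed helpers, and box coordinates would additionally need geometric updates for rotation and cropping, which the sketch omits:

```python
import random

AUGMENTATIONS = ["brightness", "rotation", "blur", "color", "crop"]  # the set T

def acrt_resample(per_class_ap: dict, strong_set: list) -> list:
    """Sketch of ACRT: oversample under-performing categories.

    per_class_ap: {category: AP of M_{k-1} measured on I_beta^{k-1}}
    strong_set:   list of (image, boxes, category) strong-label samples
    """
    ap_best = max(per_class_ap.values())
    n_bottom = max(1, len(per_class_ap) // 10)           # bottom 10% of classes
    bottom = sorted(per_class_ap, key=per_class_ap.get)[:n_bottom]
    omega = {c for c in bottom if per_class_ap[c] < 0.5 * ap_best}
    augmented = []
    for image, boxes, category in strong_set:
        if category in omega:
            # Three randomly chosen augmentations from T per resampled image;
            # apply_aug is a hypothetical helper applying one named operation.
            for op in random.sample(AUGMENTATIONS, 3):
                augmented.append((apply_aug(image, op), boxes, category))
    return strong_set + augmented  # the updated training set for stage k
```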

4. Experiments

4.1. Experimental Setup

4.1.1. Datasets and Evaluation Metrics

To evaluate the performance of PDEF, we conducted experiments using two widely recognized datasets in the field of WSOD in RSIs: NWPU VHR-10.v2 and DIOR. These datasets provide comprehensive scenarios to rigorously test the performance of PDEF across diverse remote sensing contexts.
The NWPU VHR-10.v2 (Northwestern Polytechnical University Very High Resolution) dataset is derived from the NWPU VHR-10 dataset [18], designed for object detection tasks in high-resolution remote sensing images. It comprises training and testing sets of 879 and 293 images, respectively, and 3599 annotated object instances with horizontal bounding boxes across 10 categories. The dataset’s images are derived from various high-resolution sources and feature diverse backgrounds and environmental conditions. NWPU VHR-10.v2 was derived from NWPU VHR-10 by cropping the images to a fixed size (400 × 400 pixels), and it is widely used for weakly supervised object detection algorithms in RSIs.
The DIOR (object detection in optical remote sensing) dataset is a large-scale benchmark designed to address the challenges of object detection in remote sensing images [19]. It contains 11,725 images for training, 11,738 images for testing, and 192,472 annotated object instances with horizontal bounding boxes across 20 categories. The dimensions of the images are fixed at 800 × 800 pixels. DIOR contains images with diverse conditions, weather, seasons, and spatial resolutions, offering high inter-class similarity and intra-class variability. The dataset also features a wide range of object scales, making it ideal for evaluating algorithm performance for targets with significant size variations. DIOR serves as a comprehensive resource for the development and validation of deep learning-based object detection methods for RSIs.
Average precision (AP) and the mean of AP (mAP) are adopted as evaluation metrics to illustrate the detection performance of our PDEF and other detection models. AP and mAP are defined as follows:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN},$$
  • T P (true positive): The count of actual positive samples that were accurately classified as positive.
  • F P (false positive): The count of actual negative samples that were mistakenly classified as positive.
  • F N (false negative): The count of actual positive samples that were incorrectly classified as negative.
$$\text{AP} = \sum_n (R_n - R_{n-1}) \cdot P_n,$$

where $R_n$ and $P_n$ are the recall and precision at the $n$-th threshold, respectively. Since both OICR [2] and multiple instance self-training (MIST) [21] were evaluated using the VOC07 [45] protocol, which calculates AP using an 11-point interpolation method, we also followed this discrete AP calculation method in our experiments.
$$\text{mAP} = \frac{1}{N} \sum_{i=1}^{N} \text{AP}_i,$$

where $N$ is the number of categories in the dataset, and $\text{AP}_i$ is the average precision for the $i$-th category.
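Under the VOC07 11-point protocol, AP averages the maximum precision attained at 11 evenly spaced recall thresholds. A minimal sketch, assuming `recall` and `precision` are arrays sorted by descending detection confidence:

```python
import numpy as np

def voc07_ap(recall: np.ndarray, precision: np.ndarray) -> float:
    """11-point interpolated AP as used in the VOC07 protocol."""
    ap = 0.0
    for t in np.arange(0.0, 1.1, 0.1):
        mask = recall >= t
        # Maximum precision at recall >= t; 0 if that recall is never reached.
        p = float(precision[mask].max()) if mask.any() else 0.0
        ap += p / 11.0
    return ap
```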

4.1.2. Implementation Details

Experimental environment: Table 1 presents the hardware and software configurations used for our experiments, including an i7-13700KF CPU, 16 GB DDR5 RAM, and an RTX 4090 GPU running on Ubuntu 22.04.4. We implemented our models in Python 3.8 using PyTorch 2.2, with CUDA 11.8 providing GPU acceleration.
Network architecture: VGG16 [46] serves as the backbone of our model, as it is utilized in the majority of WSOD and LS-WSOD algorithms. Given the outstanding performance and stability of MIST [21] in WSOD tasks, we selected MIST as the baseline model. Furthermore, the PAC module is integrated before the model’s neck to enhance the feature maps, thereby improving the model’s ability to capture contextual information.
Training and testing setup: We used the official splits provided by the NWPU VHR-10.v2 and DIOR datasets to partition the data into training, validation, and test sets. For data preprocessing and augmentation, samples were resized to scales of (480, 576, 688, 864, 1000, 1200), and horizontal flipping augmentation was applied to improve the robustness and performance of PDEF. We employed the stochastic gradient descent (SGD) optimizer with a learning rate of $5 \times 10^{-5}$, a warm-up phase followed by decay, and a momentum of 0.9 to stabilize and accelerate training. The batch size was 1. The fine-tuning process was divided into five stages. As illustrated in Section 3.2, PDEF selects $B$ samples for strong label annotation at each stage. The budget $B$ is set to 20 for NWPU VHR-10.v2 and 40 for DIOR. The fine-tuning step size is determined by the number of strong labels, with each strong label sample undergoing 300 iterations. During inference, non-maximum suppression (NMS) is applied to prevent the model from introducing excessive boxes. All experiments were repeated three times on different base models to ensure robustness, and the results were averaged to mitigate randomness.
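For illustration, the optimizer setup can be written as follows; the warm-up length and decay schedule in this sketch are assumptions chosen for the example, not values reported in the paper:

```python
import torch

model = torch.nn.Linear(4, 2)  # stand-in for the detector backbone

# SGD with the stated learning rate and momentum.
optimizer = torch.optim.SGD(model.parameters(), lr=5e-5, momentum=0.9)

WARMUP_ITERS, DECAY_STEP = 500, 10000  # hypothetical schedule parameters

def lr_lambda(it: int) -> float:
    if it < WARMUP_ITERS:
        return (it + 1) / WARMUP_ITERS   # linear warm-up
    return 0.1 ** (it // DECAY_STEP)     # step decay afterwards

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```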

4.2. Ablation Study

An ablation study was carried out with different module combinations to validate the effectiveness of our proposed modules: PDAAL, PAC, and ACRT. Additionally, we compared PDAAL with the random sampling method; in the attention module, we compared PAC with CBAM [16]. The experiments were performed on the NWPU VHR-10.v2 and DIOR datasets using five base models, and the average results are reported in Table 2 and Table 3.
In the experiments on the NWPU VHR-10.v2 dataset, method A, employing only a random sampling strategy, served as the baseline model, which resulted in an mAP of 76.2%, showing relatively weak performance. Method B (PDAAL only) incorporates PDAAL as the active sampling strategy, which raises the mAP to 77.6%, highlighting PDAAL’s significant contribution to hard sample selection and balanced sampling. Methods D (PAC only) and E (ACRT only) introduce the PAC and ACRT modules, respectively, improving the mAP to 77.4% and 77.5%. This demonstrates that PAC enhances contextual understanding, while ACRT excels in mining samples of under-performing classes. Experiments combining multiple modules further reveal substantial synergy: method F (PDAAL + PAC) and method G (PDAAL + ACRT) achieve mAPs of 78.5% and 78.7%, respectively, showcasing the strong collaborative potential of PDAAL with other modules. Finally, method I (PDAAL + PAC + ACRT) achieves the highest mAP of 79.8% by integrating all modules, validating the complementary nature of the modules and their effectiveness in joint optimization.
The ablation study on the DIOR dataset revealed a similar trend to that on NWPU VHR-10.v2. The baseline model, method A, achieves an mAP of 41.5%. When adding PDAAL alone (method B), the mAP increases to 43.3%, underscoring PDAAL’s versatility and effectiveness across different datasets. Methods D (PAC only) and E (ACRT only) raise the mAP to 42.4% and 42.9%, respectively, confirming the independent contributions of PAC and ACRT. Methods F (PDAAL + PAC) and G (PDAAL + ACRT) reach mAPs of 44.0% and 44.5%, demonstrating the superior performance of module collaboration. Finally, method I (PDAAL + PAC + ACRT) achieves the highest mAP of 45.4%, significantly outperforming other combinations, indicating the comprehensive effectiveness of PDAAL, PAC, and ACRT.
A comparison of methods C (CBAM only) and D (PAC only) shows that, on the NWPU dataset, method C achieves an mAP of 76.7%, while method D achieves 77.4%. On the DIOR dataset, method C reaches an mAP of 42.1%, compared to 42.4% for method D. This comparison suggests that PAC outperforms CBAM in feature extraction, especially on the NWPU dataset. Moreover, the advantages of PAC are further validated in multi-module combinations, such as method H (PAC + ACRT). Method H achieves mAPs of 78.5% and 43.6% on the NWPU and DIOR datasets, respectively, significantly surpassing the independent performance of method C. Although CBAM contributes to feature attention allocation, its performance improvement is limited compared to PAC, indicating that PAC is better suited for the feature requirements of remote sensing images and the field of LS-WSOD fine-tuning.
We also employed the t-test to verify the differences resulting from the incorporation of our modules. Using the performance metrics obtained on the NWPU VHR-10.v2 and DIOR datasets, we performed t-tests comparing method set A (baseline) and method sets B (PDAAL only), D (PAC only), E (ACRT only), and I (PDAAL + PAC + ACRT). The resulting p-values were all below 0.05, indicating that each of our modules, whether integrated individually or combined, clearly improves performance. However, when performing a t-test comparing method sets C (CBAM only) and D (PAC only), the p-values obtained on NWPU VHR-10.v2 and DIOR are 0.12 and 0.32, which are over 0.1. This is primarily because CBAM itself provides some improvement (+0.5% and +0.6%), which results in a smaller performance gap between sets C and D. Moreover, the t-test requires more experimental repetitions for clearer differentiation. Nevertheless, our PAC module achieves a greater performance improvement (+1.2% and +0.9%) and generates superior attention maps compared to CBAM, as further discussed in Section 4.4.2.
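These significance tests can be reproduced with a paired t-test over repeated-run mAPs; the sketch below uses hypothetical per-run values purely for illustration:

```python
from scipy.stats import ttest_rel

# Hypothetical per-run mAPs (%) for illustration only: baseline (A) vs. full model (I).
map_a = [76.0, 76.3, 76.4]
map_i = [79.6, 79.9, 79.8]

stat, p_value = ttest_rel(map_i, map_a)  # paired t-test across matched runs
print(f"t = {stat:.2f}, p = {p_value:.4f}")  # p < 0.05 -> significant improvement
```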

4.3. Comparisons

4.3.1. Comparisons with the State of the Art

Table 4 shows the mAPs of various weakly supervised and low-shot weakly supervised object detection methods on the NWPU VHR-10.v2 dataset. “WSOD” refers to methods that use only image-level labels (such as WSDDN, OICR), while “LS-WSOD” denotes methods that use image-level labels and 200 additional fully annotated samples (e.g., BCNet, ITNet, and Ours).
From Table 4, we see that even the best WSOD methods (such as IMIN) achieve an mAP of only 65.2%, indicating room for improvement. By contrast, the low-shot weakly supervised methods BCNet and ITNet reach 70.5% and 75.3% mAP, respectively, far exceeding the performance of the WSOD approaches. This shows that a small number of fully annotated data samples play a vital role in boosting detection performance.
With the same low-shot setting (Budget 100), our PDEF achieves an mAP of 79.8%, improving upon BCNet and ITNet by 9.3% and 4.5%, respectively. The per-class results show that our approach has clear advantages in multiple categories. For example, its AP rises from 15.7% (ITNet) to 24.8% for the Bridge category and from 78.0% to 91.4% for Harbor. The AP for the Vehicle category also increases by 12.1%.
Table 5 presents the detection results of both the WSOD and LS-WSOD methods on the DIOR dataset, which contains 20 categories. For easier viewing, Table 5 is split into two parts: the first 10 categories are shown in the top half, and the last 10 categories plus the final mAP are shown in the bottom half.
In the “WSOD” section, classical WSOD methods (such as WSDDN, OICR, and TCANet) have relatively low overall performance on DIOR, with most mAP values being around 20–30%. The highest in this group is IMIN, reaching only 29.1%. When we add 200 fully labeled images, BCNet and ITNet show improved mAPs of 36.1% and 42.3%, respectively, demonstrating the importance of limited fully annotated data for complex object detection tasks in RSIs.
In comparison, our PDEF reaches an mAP of 45.4%, which is 9.3% and 3.1% higher than those of BCNet and ITNet, respectively. Our method also performs better on most single categories. Our improvements are especially noticeable for Overpass, Ship, Storage Tank, and Vehicle—categories that are often more complex or confusing. For instance, the AP for Overpass increases from 26.8% (BCNet) or 30.5% (ITNet) to 40.8%, while that for Ship jumps from 21.0% or 26.7% to 58.2%. These findings show that combining a small number of fully labeled samples with an improved weakly supervised detection strategy can effectively handle diverse and complicated targets in RSIs.
Observing the mAP results on NWPU VHR-10.v2 and DIOR, we can further analyze the contributions of the different modules. First, ACRT performs adaptive resampling for low-performance classes during fine-tuning, thereby improving performance for those classes. Compared to LS-WSOD methods such as ITNet, PDEF shows improvements for classes with relatively low performance, such as Bridge in NWPU VHR-10.v2 and Bridge, Dam, and Train Station in DIOR. Second, the contribution of PAC to performance improvements in classes with severe context confusion, such as Bridge, Dam, and Overpass, is also significant. By focusing on generating attention weights for contextual information, PAC enables the model to better distinguish between objects in these classes and their accompanying backgrounds. However, despite PDEF achieving considerable improvements in detection performance compared to previous work, there remains room for further enhancement. Most of the failed detections arise from low-resolution samples, in which the objects are too small and the backgrounds are too extensive. Therefore, in future work, we can focus on multiscale feature fusion or introduce a scale classifier for remote sensing images to improve how the model handles such samples. In summary, on both the NWPU VHR-10.v2 and DIOR datasets, our PDEF outperforms not only WSOD approaches but also state-of-the-art LS-WSOD methods using the same weak labels plus a limited strong label budget. These results confirm that LS-WSOD can lead to higher performance and demonstrate the effectiveness of our proposed approach.

4.3.2. Comparison of Different Active Learning Strategies

In this part, we compare the performance of our proposed PDEF with that of other active learning strategies. To ensure fairness, we used MIST as the baseline model and evaluated both the standalone PDAAL method and PDAAL + PAC + ACRT (ensemble) in comparison with other active learning methods combined with MIST. The active learning methods included in the comparison are as follows: (1) u-random, which performs uniform random sampling for strong label annotation; (2) b-random, which performs balanced sampling to ensure uniform selection across all classes; (3) entropy-sum, which calculates the total entropy of the predicted boxes in each image to determine sample uncertainty and selects the most uncertain samples; (4) entropy-max, which calculates the maximum entropy of the predicted boxes in each image to determine sample uncertainty and selects the most uncertain samples; (5) core-set, which follows the approach described in [50] for sample selection based on a core-set strategy; (6) core-set-ent, which follows the core-set approach described in [51] with additional strong label annotations; and (7) box-in-box (BiB), which follows the active strategy proposed by [39]. All experiments were conducted on three different base models, and the average results across the three experiments are reported.
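For reference, the two entropy baselines score each image from its per-box class distributions. A minimal sketch, assuming `probs` is a (boxes × classes) array of predicted probabilities:

```python
import numpy as np

def box_entropies(probs: np.ndarray) -> np.ndarray:
    """Per-box entropy from a (num_boxes, num_classes) probability array."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def entropy_sum(probs: np.ndarray) -> float:
    return float(box_entropies(probs).sum())  # total uncertainty in the image

def entropy_max(probs: np.ndarray) -> float:
    return float(box_entropies(probs).max())  # most uncertain single box
```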
As shown in Figure 3, on the NWPU VHR-10.v2 dataset, our proposed PDEF (ensemble) method demonstrates superior performance, consistently leading across all stages. Its mAP steadily improves with each stage, showcasing excellent overall performance. In comparison, the PDAAL method, though slightly behind PDEF, exhibits a very stable improvement trend.
As shown in Figure 4, our PDEF also performs exceptionally well on the DIOR dataset. In the first stage, its mAP is significantly higher than those of other methods and continues to increase substantially with each stage, ultimately reaching the highest level. PDAAL’s performance on the DIOR dataset is similar to its results on the NWPU VHR-10.v2 dataset. Although slightly inferior to PDEF, its improvement trend remains consistent and steady. Alongside PDEF, it significantly outperforms other methods across all stages, demonstrating its ability to maintain efficient learning and highlighting its robustness and broad applicability.
The significant performance improvements seen in the ensemble and PDAAL methods arise because the PDAAL strategy effectively adapts to the LS-WSOD task in RSIs. PDAAL can efficiently select challenging samples exhibiting the part domination pattern at each stage, while also accounting for class imbalance. As a result, given the same budget, PDAAL better addresses issues related to part domination and class imbalance compared to other strategies. Meanwhile, the ensemble approach incorporates PAC and ACRT, placing emphasis on mitigating context confusion and model noise, thus further enhancing the model’s ability to utilize the samples selected by PDAAL. These methods complement each other to jointly enhance model performance. Overall, the PDEF method achieves the best performance on both datasets, reflecting its strong active learning selection strategy, excellent feature-learning capability, and superior fine-tuning for base model correction. PDAAL, as an effective active learning strategy, ranks just behind PDEF in performance. Both methods exhibit significant performance improvements across multiple stages, validating their application value in active learning. Specifically, the PDEF method is ideal for scenarios requiring the highest performance, while PDAAL, through comparative analysis, is proven to be an efficient active learning strategy tailored for LS-WSOD in RSIs.

4.3.3. Comparison of Computational Performance

Table 6 summarizes the inference efficiency of our proposed PDEF model compared with four state-of-the-art weakly supervised object detection methods on the DIOR test dataset. We compare the methods in terms of mAP, the number of parameters (in millions), floating-point operations (FLOPs, in billions), and latency (in seconds). Among these methods, PDEF achieves the highest detection accuracy at 45.4% mAP. In terms of model complexity, PDEF contains 135.5 M parameters, which is similar to BCNet and ITNet (134.7 M). PDEF does not have an advantage from the perspective of FLOPs and latency, which are 2.45× and 1.62× those of OICR, respectively. However, OICR is a WSOD method and does not incorporate instance-level labels, resulting in significantly lower mAP performance (16.5%). Our method’s FLOPs and latency are 1.13× and 1.14× those of the LS-WSOD methods (BCNet and ITNet), respectively. On this basis, our method achieves a clear performance improvement, making the computational cost of inference acceptable. To reduce the computational cost during inference, future research can focus on retraining with lightweight models. Overall, our approach effectively balances accuracy and computational overhead, highlighting its potential for large-scale, real-world object detection tasks under low-shot weak supervision.
Another point worth mentioning is the training cost of PDEF. Based on the experimental environment in Section 4.1.2, training the pretraining model with PDEF on the DIOR dataset takes approximately 13 h, while fine-tuning with low-shot strong labels requires an additional 7 h (including time for PDAAL sample selection and ACRT performance evaluation). Active learning selects samples to annotate by measuring model uncertainty and analyzing output distributions, which inevitably increases computational overhead. However, active learning enhances the cost-effectiveness of manual labeling. With advancements in active learning techniques and hardware computing performance, active learning will become increasingly important and practical.

4.3.4. Comparison Between Image Budget and Box Budget

In this part, we discuss the impact of using different units as the budget in active learning. In traditional LS-WSOD tasks, the budget is typically based on the number of images, such as 200 in total or 40 images per stage for strong label annotation. However, the number of objects in different samples varies significantly. For instance, in the DIOR training dataset, the number of objects per sample ranges from 1 to about 30, with an average of 5.8. If certain active learning strategies tend to select samples with more objects for strong label annotation, the model will gain access to more trainable positive samples using the same image budget. This could lead to an unfair comparison.
To address this, we conducted experiments using the number of ground-truth bounding boxes as the budget instead of the number of images. Using MIST as the base model, we tested various active learning methods on the DIOR dataset. In each experiment, we set the number of stages to 5. Referring to the average number of objects per sample in the DIOR training set, we set the bounding box budget per stage to 240. Table 7 reports the number of box or image samples that each active learning method can select per stage, as well as the corresponding mAP performance in both the box-based and image-based budget settings.
The results show that, compared to the image-based budget, all methods experience performance degradation when using the box-based budget. Notably, the entropy-sum and core-set methods exhibit the largest performance drops, decreasing by 3.2% and 2.5%, respectively. It is also evident that, under the box-based budget constraint, these two methods select the fewest image samples per stage, as they tend to prefer samples with more objects.
In contrast, although our PDAAL method also encounters a performance decline, the decrease is relatively limited. PDAAL remains the best-performing method among all active learning strategies. This further demonstrates the effectiveness and robustness of our PDAAL approach, showing that it can identify the most challenging samples while maintaining a more balanced selection of samples.
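To make the box-budget accounting concrete, the sketch below pairs a sum-of-entropies score, which grows with the number of detections and therefore favors object-rich images, with a greedy loop that fills a fixed per-stage box budget. The scoring function and field names are assumptions for illustration, not the exact experimental setup:

```python
import math

def entropy(probs):
    """Shannon entropy of a class-probability vector."""
    return -sum(p * math.log(p + 1e-12) for p in probs)

def entropy_sum_score(detections):
    # Summing per-detection entropies grows with the number of detections,
    # which is why this criterion gravitates toward object-rich images.
    return sum(entropy(d["class_probs"]) for d in detections)

def select_with_box_budget(ranked_images, box_budget=240):
    """ranked_images: (image_id, num_gt_boxes) pairs sorted by score
    (descending). Greedily take images until the box budget is filled."""
    picked, used = [], 0
    for image_id, n_boxes in ranked_images:
        if used + n_boxes <= box_budget:
            picked.append(image_id)
            used += n_boxes
    return picked, used
```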

4.4. Qualitative Results

4.4.1. Strong Samples Selected by PDAAL

Figure 5 illustrates some samples selected by PDAAL on the DIOR dataset, along with the corresponding inference results of the model at the last stage. PDAAL successfully selects many part-dominated samples, for instance, detections that collapse onto part of an airplane fuselage or a portion of a golf course. This demonstrates that our proposed PDAAL method effectively identifies the most challenging samples in WSOD tasks for strong-label annotation, thereby maximizing the efficiency of manual annotation.

4.4.2. Attention Maps

Figure 6 shows spatial attention maps obtained by employing CBAM or PAC as the attention module. The examples include overpasses, tennis courts, airplanes, and ships combined with harbors as detection targets. The comparison clearly demonstrates that our proposed PAC module produces more precise attention maps, whereas the attention generated by CBAM tends to aggregate broadly, covering significant portions of irrelevant background regions. By contrast, PAC can more effectively distinguish between foreground and background features, thereby alleviating context confusion.
This advantage arises because PAC's 1D convolution modules capture fine-grained features, allowing the model to focus more accurately on object regions, whereas CBAM produces coarse-grained, broadly aggregated attention. Furthermore, PAC first uses 1 × 1 convolutions to fuse channel information and then applies horizontal and vertical spatial convolutions, efficiently capturing local spatial contextual relationships. Capturing this spatial context further improves the model's ability to distinguish objects from backgrounds and reduces irrelevant background responses.
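Based solely on this description, a PAC-style attention block might look like the sketch below: a 1 × 1 channel-fusion convolution followed by horizontal (1 × K) and vertical (K × 1) convolutions that output a single-channel spatial attention map. The reduction ratio, activations, and module name are our assumptions, not the paper's released code:

```python
import torch
import torch.nn as nn

class PACLikeAttention(nn.Module):
    """Hypothetical PAC-style spatial attention, reconstructed from the
    textual description only: 1x1 channel fusion, then horizontal and
    vertical 1D spatial convolutions producing an attention map."""

    def __init__(self, channels, k_h=13, k_v=13, reduction=16):
        super().__init__()
        mid = max(channels // reduction, 1)
        self.fuse = nn.Conv2d(channels, mid, kernel_size=1)    # fuse channel info
        self.conv_h = nn.Conv2d(mid, mid, kernel_size=(1, k_h),
                                padding=(0, k_h // 2))         # horizontal context
        self.conv_v = nn.Conv2d(mid, 1, kernel_size=(k_v, 1),
                                padding=(k_v // 2, 0))         # vertical context
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        a = self.act(self.fuse(x))
        a = self.act(self.conv_h(a))
        a = torch.sigmoid(self.conv_v(a))  # (B, 1, H, W) attention in [0, 1]
        return x * a                       # reweight the input features

# e.g., feats = torch.randn(2, 512, 64, 64); out = PACLikeAttention(512)(feats)
```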

4.4.3. Results on NWPU VHR-10.v2 and DIOR

Figure 7 presents the object detection results of our proposed PDEF on the NWPU VHR-10.v2 dataset compared with the baseline MIST. From left to right, the detection effects of four object categories—Airplane, Baseball Field, Bridge, and Ship—are displayed. The comparison shows that PDEF demonstrates superior performance in object detection. First, for airplane detection, PDEF aligns more accurately with the ground-truth bounding boxes while avoiding some false positives observed in the baseline method. For baseball field detection, PDEF also achieves closer alignment with the ground truth, with more precise bounding in fine-grained regions such as the infield and outfield. Regarding bridges and ships, PDEF reduces the excessive redundant boxes in the baseline results, showcasing stronger localization capabilities and boundary consistency. This indicates that the PDEF method is more robust in handling complex backgrounds and occluded scenes.
Similarly, Figure 8 illustrates the detection results on the DIOR dataset. From left to right, the detection effects of four object categories—Baseball Field and Basketball Court, Ship, Dam, and Chimney—are shown. It can be observed that PDEF is closer to the ground truth in most scenarios, demonstrating higher detection recall. For baseball field and basketball court detection, PDEF exhibits stronger separation capabilities for multiple objects, such as basketball and tennis courts, avoiding the excessive overlap of boxes seen in the baseline method. In ship detection, PDEF significantly reduces the false negatives and false positives in the baseline results, especially in densely arranged ship areas, where the bounding boxes capture more objects and are clearer. For dams, PDEF better preserves the integrity of the objects, whereas the baseline method shows imprecise bounding box edges. Finally, in chimney detection, the PDEF method provides finer predictions of object boundaries, avoiding the part-dominated boxes seen in the baseline results and demonstrating its superiority. This shows that PDEF achieves significant improvements in robustness and performance on the DIOR dataset.
The detection result comparisons on NWPU VHR-10.v2 and DIOR reveal several phenomena worth noting. First, PDEF clearly reduces false positives in the background: the introduction of PAC enhances the model's understanding of context, improving its ability to distinguish foreground from background. Second, the part domination phenomenon is significantly alleviated; compared with the baseline, the proposed model detects objects more completely rather than only their parts, because PDAAL lets the model learn sufficiently from samples exhibiting the part-domination pattern. Third, PDEF greatly reduces missed detections, as ACRT dynamically identifies insufficiently trained classes during fine-tuning and reinforces training for them. Finally, the high-quality strong-label samples provided by PDAAL and the feature enhancement from PAC yield true positive detections with higher confidence.

4.5. Hyperparameters

Table 8 and Table 9 present the results of hyperparameter experiments for the PDAAL and PAC modules, aiming to determine the optimal combination. The DIOR dataset was used in these experiments, during which only the relevant module (PDAAL or PAC) was added to MIST. From the results, the optimal hyperparameters are λ = 0.3, μ = 3, δ = 0.8, and K_h = K_v = 13.
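A minimal sketch of how such a sweep could be replayed is given below; evaluate_mAP stands in for one full training-plus-evaluation run per configuration and is purely a placeholder:

```python
def tune(evaluate_mAP):
    """Replay of the sweeps in Tables 8 and 9; evaluate_mAP is a placeholder
    for one full training-plus-evaluation run per configuration."""
    pdaal_grid = [(0.2, 2, 0.7), (0.3, 2, 0.7), (0.3, 3, 0.8),
                  (0.4, 3, 0.8), (0.3, 5, 0.9)]         # (lambda, mu, delta)
    best_pdaal = max(pdaal_grid, key=evaluate_mAP)      # -> (0.3, 3, 0.8)
    kernel_grid = [(k, k) for k in (7, 9, 11, 13, 15)]  # (K_h, K_v)
    best_kernel = max(kernel_grid, key=evaluate_mAP)    # -> (13, 13)
    return best_pdaal, best_kernel
```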

5. Conclusions

In this article, we proposed PDEF, a novel active learning framework for LS-WSOD in RSIs. To address the prominent challenges in WSOD and LS-WSOD for RSIs (part domination, context confusion, class imbalance, and noise), we introduced PDAAL, PAC, and ACRT. Specifically, PDAAL mitigates the part domination issue by actively and evenly sampling challenging instances for annotation. PAC incorporates an efficient parametric module that focuses on contextual information during fine-tuning, alleviating context confusion and accelerating the correction of errors in the original model. ACRT evaluates the model's per-category detection performance during training and resamples challenging categories to correct biases toward under-performing classes. Our PDEF achieves state-of-the-art mAP on NWPU VHR-10.v2 (79.8%) and DIOR (45.4%), outperforming the previous state-of-the-art method by 4.5% and 3.1%, respectively. Extensive experiments demonstrate the effectiveness of the proposed framework.
Furthermore, PDEF provides new insights and paradigms for practical remote sensing object detection tasks. In real-world RSI engineering scenarios, dataset processing typically involves cleaning, slicing, and progressive annotation. After slicing, image-level annotations can be obtained cheaply for WSOD tasks, and during progressive annotation, a combination of abundant image-level and limited instance-level annotations can be used for LS-WSOD. This enables the development of high-performing models in the early stages of a task, facilitating a soft launch. In addition, incorporating active learning strategies maximizes annotation efficiency and returns.
Despite these encouraging results, several limitations remain. First, the proposals used in our method are still generated by traditional approaches, such as selective search [52]. These methods produce an excessive number of proposals with insufficient precision, thereby affecting both efficiency and performance. Second, the proposed LS-WSOD method is unable to predict rotated bounding boxes, which hampers its detection performance for densely arranged objects in remote sensing images.
One promising avenue for future research is to optimize the quality and quantity of proposals to enhance the efficiency and performance of LS-WSOD methods. For example, the Segment Anything Model (SAM) can be used to generate proposals [6], as sketched below. Another potential direction lies in exploring rotated bounding box detection to improve performance for elongated and densely arranged objects, making the approach more suitable for object detection in RSIs. By pursuing these directions, we aim to move LS-WSOD toward a more practical stage and further close the gap with fully supervised object detection in remote sensing images.
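As a rough illustration of the first direction, the sketch below uses the public segment-anything package to convert SAM's automatic masks into horizontal proposal boxes; the checkpoint path and helper function are placeholders and are not part of PDEF:

```python
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# "sam_vit_b.pth" is a placeholder path to a downloaded SAM checkpoint.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

def sam_proposals(image_rgb):
    """Convert SAM's automatic masks into (x1, y1, x2, y2) proposal boxes.
    image_rgb: HxWx3 uint8 numpy array."""
    masks = mask_generator.generate(image_rgb)
    boxes = []
    for m in masks:
        x, y, w, h = m["bbox"]  # SAM reports boxes in XYWH format
        boxes.append((x, y, x + w, y + h))
    return boxes
```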

Author Contributions

Conceptualization, P.L. and T.J.; methodology, P.L. and B.H.; supervision, B.H., T.J. and H.L.; writing—original draft, P.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The NWPU VHR-10.v2 dataset is available at https://drive.google.com/drive/folders/1uYEuJIqCfkzkvmum0pLYVSFBbx6hZe1_ (accessed on 15 January 2025). The DIOR dataset is available at https://drive.google.com/drive/folders/1ki6FXvTMDJrx5l66GG_85vfMkF5HLUKd (accessed on 15 January 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bilen, H.; Vedaldi, A. Weakly supervised deep detection networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2846–2854. [Google Scholar]
  2. Tang, P.; Wang, X.; Bai, X.; Liu, W. Multiple instance detection network with online instance classifier refinement. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2843–2851. [Google Scholar]
  3. Tang, P.; Wang, X.; Bai, S.; Shen, W.; Bai, X.; Liu, W.; Yuille, A. Pcl: Proposal cluster learning for weakly supervised object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 42, 176–191. [Google Scholar] [CrossRef] [PubMed]
  4. Wan, F.; Wei, P.; Han, Z.; Jiao, J.; Ye, Q. Min-Entropy Latent Model for Weakly Supervised Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1297–1306. [Google Scholar]
  5. Dong, B.; Huang, Z.; Guo, Y.; Wang, Q.; Niu, Z.; Zuo, W. Boosting weakly supervised object detection via learning bounding box adjusters. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2876–2885. [Google Scholar]
  6. Zhu, L.; Zhou, J.; Liu, Y.; Hao, X.; Liu, W.; Wang, X. Weaksam: Segment anything meets weakly-supervised instance-level recognition. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 7947–7956. [Google Scholar]
  7. Han, J.; Zhang, D.; Cheng, G.; Guo, L.; Ren, J. Object detection in optical remote sensing images based on weakly supervised learning and high-level feature learning. IEEE Trans. Geosci. Remote Sens. 2014, 53, 3325–3337. [Google Scholar]
  8. Feng, X.; Han, J.; Yao, X.; Cheng, G. Progressive contextual instance refinement for weakly supervised object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8002–8012. [Google Scholar]
  9. Feng, X.; Han, J.; Yao, X.; Cheng, G. TCANet: Triple context-aware network for weakly supervised object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2020, 59, 6946–6955. [Google Scholar] [CrossRef]
  10. Tan, Z.; Jiang, Z.; Guo, C.; Zhang, H. WSODet: A Weakly Supervised Oriented Detector for Aerial Object Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5604012. [Google Scholar] [CrossRef]
  11. Qian, X.; Lin, C.; Chen, Z.; Wang, W. SAM-Induced Pseudo Fully Supervised Learning for Weakly Supervised Object Detection in Remote Sensing Images. Remote Sens. 2024, 16, 1532. [Google Scholar] [CrossRef]
  12. Wang, J.; Wang, X.; Liu, W. Weakly-and semi-supervised faster R-CNN with curriculum learning. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 2416–2421. [Google Scholar]
  13. Pan, T.; Wang, B.; Ding, G.; Han, J.; Yong, J.H. Low Shot Box Correction for Weakly Supervised Object Detection. In Proceedings of the IJCAI, Macao, China, 10–16 August 2019; pp. 890–896. [Google Scholar]
  14. Biffi, C.; McDonagh, S.; Torr, P.; Leonardis, A.; Parisot, S. Many-Shot from Low-Shot: Learning to Annotate Using Mixed Supervision for Object Detection. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 35–50. [Google Scholar]
  15. Liu, P.; Pan, Z.; Lei, B.; Hu, Y. ITNet: Low-Shot Instance Transformation Network for Weakly Supervised Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5606513. [Google Scholar] [CrossRef]
  16. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  17. Li, Y.; Wang, H.; Katsaggelos, A. CPDR: Towards Highly-Efficient Salient Object Detection via Crossed Post-decoder Refinement. arXiv 2025, arXiv:2501.06441. [Google Scholar]
  18. Li, K.; Cheng, G.; Bu, S.; You, X. Rotation-Insensitive and Context-Augmented Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2018, 56, 2337–2348. [Google Scholar] [CrossRef]
  19. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar]
  20. Huang, Z.; Zou, Y.; Kumar, B.; Huang, D. Comprehensive attention self-distillation for weakly-supervised object detection. Adv. Neural Inf. Process. Syst. 2020, 33, 16797–16807. [Google Scholar]
  21. Ren, Z.; Yu, Z.; Yang, X.; Liu, M.Y.; Lee, Y.J.; Schwing, A.G.; Kautz, J. Instance-aware, context-focused, and memory-efficient weakly supervised object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10598–10607. [Google Scholar]
  22. Wang, G.; Zhang, X.; Peng, Z.; Tang, X.; Zhou, H.; Jiao, L. Absolute wrong makes better: Boosting weakly supervised object detection via negative deterministic information. arXiv 2022, arXiv:2204.10068. [Google Scholar]
  23. Li, X.; Kan, M.; Shan, S.; Chen, X. Weakly supervised object detection with segmentation collaboration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9735–9744. [Google Scholar]
  24. Cao, T.; Du, L.; Zhang, X.; Chen, S.; Zhang, Y.; Wang, Y.F. CaT: Weakly supervised object detection with category transfer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3070–3079. [Google Scholar]
  25. Gao, W.; Chen, Y.; Peng, Y. Cascade attentive dropout for weakly supervised object detection. Neural Process. Lett. 2023, 55, 6907–6923. [Google Scholar]
  26. Lai, Q.; Vong, C.M.; Shi, S.Q.; Chen, C.P. Towards precise weakly supervised object detection via interactive contrastive learning of context information. IEEE Trans. Emerg. Top. Comput. Intell. 2024, 8, 123–135. [Google Scholar]
  27. Wang, H.; Li, H.; Qian, W.; Diao, W.; Zhao, L.; Zhang, J.; Zhang, D. Dynamic pseudo-label generation for weakly supervised object detection in remote sensing images. Remote Sens. 2021, 13, 1461. [Google Scholar] [CrossRef]
  28. Yao, X.; Feng, X.; Han, J.; Cheng, G.; Guo, L. Automatic weakly supervised object detection from high spatial resolution remote sensing images via dynamic curriculum learning. IEEE Trans. Geosci. Remote Sens. 2020, 59, 675–685. [Google Scholar]
  29. Lu, J.; Hu, Q.; Zhu, R.; Wei, Y.; Li, T. AFWS: Angle-Free Weakly-Supervised Rotating Object Detection for Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5645514. [Google Scholar]
  30. Qian, X.; Wang, C.; Wang, W.; Yao, X.; Cheng, G. Complete and Invariant Instance Classifier Refinement for Weakly Supervised Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5627713. [Google Scholar]
  31. Zheng, S.; Wu, Z.; Xu, Y.; Wei, Z. Weakly Supervised Object Detection for Remote Sensing Images via Progressive Image-Level and Instance-Level Feature Refinement. Remote Sens. 2024, 16, 1203. [Google Scholar] [CrossRef]
  32. Yan, Z.; Liang, J.; Pan, W.; Li, J.; Zhang, C. Weakly-and semi-supervised object detection with expectation-maximization algorithm. arXiv 2017, arXiv:1702.08740. [Google Scholar]
  33. Choi, J.; Elezi, I.; Lee, H.J.; Farabet, C.; Alvarez, J.M. Active learning for deep object detection via probabilistic modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10264–10273. [Google Scholar]
  34. Polk, S.L.; Cui, K.; Plemmons, R.J.; Murphy, J.M. Active diffusion and VCA-assisted image segmentation of hyperspectral images. In Proceedings of the IGARSS 2022-2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 1364–1367. [Google Scholar]
  35. Yu, W.; Zhu, S.; Yang, T.; Chen, C. Consistency-based active learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3951–3960. [Google Scholar]
  36. Liang, D.; Zhang, J.W.; Tang, Y.P.; Huang, S.J. MUS-CDB: Mixed uncertainty sampling with class distribution balancing for active annotation in aerial object detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5613013. [Google Scholar]
  37. Yang, C.; Huang, L.; Crowley, E.J. Plug and Play Active Learning for Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 17784–17793. [Google Scholar]
  38. Ghita, A.; Antoniussen, B.; Zimmer, W.; Greer, R.; Creß, C.; Møgelmose, A.; Trivedi, M.M.; Knoll, A.C. ActiveAnno3D—An Active Learning Framework for Multi-Modal 3D Object Detection. arXiv 2024, arXiv:2402.03235. [Google Scholar]
  39. Vo, H.V.; Siméoni, O.; Gidaris, S.; Bursuc, A.; Pérez, P.; Ponce, J. Active learning strategies for weakly-supervised object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 211–230. [Google Scholar]
  40. Wang, Y.; Ilic, V.; Li, J.; Kisacanin, B.; Pavlovic, V. ALWOD: Active Learning for Weakly-Supervised Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 6459–6469. [Google Scholar]
  41. Wang, D.; Shang, Y. A new active labeling method for deep learning. In Proceedings of the 2014 International Joint Conference on Neural Networks (IJCNN), Beijing, China, 6–11 July 2014; pp. 112–119. [Google Scholar] [CrossRef]
  42. Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.M.; Yang, J.; Li, X. Large selective kernel network for remote sensing object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 16794–16805. [Google Scholar]
  43. Cai, X.; Lai, Q.; Wang, Y.; Wang, W.; Sun, Z.; Yao, Y. Poly kernel inception network for remote sensing detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 27706–27716. [Google Scholar]
  44. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  45. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar]
  46. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2015, arXiv:1409.1556. [Google Scholar]
  47. Feng, X.; Yao, X.; Cheng, G.; Han, J.; Han, J. Saenet: Self-supervised adversarial and equivariant network for weakly supervised object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5610411. [Google Scholar]
  48. Cheng, G.; Xie, X.; Chen, W.; Feng, X.; Yao, X.; Han, J. Self-guided proposal generation for weakly supervised object detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5625311. [Google Scholar]
  49. Qian, X.; Huo, Y.; Cheng, G.; Gao, C.; Yao, X.; Wang, W. Mining High-Quality Pseudoinstance Soft Labels for Weakly Supervised Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5607615. [Google Scholar] [CrossRef]
  50. Sener, O.; Savarese, S. Active Learning for Convolutional Neural Networks: A Core-Set Approach. arXiv 2017, arXiv:1708.00489. [Google Scholar]
  51. Haussmann, E.; Fenzi, M.; Chitta, K.; Ivanecky, J.; Xu, H.; Roy, D.; Mittel, A.; Koumchatzky, N.; Farabet, C.; Alvarez, J.M. Scalable active learning for object detection. In Proceedings of the 2020 IEEE Intelligent Vehicles Symposium (IV), Las Vegas, NV, USA, 19 October–13 November 2020; pp. 1430–1435. [Google Scholar]
  52. Uijlings, J.R.; Van De Sande, K.E.; Gevers, T.; Smeulders, A.W. Selective search for object recognition. Int. J. Comput. Vis. 2013, 104, 154–171. [Google Scholar]
Figure 1. Illustration of part domination and context confusion, which are two common issues in WSOD and LS-WSOD in RSIs.
Figure 2. An overview of our PDEF. PDAAL, PAC, and ACRT are introduced to the baseline framework. The training progress includes WSOD pretraining and N-stage fine-tuning.
Figure 3. The multi-stage performance of different active learning strategies on the NWPU VHR-10.v2 test dataset.
Figure 4. The multi-stage performance of different active learning strategies on the DIOR test dataset.
Figure 5. Examples of samples selected by the PDAAL strategy on DIOR. The boxes in the images represent the inference results from the previous stage of the model for these samples.
Figure 6. Attention maps of CBAM and PAC.
Figure 7. Some detection results on the NWPU VHR-10.v2 test set with a total budget of 100 (79.8% mAP).
Figure 8. Some detection results on the DIOR test set with a total budget of 200 (45.4% mAP).
Table 1. Platform Specifications.

Platform | Name
GPU | RTX 4090
CPU | i7-13700KF
RAM | 16 GB DDR5
Operating System | Ubuntu 22.04.4
Programming Language | Python 3.8
Deep Learning Framework | PyTorch 2.2
CUDA | 11.8
Table 2. Ablation study of different module sets on the NWPU VHR-10.v2 test dataset.

Method Set | Active Strategy | Attention | ACRT | mAP (%)
A (base) | Random | – | – | 76.2 ± 0.7
B (only PDAAL) | PDAAL | – | – | 77.6 ± 0.5
C (only CBAM) | Random | CBAM [16] | – | 76.7 ± 0.6
D (only PAC) | Random | PAC | – | 77.4 ± 0.8
E (only ACRT) | Random | – | ✓ | 77.5 ± 0.4
F (PDAAL + PAC) | PDAAL | PAC | – | 78.5 ± 0.5
G (PDAAL + ACRT) | PDAAL | – | ✓ | 78.7 ± 0.4
H (PAC + ACRT) | Random | PAC | ✓ | 78.5 ± 0.5
I (PDAAL + PAC + ACRT) | PDAAL | PAC | ✓ | 79.8 ± 0.3
Table 3. Ablation study of different module sets on the DIOR test dataset.

Method Set | Active Strategy | Attention | ACRT | mAP (%)
A (base) | Random | – | – | 41.5 ± 0.6
B (only PDAAL) | PDAAL | – | – | 43.3 ± 0.4
C (only CBAM) | Random | CBAM [16] | – | 42.1 ± 0.4
D (only PAC) | Random | PAC | – | 42.4 ± 0.5
E (only ACRT) | Random | – | ✓ | 42.9 ± 0.6
F (PDAAL + PAC) | PDAAL | PAC | – | 44.0 ± 0.5
G (PDAAL + ACRT) | PDAAL | – | ✓ | 44.5 ± 0.3
H (PAC + ACRT) | Random | PAC | ✓ | 43.6 ± 0.3
I (PDAAL + PAC + ACRT) | PDAAL | PAC | ✓ | 45.4 ± 0.4
Table 4. A comparison of average precision (%) on the NWPU VHR-10.v2 test dataset. “WSOD” means that the model is trained using only image-level annotations. “LS-WSOD” means that image-level annotations and a small number of strong annotations are used.

Method | Airplane | Ship | Storage Tank | Baseball Diamond | Tennis Court | Basketball Court | Ground Track Field | Harbor | Bridge | Vehicle | mAP (%)
WSOD:
WSDDN [1] | 30.1 | 41.7 | 35.0 | 88.9 | 12.9 | 23.9 | 99.4 | 13.9 | 1.9 | 3.6 | 35.1
OICR [2] | 13.7 | 67.4 | 57.2 | 55.2 | 13.6 | 39.7 | 92.8 | 0.2 | 1.8 | 3.7 | 34.5
TCANet [9] | 89.4 | 78.2 | 78.4 | 90.8 | 35.3 | 50.4 | 90.9 | 42.4 | 4.1 | 28.3 | 58.8
SAENet [47] | 82.9 | 74.5 | 50.2 | 96.7 | 55.7 | 72.9 | 100.0 | 36.5 | 6.3 | 31.9 | 60.7
WSODet [10] | 95.3 | 75.6 | 81.9 | 98.0 | 20.9 | 56.7 | 100.0 | 29.8 | 5.1 | 48.1 | 61.3
MIST [21] | 99.2 | 85.5 | 48.7 | 91.3 | 73.5 | 50.6 | 74.7 | 87.1 | 0.1 | 8.5 | 61.9
SPG [48] | 90.4 | 81.0 | 59.5 | 92.3 | 35.6 | 51.4 | 99.9 | 58.7 | 17.0 | 43.0 | 62.8
PISLM [49] | 87.6 | 80.1 | 57.3 | 94.0 | 36.4 | 80.4 | 100.0 | 56.9 | 9.8 | 35.6 | 63.8
IMIN [31] | 90.8 | 81.6 | 56.6 | 91.7 | 51.9 | 69.5 | 100.0 | 53.4 | 16.3 | 40.5 | 65.2
LS-WSOD (Budget 100):
BCNet [13] | 60.1 | 89.3 | 89.5 | 86.5 | 77.7 | 70.8 | 99.2 | 57.8 | 14.5 | 59.2 | 70.5
ITNet [15] | 89.0 | 86.1 | 90.2 | 82.2 | 78.2 | 79.3 | 99.6 | 78.0 | 15.7 | 56.2 | 75.3
Ours | 98.3 | 88.6 | 90.9 | 90.8 | 73.0 | 75.9 | 96.1 | 91.4 | 24.8 | 68.3 | 79.8
Table 5. A comparison of average precision (%) on the DIOR test dataset. “WSOD” means that the model is trained using only image-level annotations. “LS-WSOD” means that image-level annotations and a small number of strong annotations are used.

Method | Airplane | Airport | Baseball Field | Basketball Court | Bridge | Chimney | Dam | Expressway Service Area | Expressway Toll Station | Golf Field
WSOD:
WSDDN [1] | 9.1 | 39.7 | 37.8 | 20.2 | 0.3 | 12.2 | 0.6 | 0.7 | 11.9 | 4.9
OICR [2] | 8.7 | 28.3 | 44.1 | 18.2 | 1.3 | 20.2 | 0.1 | 0.7 | 29.9 | 13.8
TCANet [9] | 25.1 | 30.8 | 62.9 | 40.0 | 4.1 | 67.8 | 8.1 | 23.8 | 29.9 | 22.3
SPG [48] | 31.3 | 36.7 | 62.8 | 29.1 | 6.1 | 62.7 | 0.3 | 15.0 | 30.1 | 35.0
SAENet [47] | 20.6 | 62.4 | 62.7 | 23.5 | 7.6 | 64.6 | 0.2 | 34.5 | 30.6 | 55.4
WSODet [10] | 32.2 | 53.3 | 66.5 | 76.6 | 0.1 | 57.1 | 0.1 | 0.1 | 0.4 | 42.8
MIST [21] | 33.7 | 39.3 | 63.3 | 53.4 | 0.5 | 25.9 | 0.3 | 5.6 | 22.3 | 15.9
PISLM [49] | 29.1 | 49.8 | 70.9 | 41.4 | 7.2 | 45.5 | 0.2 | 35.4 | 36.8 | 60.8
IMIN [31] | 32.9 | 70.5 | 63.2 | 45.7 | 0.2 | 69.7 | 0.2 | 12.4 | 39.4 | 56.4
LS-WSOD (Budget 200):
BCNet [13] | 36.2 | 43.7 | 64.2 | 59.8 | 11.4 | 54.8 | 13.1 | 38.4 | 19.7 | 42.3
ITNet [15] | 53.0 | 45.6 | 68.3 | 75.5 | 13.7 | 68.8 | 13.8 | 35.5 | 42.6 | 30.6
Ours | 51.3 | 49.6 | 61.6 | 73.2 | 18.8 | 58.8 | 16.8 | 23.7 | 40.3 | 59.8

Method | Ground Track Field | Harbor | Overpass | Ship | Stadium | Storage Tank | Tennis Court | Train Station | Vehicle | Windmill | mAP (%)
WSOD:
WSDDN [1] | 42.4 | 4.7 | 1.1 | 0.7 | 63.0 | 4.0 | 6.1 | 0.5 | 4.6 | 1.1 | 13.3
OICR [2] | 57.4 | 10.7 | 11.1 | 9.1 | 59.3 | 7.1 | 0.7 | 0.1 | 9.1 | 0.4 | 16.5
TCANet [9] | 53.9 | 24.8 | 11.1 | 9.1 | 46.4 | 13.7 | 31.0 | 1.5 | 9.1 | 1.0 | 25.8
SPG [48] | 48.0 | 27.1 | 12.0 | 10.0 | 60.0 | 15.1 | 21.0 | 9.9 | 3.2 | 0.1 | 25.8
SAENet [47] | 52.7 | 17.6 | 6.9 | 9.1 | 51.6 | 15.4 | 1.7 | 14.4 | 1.4 | 9.2 | 27.1
WSODet [10] | 66.6 | 0.1 | 1.9 | 2.0 | 52.6 | 22.4 | 68.8 | 0.2 | 1.2 | 0.3 | 27.3
MIST [21] | 50.9 | 26.8 | 19.6 | 49.1 | 42.7 | 32.3 | 37.2 | 6.2 | 28.0 | 2.0 | 27.7
PISLM [49] | 48.5 | 14.0 | 25.1 | 18.5 | 48.9 | 11.7 | 11.9 | 3.5 | 11.3 | 1.7 | 28.6
IMIN [31] | 55.3 | 16.6 | 0.6 | 9.1 | 54.8 | 18.1 | 11.0 | 16.1 | 9.1 | 1.1 | 29.1
LS-WSOD (Budget 200):
BCNet [13] | 56.9 | 26.5 | 26.8 | 21.0 | 56.1 | 26.9 | 68.5 | 2.4 | 20.2 | 34.0 | 36.1
ITNet [15] | 66.6 | 32.6 | 30.5 | 26.7 | 59.3 | 36.5 | 74.6 | 6.6 | 27.2 | 34.0 | 42.3
Ours | 59.6 | 34.1 | 40.8 | 58.2 | 53.3 | 46.7 | 70.4 | 14.2 | 35.1 | 42.3 | 45.4
Table 6. A comparison of the computational performance of our PDEF and other models on the DIOR test dataset [19].

Method | mAP (%) | Params (M) | Ratio | FLOPs (G) | Ratio | Latency (s) | Ratio
WSDDN [1] | 13.3 | 268.6 | 1.99× | 2696.1 | 6.70× | 1.64 | 6.30×
OICR [2] | 16.5 | 137.4 | 1.02× | 402.2 | 1× | 0.26 | 1×
BCNet [13] | 36.1 | 134.7 | 1× | 869.2 | 2.16× | 0.37 | 1.42×
ITNet [15] | 42.3 | 134.7 | 1× | 869.2 | 2.16× | 0.37 | 1.42×
Ours | 45.4 | 135.5 | 1.01× | 983.6 | 2.45× | 0.42 | 1.62×
Table 7. A performance comparison of the image and box budgets on the DIOR test dataset. The first block shows, for a per-stage budget of 240 boxes, how many image samples each active learning method selects per stage and the mAP after 5 stages; the second block shows the corresponding results for a per-stage budget of 40 images.

Active Strategy | PDAAL | BiB | u-r. | b-r. | Ent-Sum | Ent-Max | Core-Set | Core-Set-Ent
Anno. boxes/stage | 240 | 240 | 240 | 240 | 240 | 240 | 240 | 240
Anno. images/stage | 23 | 16 | 42 | 31 | 10 | 30 | 12 | 22
mAP (%) | 42.7 | 42.0 | 41.4 | 41.0 | 36.6 | 40.0 | 38.1 | 39.4
Anno. boxes/stage | 415 | 600 | 230 | 310 | 960 | 320 | 800 | 436
Anno. images/stage | 40 | 40 | 40 | 40 | 40 | 40 | 40 | 40
mAP (%) | 43.3 | 43.0 | 41.5 | 42.0 | 39.8 | 40.6 | 40.5 | 40.8
Table 8. Hyperparameter tuning for λ, μ, and δ.

λ | μ | δ | mAP (%)
0.2 | 2 | 0.7 | 43.0 ± 0.3
0.3 | 2 | 0.7 | 43.1 ± 0.2
0.3 | 3 | 0.8 | 43.3 ± 0.4
0.4 | 3 | 0.8 | 42.9 ± 0.2
0.3 | 5 | 0.9 | 42.7 ± 0.2
Table 9. Hyperparameter tuning for K_h and K_v.

K_h | K_v | mAP (%)
7 | 7 | 41.6 ± 0.3
9 | 9 | 41.8 ± 0.5
11 | 11 | 42.2 ± 0.4
13 | 13 | 42.4 ± 0.5
15 | 15 | 42.3 ± 0.4
