Article

Feature-Guided Instance Mining and Task-Aligned Focal Loss for Weakly Supervised Object Detection in Remote Sensing Images

1 School of Aerospace Science and Technology, Xidian University, Xi’an 710126, China
2 Xi’an Institute of Space Radio Technology, Xi’an 710100, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(10), 1673; https://doi.org/10.3390/rs17101673
Submission received: 3 April 2025 / Revised: 30 April 2025 / Accepted: 8 May 2025 / Published: 9 May 2025

Abstract

Weakly supervised object detection (WSOD) in remote sensing images (RSIs) aims to achieve high-value object classification and localization using only image-level labels, and it has a wide range of applications. However, existing popular WSOD models still encounter two challenges. First, these WSOD models typically select the highest-scoring proposal as the seed instance while ignoring lower-scoring ones, resulting in some less-obvious objects being missed. Second, current models fail to ensure consistency between classification and regression, limiting the upper bound of WSOD performance. To address the first challenge, we propose a feature-guided seed instance mining (FGSIM) strategy to mine reliable seed instances. Specifically, FGSIM first selects multiple high-scoring proposals as seed instances and then leverages a feature similarity measure to mine additional seed instances among lower-scoring proposals. Furthermore, a contrastive loss is introduced to construct a credible similarity threshold for FGSIM by leveraging the consistent feature representations of instances within the same category. To address the second challenge, a task-aligned focal (TAF) loss is proposed to enforce consistency between classification and regression. Specifically, the localization difficulty score and classification difficulty score are used as weights for the classification and regression losses, respectively, thereby promoting their synchronous optimization by minimizing the TAF loss. Additionally, rotated images are incorporated into the baseline to encourage the model to make consistent predictions for objects with arbitrary orientations. Ablation studies validate the effectiveness of FGSIM, TAF loss, and their combination. Comparisons with popular models on two RSI datasets further demonstrate the superiority of our approach.

1. Introduction

Object detection in remote sensing images (RSIs) is essential for their interpretation. It provides technical support for landscape analysis [1,2], urban planning [3,4], and other applications [5,6,7]. Weakly supervised object detection (WSOD) [8,9,10] in RSIs accomplishes object categorization and localization by training detectors with only image-level annotations. Compared to fully supervised object detection [11,12,13,14], which requires instance-level annotations, WSOD significantly reduces annotation effort and has become a focal point of research.
The weakly supervised deep detection network (WSDDN) [15] first formulates WSOD as a multiple instance learning problem. In this framework, each input RSI is treated as a set of object instances, and the detector is optimized using image-level labels within the multiple instance learning paradigm. After training, the object detector predicts class scores for all proposals to determine whether each proposal contains an object and, if so, which class it belongs to. Building on WSDDN, the online instance classifier refinement (OICR) framework [16] incorporates multiple ICR branches to iteratively refine proposal class scores. Specifically, the top-scoring proposal is selected as the seed instance for each class. The seed instances and their neighboring proposals are then treated as positive instances to supervise the next ICR branch.
Existing popular WSOD models [17,18,19,20,21,22] are built upon the OICR framework, incorporating various enhancements to achieve competitive performance. However, these models still encounter two significant challenges. First, some less-obvious objects are often overlooked. RSIs typically contain multiple objects of the same object type, yet the majority of the WSOD models select the top-scoring proposal as the seed instance. Consequently, proposals with relatively lower class scores, despite covering actual objects, are mistakenly classified as background, leading to missed detections. Second, the OICR framework in most WSOD models [9,23,24,25,26,27] incorporates multiple bounding box regression (BBR) branches to improve localization performance. However, they fail to account for the consistency between classification and regression, thereby limiting the capability of WSOD.
To overcome the first issue, a feature-guided seed instance mining (FGSIM) strategy is proposed to mine reliable seed instances overlooked by previous models that rely solely on the highest-scoring proposal. Specifically, the FGSIM strategy first selects multiple high-scoring proposals as initial seed instances and then leverages a feature similarity measure to mine additional seed instances among lower-scoring proposals. To construct a reliable similarity threshold for FGSIM, a contrastive loss is introduced, which enforces intra-class feature similarity while maintaining inter-class feature distinctiveness.
To handle the second challenge, we propose a task-aligned focal (TAF) loss to ensure consistency between classification and regression. Specifically, the localization difficulty score is used as the weight for the traditional classification loss, while the classification difficulty score serves as the weight for the traditional regression loss. Thus, minimizing the TAF loss enables the synchronous optimization of classification and regression.
Furthermore, since objects belonging to the same class often appear in different orientations in RSIs, it is crucial to encourage the model to make consistent predictions for objects with arbitrary orientations. To this end, inspired by the rotation-invariant aerial object detection network [20], we alternately feed the original RSI and its rotated counterpart into sequential ICR branches. This design enables consistent predictions for objects with arbitrary orientations, as each branch is supervised by the pseudo-labels generated from its preceding branch, thereby reinforcing rotation-invariant learning through cross-branch consistency.
Moreover, similar to the challenges encountered in optical RSIs, synthetic aperture radar imaging also faces difficulties related to signal separation under limited observation conditions, such as range ambiguity. Recent advances, like the blind source separation-based range ambiguity suppression method proposed in [28], highlight the importance of feature disentanglement, which shares conceptual parallels with the objective of WSOD in RSIs.
The contributions of this work can be summarized as follows:
1.
A novel FGSIM strategy is proposed to address the challenge where current models often detect salient objects while overlooking inconspicuous ones due to their reliance on selecting the top-scoring proposal as the seed instance. The FGSIM first selects high-scoring proposals as initial seed instances and then expands this set by mining additional seed instances based on a feature similarity measure. Furthermore, a contrastive loss is introduced to establish a reliable similarity threshold for FGSIM by leveraging the consistent feature representations of instances within the same category.
2.
A TAF loss is proposed to address inconsistencies between the classification and regression branches. The TAF loss utilizes localization and classification difficulty scores as weights for the classification and regression losses, respectively. Thus, minimizing the TAF loss enables the synchronous optimization of classification and regression.

2. Related Work

Similar to most existing WSOD methods, we adopt OICR [16] as the baseline framework, which is built upon the WSDDN framework [15] and incorporates multiple ICR branches. To further improve object localization performance, many WSOD models [9,23,24,25,26,27,29] extend the OICR architecture by integrating multiple BBR branches. Our method follows this direction by incorporating multiple BBR branches as well. Thus, we first review WSDDN and OICR, followed by an overview of the BBR branches. We then briefly summarize other relevant WSOD methods to provide a comprehensive background.

2.1. Weakly Supervised Deep Detection Network

As shown in Figure 1, the selective search algorithm [30] is first applied to generate proposals for image $I$, denoted as $R = \{r_1, \ldots, r_a, \ldots, r_A\}$, where $r_a$ denotes the $a$th proposal and $A$ represents the number of proposals. The input RSI $I$ and its proposals $R$ are fed into the convolutional network, followed by a region of interest (RoI) pooling layer and two fully connected (FC) layers, to extract the feature vectors $\{f_a \in \mathbb{R}^{4096}\}_{a=1}^{A}$ of all proposals. These feature vectors are then fed into two parallel FC layers to produce two matrices, denoted as $M^c \in \mathbb{R}^{C \times A}$ and $M^d \in \mathbb{R}^{C \times A}$, where $C$ represents the number of categories. The class scores of all proposals, represented as $M \in \mathbb{R}^{C \times A}$, are computed as follows:

$$ M = \sigma_c(M^c) \odot \sigma_d(M^d) \tag{1} $$

where $\sigma_d$ and $\sigma_c$ denote the softmax operation over proposals and classes, respectively, and $\odot$ denotes the elementwise product. The image-level class score of the $c$th class, represented as $\phi_c$, is obtained using the following equation:

$$ \phi_c = \sum_{a=1}^{A} m_{c,a} \tag{2} $$

where $m_{c,a} \in M$ represents the class score of the $a$th proposal for the $c$th class. Finally, the loss of WSDDN, represented as $L_{WSDDN}$, is given by the following:

$$ L_{WSDDN} = -\sum_{c=1}^{C} \big[\, y_c \log \phi_c + (1 - y_c) \log (1 - \phi_c) \,\big] \tag{3} $$

where $y_c$ is the image-level label, with $y_c = 1$ indicating the presence of at least one object of the $c$th class in image $I$, and $y_c = 0$ indicating its absence.
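The following PyTorch sketch illustrates the scoring head of Equations (1)–(3). The layer sizes follow the text (4096-D proposal features, $C$ classes); the class and variable names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSDDNHead(nn.Module):
    def __init__(self, feat_dim=4096, num_classes=10):
        super().__init__()
        self.fc_cls = nn.Linear(feat_dim, num_classes)  # produces M^c
        self.fc_det = nn.Linear(feat_dim, num_classes)  # produces M^d

    def forward(self, proposal_feats):           # (A, 4096) proposal feature vectors
        m_cls = self.fc_cls(proposal_feats)       # (A, C)
        m_det = self.fc_det(proposal_feats)
        # softmax over classes for M^c, over proposals for M^d, then elementwise product
        scores = F.softmax(m_cls, dim=1) * F.softmax(m_det, dim=0)   # Eq. (1)
        image_scores = scores.sum(dim=0).clamp(1e-6, 1 - 1e-6)       # Eq. (2), phi_c
        return scores, image_scores

def wsddn_loss(image_scores, labels):
    # Eq. (3): binary cross-entropy between image-level scores and image-level labels
    return F.binary_cross_entropy(image_scores, labels, reduction="sum")

# usage sketch
head = WSDDNHead(num_classes=10)
feats = torch.randn(300, 4096)                    # e.g., 300 selective-search proposals
labels = torch.zeros(10); labels[3] = 1.0         # image contains an object of class 3
scores, phi = head(feats)
loss = wsddn_loss(phi, labels)
```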

2.2. Online Instance Classifier Refinement

Building upon WSDDN [15], OICR [16] incorporates $B$ ICR branches to refine the proposal class scores, where each ICR branch consists of an FC layer followed by a softmax operation. The class scores of all proposals in the $b$th ICR branch, denoted as $X_b \in \mathbb{R}^{(C+1) \times A}$, are obtained by feeding the feature vectors $\{f_a \in \mathbb{R}^{4096}\}_{a=1}^{A}$ into the $b$th ICR branch, where the $(C+1)$th dimension of $X_b$ represents the background class. The instance-level pseudo labels of the $b$th ICR branch are derived from the class score matrix of the $(b-1)$th ICR branch; notably, the supervision signals for the first ICR branch are derived from WSDDN. Specifically, for the $b$th ICR branch, a pseudo label is assigned a value of 1 if the corresponding proposal exhibits sufficient overlap with the seed instance (i.e., the top-scoring proposal); otherwise, it is assigned a value of 0. The details are as follows:

$$ \bar{a}_b^c = \arg\max_a \, x_{b-1}^{c,a} \tag{4} $$

$$ y_b^{c,a} = \begin{cases} 1 & \text{if } IoU(r_a, r_{\bar{a}_b^c}) > 0.5 \\ 0 & \text{otherwise} \end{cases} \tag{5} $$

where $\bar{a}_b^c$ denotes the index of the seed instance, $x_{b-1}^{c,a} \in X_{b-1}$ represents the class score of the $a$th proposal for the $c$th class in the $(b-1)$th ICR branch, $IoU(\cdot,\cdot)$ represents the IoU between two proposals, and $y_b^{c,a}$ denotes the instance-level pseudo label of the $a$th proposal for the $c$th class in the $b$th ICR branch. The classification loss of the $b$th ICR branch, represented as $L_{ICR}^b$, is formulated as follows:

$$ L_{ICR}^b = -\sum_{a=1}^{A} \sum_{c=1}^{C+1} w_b^a \, y_b^{c,a} \log x_b^{c,a} \tag{6} $$

where $x_b^{c,a} \in X_b$ represents the class score of the $a$th proposal for the $c$th class in the $b$th ICR branch, and $w_b^a = x_{b-1}^{c,\bar{a}_b^c}$ denotes the loss weight [16].
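The seed selection and pseudo-label assignment of Equations (4)–(6) can be sketched as follows; the box_iou helper from torchvision is used for the IoU test, and all names are illustrative assumptions rather than the actual implementation.

```python
import torch
from torchvision.ops import box_iou

def oicr_pseudo_labels(prev_scores, boxes, image_labels, iou_thr=0.5):
    """prev_scores: (A, C) class scores from the previous branch (or WSDDN);
    boxes: (A, 4) proposal boxes; image_labels: (C,) binary image-level labels."""
    A, C = prev_scores.shape
    targets = torch.zeros(A, C + 1)           # column C is the background class
    targets[:, C] = 1.0                       # default: background
    weights = torch.ones(A)
    for c in torch.nonzero(image_labels).flatten():
        seed = prev_scores[:, c].argmax().item()   # Eq. (4): top-scoring proposal is the seed
        ious = box_iou(boxes, boxes[seed:seed + 1]).squeeze(1)
        pos = ious > iou_thr                  # Eq. (5): neighbours of the seed become positive
        targets[pos] = 0.0
        targets[pos, c] = 1.0
        weights[pos] = prev_scores[seed, c]   # loss weight w_b^a in Eq. (6)
    return targets, weights

def icr_loss(log_probs, targets, weights):
    # Eq. (6): weighted cross-entropy over all proposals and C+1 classes
    return -(weights.unsqueeze(1) * targets * log_probs).sum()
```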

2.3. Bounding Box Regression Branch

To improve the localization performance of WSOD, we extend the OICR framework with $B$ BBR branches, following the common paradigm in existing WSOD approaches [31,32,33]. Each BBR branch contains an FC layer. For the $b$th BBR branch, the regression loss is defined using the smooth L1 function [34] and is given by the following:

$$ L_{REG}^b = \frac{1}{|P_b|} \sum_{k=1}^{|P_b|} w_k^b \, \mathrm{smooth}_{L_1}\big(g_k^b, \hat{g}_k^b\big) \tag{7} $$

where $P_b$ represents the set of positive instances in the $b$th ICR branch, $|\cdot|$ denotes the cardinality of a set, $g_k^b$ and $\hat{g}_k^b$ correspond to the predicted and target offsets of the $k$th instance, respectively, and $w_k^b$ has the same meaning as $w_b^a$ defined in Equation (6).
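A minimal sketch of the per-branch regression loss in Equation (7) is given below, assuming the offsets and weights of the positive instances have already been gathered; the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def bbr_loss(pred_offsets, target_offsets, weights):
    """pred_offsets, target_offsets: (|P_b|, 4) offsets of the positive instances;
    weights: (|P_b|,) the same w weights used in the ICR loss."""
    per_box = F.smooth_l1_loss(pred_offsets, target_offsets, reduction="none").sum(dim=1)
    # 1/|P_b| * sum_k w_k * smoothL1(g_k, g_hat_k)
    return (weights * per_box).mean()
```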

2.4. Other WSOD Models

Several seed instance mining methods have been proposed to address a key limitation of OICR: because it considers only the top-scoring proposal as the seed instance, it struggles to detect multiple objects of the same category. These methods can be divided into two groups.
The first category focuses on mining seed instances according to proposal class scores alone. For instance, the proposal cluster learning (PCL) model [35] groups proposals into multiple clusters based on their class scores. Seed instances are selected from the cluster with the highest class score. The multiple instance detection network [21] adaptively selects seed instances by analyzing the distribution of class scores. The multiple instance self-training (MIST) model [24] selects the top-k scoring proposals as seed instances. The rotation-invariant aerial object detection network [20] determines seed instances by selecting the highest-scoring instances from each affine transformation branch. Seed instances from multiple branches are then combined to form the final set of seed instances. Beyond these models, several other approaches follow a similar paradigm, such as the dynamic curriculum learning (DCL) model [9], the complementary detection network (CDN) [22], the high-quality instance mining (HQIM) model [36], the multiscale image-splitting-based feature enhancement module [25], and the self-guided proposal generation (SPG) strategy [37], among others.
The second category selects seed instances by combining class scores with additional cues. For example, the pseudo instance labels mining (PILM) strategy [38] utilizes a proposal quality score, which is composed of a dual-context projection score and class scores, to mine seed instances. The semantic segmentation-guided label mining (SGPM) strategy [23] identifies seed instances by incorporating segmentation information along with class scores. The multiple instance graph (MIG) strategy [39] employs a metric termed apparent similarity to mine seed instances. Qian et al. [8] leverage segmentation information inferred from SAM to refine proposal class scores. The refined class scores are then used to mine seed instances. Other similar methods include [26,29], etc.

3. Proposed Method

3.1. Overview

As shown in Figure 1, we adopt the OICR framework extended with multiple BBR branches as the foundation of our model. First, unlike OICR, which only processes the original image, we alternately feed the original image and its randomly rotated version into the ICR branches. Supervision information is iteratively propagated between these branches, encouraging consistent predictions for detected instances before and after rotation. Second, the proposed FGSIM strategy replaces the seed instance mining mechanism in OICR by leveraging feature similarity between proposals. It identifies additional seed instances with lower class scores, thus generating more informative and reliable supervisory signals. Furthermore, the contrastive loss constructs a credible similarity threshold for the FGSIM strategy by enhancing intra-class feature similarity while maintaining inter-class feature distinctiveness. Finally, although many WSOD methods extend OICR with multiple BBR branches, they often treat classification and localization as separate tasks. In contrast, our method explicitly enhances the consistency between them by introducing the TAF loss, which weights the classification loss with localization difficulty scores and the regression loss with classification difficulty scores. This enables the synchronous optimization of classification and regression through the minimization of the TAF loss.
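A simplified sketch of how the original and rotated inputs could be alternated across the ICR branches is shown below. The branch interface and the rotation utility are assumptions for illustration, and the consistent transformation of the proposal coordinates is omitted.

```python
import torchvision.transforms.functional as TF

def forward_icr_branches(image, proposals, branches, angle=90.0):
    """branches: list of B ICR branch callables; 'branch(image, proposals)' is an assumed
    interface. Proposal coordinates would need the same rotation applied (omitted here)."""
    rotated = TF.rotate(image, angle)             # rotated counterpart of the input RSI
    outputs = []
    for b, branch in enumerate(branches):
        # alternate original / rotated inputs; branch b is supervised by pseudo-labels
        # derived from branch b-1 (or WSDDN for b = 0), enforcing rotation consistency
        inp = image if b % 2 == 0 else rotated
        outputs.append(branch(inp, proposals))
    return outputs
```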

3.2. Feature-Guided Seed Instance Mining Strategy

As previously discussed, existing WSOD models tend to disregard the proposals with lower class scores, even though these proposals may still encompass less obvious objects, thereby increasing the risk of missed detections. Intuitively, even if a proposal has a low class score, it may still belong to the same category as the top-scoring proposal if its feature vector is similar to that of the top-scoring proposal. Therefore, the proposed FGSIM strategy relies not only on proposal class scores to mine seed instances but also leverages feature similarity between instances to mine additional seed instances.
As described in Figure 1, the class scores of all proposals are first leveraged to discover initial seed instances. Specifically, for the $c$th class, we rank the class scores within the set $R$ and select the top $q$ percent of proposals to form the initial seed instance set. Notably, the size of this set scales with the number of proposals, making it adaptive to image content. The initial seed instance set for the $b$th ICR branch, denoted as $S_b^c$, is derived from the class scores of the $(b-1)$th ICR branch, and the remaining proposals form the set $R_b^c$. Importantly, unlike the other ICR branches, the initial seed instance set for the first ICR branch is derived from WSDDN.
To discover additional seed instances with lower class scores, we employ a feature similarity measure between the remaining proposals and the top-scoring proposal. The similarity threshold is adaptively determined by averaging the similarity scores between the top-scoring proposal and its category-aligned proposals in the initial seed instance sets. The similarity threshold $\tau_b^c$ of class $c$ in the $b$th ICR branch is calculated as follows:

$$ \tau_b^c = \frac{1}{|S^c|} \sum_{j=1}^{|S^c|} \mathrm{sim}\big(\varphi(f_{\bar{a}_b^c}), \varphi(f_{s_j^c})\big) \tag{8} $$

where $S^c = \bigcup_{b=1}^{B} S_b^c$, $s_j^c \in S^c$ denotes the $j$th instance in $S^c$, $f_{s_j^c} \in \mathbb{R}^{4096}$ ($f_{\bar{a}_b^c} \in \mathbb{R}^{4096}$) denotes the feature vector of $s_j^c$ ($\bar{a}_b^c$), and $\mathrm{sim}(\cdot,\cdot)$ denotes the dot product between its inputs. $\varphi(\cdot)$ denotes the feature refining operation, which is composed of two FC layers with a ReLU activation in between; the first FC layer keeps the input dimension of 4096, and the second reduces the features to 128 dimensions, followed by a normalization operation. Additional seed instances $\hat{S}_b^c$ are then selected as those whose similarity to the top-scoring proposal exceeds the similarity threshold $\tau_b^c$:

$$ \hat{S}_b^c = \Big\{\, i \;\Big|\; \mathrm{sim}\big(\varphi(f_i), \varphi(f_{\bar{a}_b^c})\big) > \tau_b^c, \; i = 1, 2, \ldots, |R_b^c| \,\Big\} \tag{9} $$

where $f_i$ denotes the feature vector of the $i$th proposal in the remaining proposal set $R_b^c$. Finally, the seed instances of the $c$th category in the $b$th ICR branch are obtained by merging $\hat{S}_b^c$ and $S_b^c$, followed by Non-Maximum Suppression (NMS) [40]. Subsequently, the seed instances and their neighboring instances serve as positive samples to supervise the $b$th ICR and BBR branches.
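The FGSIM procedure can be summarized by the following sketch, which, for brevity, computes the threshold of Equation (8) from the current branch's initial seed set only. The refinement head φ follows the 4096→4096→128 description above; the remaining names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import nms

phi = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 128))

def embed(feats):
    return F.normalize(phi(feats), dim=1)         # normalized 128-D refined embeddings

def fgsim_seeds(scores_c, feats, boxes, q=0.15, nms_thr=0.3):
    """scores_c: (A,) class-c scores; feats: (A, 4096); boxes: (A, 4)."""
    A = scores_c.numel()
    k = max(1, int(q * A))
    order = scores_c.argsort(descending=True)
    init_idx, rest_idx = order[:k], order[k:]     # initial seeds / remaining proposals
    z = embed(feats)
    top = z[order[0]]                             # embedding of the top-scoring proposal
    # Eq. (8): threshold = mean similarity between the top proposal and the initial seeds
    tau = (z[init_idx] @ top).mean()
    # Eq. (9): mine extra seeds whose similarity to the top proposal exceeds tau
    extra_idx = rest_idx[(z[rest_idx] @ top) > tau]
    seed_idx = torch.cat([init_idx, extra_idx])
    keep = nms(boxes[seed_idx], scores_c[seed_idx], nms_thr)   # merge, then NMS
    return seed_idx[keep]
```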

3.3. Contrastive Loss

To enhance the consistency of feature vectors for instances within the same category, we introduce a contrastive loss. This loss pulls positive samples closer together while pushing negative samples apart in the feature space. Specifically, we collect the feature vectors of the positive instances from all ICR branches, together with their corresponding pseudo labels, into a collection $G = \{(f_z, t_z)\}_{z=1}^{|G|}$, where $f_z$ denotes the feature vector of the $z$th positive instance and $t_z$ represents its pseudo label. The contrastive loss for the $z$th positive instance, denoted as $L_{CL}^z$, is formulated as follows:

$$ L_{CL}^z = -\frac{1}{M_{t_z}-1} \sum_{l=1, l \neq z}^{|G|} \mathbb{1}\{t_z = t_l\} \cdot \log \frac{\exp\big(\varphi(f_l)\cdot\varphi(f_z)/\varepsilon\big)}{\sum_{n=1, n \neq z}^{|G|} \exp\big(\varphi(f_n)\cdot\varphi(f_z)/\varepsilon\big)} \tag{10} $$

where $M_{t_z} := \sum_{l=1}^{|G|} \mathbb{1}\{t_z = t_l\}$, and $\varepsilon$ denotes the temperature parameter [41]. Finally, the contrastive loss over all positive instances, denoted as $L_{CL}$, is obtained as follows:

$$ L_{CL} = \sum_{z=1}^{|G|} L_{CL}^z \tag{11} $$
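A compact sketch of Equations (10) and (11), in the spirit of the supervised contrastive loss [41], is given below; it assumes the refined, normalized embeddings of the positive instances have already been computed, and the variable names are illustrative.

```python
import torch

def contrastive_loss(embeddings, labels, eps=0.2):
    """embeddings: (|G|, 128) L2-normalized refined features of the positive instances;
    labels: (|G|,) pseudo labels t_z; eps: temperature."""
    sim = embeddings @ embeddings.t() / eps                      # pairwise dot products / eps
    G = labels.numel()
    eye = torch.eye(G, dtype=torch.bool, device=sim.device)
    logits = sim.masked_fill(eye, float("-inf"))                 # exclude n = z in the denominator
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye   # indicator 1{t_z = t_l}, l != z
    m = same.sum(dim=1).clamp(min=1)                             # M_{t_z} - 1
    per_sample = -(same * log_prob).sum(dim=1) / m               # Eq. (10)
    return per_sample.sum()                                      # Eq. (11)
```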

3.4. Task-Aligned Focal Loss

Existing WSOD models overlook the consistency between classification and localization, which constrains their overall performance. While focal loss [42] effectively emphasizes hard instances and enhances overall performance, it fails to consider the misalignment between classification and regression. To mitigate this issue, we propose a TAF loss that not only prioritizes hard samples but also enforces consistency between regression and classification, ultimately enhancing WSOD performance. The TAF loss is formulated as follows:
$$ L_{TAF}^b = \tilde{L}_{ICR}^b + \tilde{L}_{REG}^b \tag{12} $$

where $L_{TAF}^b$ denotes the TAF loss of the $b$th branch, $\tilde{L}_{ICR}^b$ represents the TAF classification loss of the $b$th ICR branch, and $\tilde{L}_{REG}^b$ is the TAF regression loss of the $b$th BBR branch. The details of $\tilde{L}_{ICR}^b$ and $\tilde{L}_{REG}^b$ are provided below.
The TAF classification loss incorporates the localization difficulty score into the traditional classification loss and is defined as follows:

$$ \tilde{L}_{ICR}^b = \begin{cases} e^{\mu(1-\mathrm{IoU})} \times L_{ICR}^b & \text{if } r \in P_b \\ L_{FL} & \text{if } r \in N_b \end{cases} \tag{13} $$

where $N_b$ and $P_b$ denote the sets of negative and positive instances in the $b$th ICR branch, respectively, $r$ represents a proposal, $L_{FL}$ is the focal loss, and $\mu$ is a modulating factor defined as follows:

$$ \mu = \frac{\log(t+\delta) - \log(\delta)}{\log(T+\delta) - \log(\delta)} \tag{14} $$

where $\delta$ is a hyperparameter that controls the growth rate of $\mu$, and $T$ ($t$) represents the total (current) number of iterations. The modulating factor $\mu$ gradually increases as training proceeds, mitigating the impact of inaccurate supervisory information generated in the early stages.
The TAF regression loss incorporates the classification difficulty score into the traditional regression loss and is defined as follows:

$$ \tilde{L}_{REG}^b = e^{\mu(1-x)} L_{REG}^b \tag{15} $$

where $x$ denotes the class score of the proposal.
As formulated above, $L_{TAF}^b$ is designed to ensure consistency between regression and classification. The rationale behind its effectiveness can be summarized as follows. First, the consistency between classification and localization is only enforced for positive instances, while for negative instances $\tilde{L}_{ICR}^b$ applies the traditional focal loss. Second, the localization difficulty score $e^{\mu(1-\mathrm{IoU})}$ is used as the weight of $L_{ICR}^b$ when $r \in P_b$, and this strategy achieves two goals. On the one hand, it assigns a higher loss weight to more challenging instances, as $e^{\mu(1-\mathrm{IoU})}$ increases with localization difficulty. On the other hand, as $\tilde{L}_{ICR}^b$ is minimized, $L_{ICR}^b$ approaches 0 while the IoU approaches 1, which indicates that classification and regression are optimized synchronously. Similarly, the classification difficulty score of a positive instance, $e^{\mu(1-x)}$, is adopted as the weight of $L_{REG}^b$. This mechanism ensures that harder instances receive larger loss weights in $\tilde{L}_{REG}^b$. Furthermore, as $\tilde{L}_{REG}^b$ is minimized, $L_{REG}^b$ is refined and $x$ converges to 1, further enhancing the consistency between classification and regression.
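The weighting scheme of Equations (12)–(15) can be sketched as follows; the per-instance losses, IoUs, and class scores are assumed to be available from the ICR and BBR branches, and the function names are illustrative.

```python
import math
import torch

def modulating_factor(t, T, delta=100.0):
    # Eq. (14): mu grows from 0 to 1 as training proceeds
    return (math.log(t + delta) - math.log(delta)) / (math.log(T + delta) - math.log(delta))

def taf_loss(cls_loss_pos, reg_loss_pos, ious, cls_scores, focal_loss_neg, t, T):
    """cls_loss_pos, reg_loss_pos: per-positive-instance classification / regression losses;
    ious: IoU of each positive instance with its seed box; cls_scores: class score x of each
    positive instance; focal_loss_neg: focal loss summed over the negative instances."""
    mu = modulating_factor(t, T)
    loc_difficulty = torch.exp(mu * (1.0 - ious))         # weight for the classification loss
    cls_difficulty = torch.exp(mu * (1.0 - cls_scores))   # weight for the regression loss
    l_icr = (loc_difficulty * cls_loss_pos).sum() + focal_loss_neg   # Eq. (13)
    l_reg = (cls_difficulty * reg_loss_pos).sum()                    # Eq. (15)
    return l_icr + l_reg                                             # Eq. (12)
```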

3.5. Overall Training Loss

Our model’s total training loss, represented as $L_{ALL}$, is given by the following:

$$ L_{ALL} = L_{WSDDN} + L_{CL} + \sum_{b=1}^{B} L_{TAF}^b \tag{16} $$

During inference, the proposal feature vectors of the original image are fed into the individually trained BBR and ICR branches to compute the offsets and prediction scores. The resulting scores are then averaged to produce preliminary detection results, which are further refined using the NMS operation to generate the final predictions.
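A simplified sketch of this inference step is given below; the helper names and the score filtering threshold are assumptions, while the NMS threshold of 0.3 follows Section 4.2.

```python
import torch
from torchvision.ops import nms

def detect(branch_scores, refined_boxes, score_thr=0.05, nms_thr=0.3):
    """branch_scores: list of (A, C+1) score matrices from the ICR branches;
    refined_boxes: (A, 4) proposal boxes already adjusted by the BBR offsets."""
    scores = torch.stack(branch_scores).mean(dim=0)[:, :-1]   # average branches, drop background
    detections = []
    for c in range(scores.shape[1]):
        mask = scores[:, c] > score_thr        # score_thr is an assumed filtering threshold
        keep = nms(refined_boxes[mask], scores[mask, c], nms_thr)
        detections.append((c, refined_boxes[mask][keep], scores[mask, c][keep]))
    return detections
```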

4. Experiments

4.1. Datasets and Evaluation Metrics

The experiments are conducted on the NWPU VHR-10.v2 [43,44] and DIOR [45] datasets. The NWPU VHR-10.v2 dataset contains 1,172 RSIs, each with a resolution of 400 × 400 pixels, and a total of 2775 object instances across 10 categories. It is split into three subsets: a training set (679 images), a validation set (200 images), and a testing set (293 images), where the training and validation subsets are used for model training and the testing subset for evaluation. The DIOR dataset, consisting of 23,463 RSIs with a resolution of 800 × 800 pixels, contains 192,472 labeled instances across 20 categories. It is split into a training set (5862 images), a validation set (5863 images), and a testing set (11,738 images), where the training and validation subsets are used for training and the testing set for evaluation.
Detection performance on the testing set is assessed using mean average precision (mAP), while the localization capability is evaluated on the training and validation sets using correct localization (CorLoc) [46].
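For reference, a simplified reading of the CorLoc metric [46] is sketched below: for each class present in an image, the top-scoring detection counts as correctly localized if its IoU with a ground-truth box of that class exceeds 0.5. This is an illustrative interpretation, not the exact evaluation code.

```python
import torch
from torchvision.ops import box_iou

def corloc_for_class(per_image_dets, per_image_gts, iou_thr=0.5):
    """per_image_dets: list of (boxes, scores) for one class; per_image_gts: list of GT boxes."""
    hits, total = 0, 0
    for (boxes, scores), gts in zip(per_image_dets, per_image_gts):
        if gts.numel() == 0:
            continue                              # class absent from this image
        total += 1
        if boxes.numel() == 0:
            continue                              # no detection counts as a miss
        top = boxes[scores.argmax()].unsqueeze(0)  # top-scoring detection
        if box_iou(top, gts).max() > iou_thr:
            hits += 1
    return 100.0 * hits / max(total, 1)
```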

4.2. Implementation Details

The architecture of our convolutional network is shown in Table 1. Similar to most WSOD models, the ConvNet of our model is built on VGG16 [47], which is pretrained on ImageNet [48]. The FC layers are initialized using a Gaussian distribution with a mean of 0 and a standard deviation of 0.01. Specifically, each convolutional layer is denoted as “Conv2d (kernel size, stride, padding)-(number of output channels)” and is followed by a ReLU activation. The feature map of the image is obtained by passing the RSI through the ConvNet. Both the feature map of the input RSI and proposals are then passed into the RoI pooling layer to obtain fixed-size proposal feature maps. These proposal feature maps are subsequently passed through two FC layers to produce proposal feature vectors. Finally, the proposal feature vectors are fed into three branches: WSDDN, OICR, and BBR. The class scores predicted by the WSDDN branch are used to generate pseudo-labels for the first ICR and BBR branches. Similarly, the class scores from the current ICR branch are used to generate pseudo-labels for the next ICR and BBR branches.
The stochastic gradient descent algorithm is used to optimize our convolutional network. During training, we employ a batch size of 2, a momentum of 0.9, and a weight decay of 0.005. For the DIOR and NWPU VHR-10.v2 datasets, training is conducted for 60K and 30K iterations, respectively. The initial learning rate is set to 0.0025 and is progressively reduced to 10% of its previous value at the 50Kth (20Kth) and 56Kth (26Kth) iterations for the DIOR (NWPU VHR-10.v2) dataset. The number of ICR streams is fixed at 3 (i.e., B = 3). Data augmentation techniques include horizontal flipping and rotations of 90° and 180°. During both training and testing, each RSI is resized to one of the following dimensions: {480, 576, 688, 864, 1200}. During inference, the threshold for NMS [40] is set to 0.3.
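The optimization setup above can be expressed with standard PyTorch components as follows (DIOR milestones shown); the model here is only a placeholder for the full detector.

```python
import torch

model = torch.nn.Linear(4096, 21)   # placeholder for the full detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.0025,
                            momentum=0.9, weight_decay=0.005)
# lr drops to 10% of its previous value at the 50K-th and 56K-th iterations (DIOR schedule);
# scheduler.step() is called once per training iteration
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[50000, 56000], gamma=0.1)
```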
The proposed model, developed using the PyTorch (version 1.7.1) framework, is executed on an Ubuntu 16.04 system equipped with two Titan RTX GPUs.

4.3. Parameter Analysis

(1) Parameter Analysis of q: The proportion q is a crucial hyperparameter controlling the selection of initial seed instances. Its effect is quantitatively evaluated on the DIOR and NWPU VHR-10.v2 datasets in terms of mAP. To analyze its impact, we conduct experiments by varying q among {0.05, 0.10, 0.15, 0.20, 0.25}. As shown in Figure 2, the model achieves the highest mAP when q = 0.15. Therefore, q is set to 0.15.
(2) Parameter Analysis of ε: As illustrated in Figure 3, the temperature parameter ε is quantitatively analyzed on the DIOR and NWPU VHR-10.v2 datasets in terms of mAP. Specifically, we evaluate ε over the range {0.05, 0.10, 0.15, 0.20, 0.25, 0.30} to assess its impact. The results show that the model achieves the highest mAP on both datasets when ε = 0.2. Therefore, we set ε = 0.2 as the default value in our model.
(3) Parameter Analysis of δ: As illustrated in Figure 4, δ is quantitatively analyzed in terms of mAP across the DIOR and NWPU VHR-10.v2 datasets. We evaluate δ over the range {50, 100, 150, 200} to assess its impact. The results show that the model achieves the highest mAP on both datasets when δ = 100. Therefore, we set δ = 100 as the default value in our model.

4.4. Ablation Study

4.4.1. Quantitative Ablation Study

To validate the effectiveness of the FGSIM strategy, TAF loss, and their combination, we conduct an ablation study on the NWPU VHR-10.v2 dataset by incrementally integrating these components into the baseline model and evaluating their performance in terms of mAP and CorLoc.
We first examine the impact of the FGSIM strategy by comparing the baseline model with its variant incorporating FGSIM, denoted as Baseline + FGSIM in Table 2. On the NWPU VHR-10.v2 dataset, Baseline + FGSIM achieves improvements of 16.3% in mAP and 12.0% in CorLoc, confirming the effectiveness of the FGSIM strategy.
Next, we assess the contribution of TAF loss by comparing the baseline model with Baseline + TAF (i.e., the combination of the baseline model and TAF loss). As shown in Table 2, Baseline + TAF achieves an improvement of 12.7% in mAP and 11.4% in CorLoc on the NWPU VHR-10.v2 dataset, demonstrating the effectiveness of TAF loss.
Finally, to analyze the combined effect of FGSIM and TAF loss, we compare the baseline model with Baseline + FGSIM + TAF on the NWPU VHR-10.v2 dataset. Baseline + FGSIM + TAF improves mAP by 23.5% and CorLoc by 17.1%, demonstrating the effectiveness of integrating FGSIM and TAF loss.

4.4.2. Subjective Ablation Study

We qualitatively compare the detection results of the baseline model with those of its enhanced variants (i.e., Baseline + FGSIM and Baseline + TAF) on the NWPU VHR-10.v2 dataset to further demonstrate the impact of the FGSIM strategy and TAF loss.
As shown in Figure 5, the Baseline fails to detect all airplanes with arbitrary orientations; however, our model can correctly identify them, which is attributed to the incorporation of the FGSIM module. The FGSIM module leverages feature similarity measurements to discover additional seed instances with lower class scores, thereby encouraging the model to detect as many foreground objects as possible.
As shown in Figure 6, the airplane on the right side of the image receives a high classification score but suffers from poor localization in the baseline model, indicating an inconsistency between classification confidence and localization accuracy. After applying the TAF loss, our model not only assigns a more accurate bounding box to the same airplane, but also provides a higher and more consistent classification score. This demonstrates that TAF loss effectively improves the consistency between classification and localization.

4.5. Quantitative Comparison with Popular Methods

To assess our model’s effectiveness, we conduct quantitative evaluations on the DIOR and NWPU VHR-10.v2 datasets in terms of CorLoc and mAP. Specifically, we compare its performance against two classical fully supervised object detection approaches (Faster R-CNN [49] and Fast R-CNN [34]) along with fifteen advanced WSOD models including WSDDN [15], MIST [24], OICR [16], DCL [9], CDN [22], SGPM [23], PILM [38], HQIM [36], the triple context-aware (TCA) model [19], the progressive contextual instance refinement (PCIR) model [18], the multiple instance graph (MIG) model [39], the self-guided proposal generation (SPG) model [37], the self-supervised adversarial and equivariant (SAE) network [50], instance-level feature refinement (ILFR) model [51], and multi-instance mining with dynamic localization (MIDL) model [52]. Among these, CDN and MIDL are rotation-invariant WSOD approaches.
As presented in Table 3, our model achieves 72.5% mAP on the NWPU VHR-10.v2 dataset, surpassing WSDDN by 37.4%, OICR by 38.0%, MIST by 21.0%, DCL by 20.4%, PCIR by 17.5%, MIG by 16.5%, TCA by 13.7%, CDN by 14.4%, SAE by 11.8%, SPG by 9.7%, PILM by 8.7%, SGPM by 7.3%, HQIM by 6.3%, ILFR by 7.3%, and MIDL by 10.6%, respectively. Similarly, as shown in Table 4, our model attains 78.6% CorLoc on the same dataset, demonstrating improvements of 43.4%, 38.6%, 8.3%, 8.9%, 6.7%, 8.4%, 5.8%, 6.2%, 5.1%, 5.2%, 4.3%, 3.2%, 1.7%, and 3.4% over WSDDN, OICR, MIST, DCL, PCIR, MIG, TCA, CDN, SAE, SPG, PILM, SGPM, HQIM, and ILFR, respectively.
As presented in Table 5, our model achieves 30.5% mAP on the challenging DIOR dataset, surpassing WSDDN, OICR, MIST, DCL, PCIR, MIG, TCA, CDN, SAE, SPG, PILM, SGPM, HQIM, ILFR, and MIDL by 17.2%, 14.0%, 8.3%, 10.3%, 5.6%, 5.4%, 4.7%, 3.8%, 3.4%, 4.7%, 1.9%, 2.0%, 1.6%, 1.4%, and 3.2%, respectively. Similarly, as shown in Table 6, our model attains 56.1% CorLoc on the DIOR dataset, demonstrating improvements of 23.7%, 21.3%, 12.5%, 13.9%, 10.0%, 9.3%, 7.7%, 8.2%, 6.7%, 7.8%, 2.9%, 2.9%, 2.2%, and 3.8% over WSDDN, OICR, MIST, DCL, PCIR, MIG, TCA, CDN, SAE, SPG, PILM, SGPM, HQIM, and ILFR, respectively.
Furthermore, our approach narrows the performance gap between WSOD and fully supervised object detection models across both datasets. The results further confirm the effectiveness of our method, highlighting its advantage over advanced WSOD approaches on two datasets.

4.6. Subjective Evaluation

Figure 7 and Figure 8 illustrate our model’s detection results on the NWPU VHR-10.v2 and DIOR datasets, respectively. We visualize results from 24 test images to demonstrate the model’s effectiveness. Furthermore, to qualitatively compare our model with four representative WSOD approaches (i.e., OICR, MIST, PILM, and SGPM), we present detailed visual analyses in Figure 9. Specifically, as shown in Figure 9a, existing methods fail to detect all airplanes, especially those with arbitrary orientations. In contrast, our model accurately identifies all airplane instances. This improvement is due to two key components: First, the FGSIM strategy leverages feature similarity between instances to mine additional seed instances with lower class scores, thus providing richer supervisory signals. Second, the original RSI and its rotated counterpart are processed by separate ICR branches, allowing for consistent predictions via iterative supervision between ICR branches. In Figure 9b, methods such as MIST, PILM, and SGPM misclassify the shadows of storage tanks as foreground objects. Our model avoids this misclassification thanks to a contrastive loss that enforces intra-class feature similarity while enhancing inter-class distinction. Finally, in Figure 9c,d, other methods show inaccurate localization, with some boxes encompassing background or irrelevant regions. By contrast, our model assigns both accurate categories and precise locations to foreground instances, benefiting from the TAF loss.

4.7. Performance on Natural Images

To further evaluate the generalization capability of our method, we conduct comparisons with two fully supervised object detection methods and four WSOD methods on the PASCAL VOC 2007 dataset [53]. The PASCAL VOC 2007 dataset consists of 9963 images, divided into a training set (2501 images), a validation set (2510 images), and a test set (4952 images). Following standard practice, we use both the training and validation sets for model training and evaluate performance on the test set. The training configuration for PASCAL VOC 2007 is similar to that of the NWPU VHR-10.v2 dataset.
As reported in Table 7, our method achieves a mAP of 58.1% on the PASCAL VOC 2007 test set, outperforming WSDDN, OICR, PCL, and MELM by 23.3%, 16.9%, 14.6%, and 10.8%, respectively. Moreover, as illustrated in Figure 10, our model is able to accurately localize and classify foreground objects, demonstrating both high category precision and bounding box quality.

5. Discussion

In this work, we presented a novel approach to WSOD in RSIs, which addresses two key challenges prevalent in current WSOD models: the tendency to overlook less-obvious objects and the inconsistency between classification and regression tasks. We proposed an FGSIM strategy to effectively select reliable seed instances and a TAF loss to ensure the synchronization of classification and regression. In this section, we will discuss the effectiveness of these contributions and explore their implications.
One of the main challenges in WSOD for RSIs is the tendency of existing models to overlook less-obvious objects because they select the top-scoring proposals as seed instances. This limitation arises because the highest-scoring proposals may correspond to more prominent objects, while smaller, less obvious objects are often missed. Our proposed FGSIM strategy addresses this issue by first selecting multiple high-scoring proposals and then expanding this selection by mining lower-scoring but reliable proposals based on the feature similarities between samples. Furthermore, a contrastive loss is introduced to establish a reliable feature similarity threshold for FGSIM by enforcing intra-class similarity and inter-class distinctiveness. Our experimental results demonstrate the efficacy of the FGSIM strategy in overcoming this challenge. By utilizing both high-scoring and feature-similar low-scoring proposals, our model successfully detects objects that would have otherwise been missed by traditional WSOD models.
Another challenge faced by existing WSOD models is the inconsistency between the classification and regression branches. Many models use separate losses for classification and regression, but these losses are typically optimized independently, which can lead to suboptimal performance. To address this, we introduced the TAF loss, which dynamically adjusts the weights of the classification and regression losses based on the localization and classification difficulty scores. By aligning the two tasks, the TAF loss encourages the model to learn both classification and localization jointly, thus improving overall performance. Our experimental results validate the effectiveness of the TAF loss. The synchronized optimization of classification and regression enables our model to achieve better localization accuracy without sacrificing classification performance.

6. Conclusions

This paper proposes FGSIM, a feature-guided seed instance mining strategy designed to address the challenge where current models often detect only salient objects while overlooking less-obvious ones due to their reliance on selecting the highest-scoring proposal as the seed instance. FGSIM first selects high-scoring proposals as initial seed instances and then expands this set based on a feature similarity measure between samples. To establish a reliable similarity threshold for FGSIM, a contrastive loss is introduced to encourage consistent feature representations within the same category. In addition, we propose a TAF loss to address inconsistencies between the classification and regression branches. The TAF loss leverages classification and localization difficulty scores as weights for the regression and classification losses, respectively, enabling their joint optimization. Additionally, the original RSI and its rotated counterpart are fed into different ICR branches, facilitating consistent predictions for detected instances before and after rotation through supervision between neighboring ICR branches. Ablation studies demonstrate the effectiveness of each component, while comparisons with popular models on two RSI benchmarks verify the overall superiority of our approach.
The proposed method has significant implications for improving WSOD in RSIs, especially in cluttered or complex scenes where conventional methods struggle with non-salient targets. However, the localization accuracy remains dependent on the quality of initial proposals, which are generated using traditional selective search algorithms that may not align well with object boundaries. Future work aims to explore more advanced, segmentation-driven, or learning-based proposal generation mechanisms to obtain high-quality proposals, further improving localization precision.

Author Contributions

Conceptualization, J.T.; formal analysis, C.W. and X.T.; methodology, H.W. and C.W.; project administration, M.Z. and J.T.; resources, J.T. and M.Z.; software, J.T. and C.W.; supervision, H.W. and X.T.; validation, H.W. and X.T.; writing—original draft, C.W. and J.T.; writing—review and editing, C.W. and M.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The NWPU VHR-10.v2 and DIOR datasets are available at following URLs: https://drive.google.com/file/d/15xd4TASVAC2irRf02GA4LqYFbH7QITR-/view (accessed on 20 January 2024) and https://drive.google.com/drive/folders/1UdlgHk49iu6WpcJ5467iT-UqNPpx__CC (accessed on 20 January 2024), respectively.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
WSOD Weakly Supervised Object Detection
RSI Remote Sensing Image
TAF Task-aligned Focal
FGSIM Feature-guided Seed Instance Mining
WSDDN Weakly Supervised Deep Detection Network
OICR Online Instance Classifier Refinement
BBR Bounding Box Regression
RoI Region of Interest
NMS Non-Maximum Suppression

References

  1. Șerban, R.D.; Șerban, M.; He, R.; Jin, H.; Li, Y.; Li, X.; Wang, X.; Li, G. 46-Year (1973–2019) Permafrost Landscape Changes in the Hola Basin, Northeast China Using Machine Learning and Object-Oriented Classification. Remote Sens. 2021, 13, 910. [Google Scholar] [CrossRef]
  2. Li, W.; Yu, Y.; Meng, F.; Duan, J.; Zhang, X. A image fusion and U-Net approach to improving crop planting structure multi-category classification in irrigated area. J. Intell. Fuzzy Syst. 2023, 45, 185–198. [Google Scholar] [CrossRef]
  3. Chen, Q.; Huang, M.; Wang, H. A Feature Discretization Method for Classification of High-Resolution Remote Sensing Images in Coastal Areas. IEEE Trans. Geosci. Remote Sens. 2021, 59, 8584–8598. [Google Scholar] [CrossRef]
  4. Maneepong, K.; Yamanotera, R.; Akiyama, Y.; Miyazaki, H.; Miyazawa, S.; Akiyama, C.M. Towards High-Resolution Population Mapping: Leveraging Open Data, Remote Sensing, and AI for Geospatial Analysis in Developing Country Cities—A Case Study of Bangkok. Remote Sens. 2025, 17, 1204. [Google Scholar] [CrossRef]
  5. Somanath, S.; Naserentin, V.; Eleftheriou, O.; Sjölie, D.; Wästberg, B.S.; Logg, A. Towards Urban Digital Twins: A Workflow for Procedural Visualization Using Geospatial Data. Remote Sens. 2024, 16, 1939. [Google Scholar] [CrossRef]
  6. Tian, T.; Pan, M.; Zhang, F.; Cong, W.; Han, X.; Zhang, J. A 3D GIS-based underground construction deformation display system. In Proceedings of the 2010 18th International Conference on Geoinformatics, Beijing, China, 18–20 June 2010; pp. 1–6. [Google Scholar] [CrossRef]
  7. Zeng, B.; Gao, S.; Xu, Y.; Zhang, Z.; Li, F.; Wang, C. Detection of Military Targets on Ground and Sea by UAVs with Low-Altitude Oblique Perspective. Remote Sens. 2024, 16, 1288. [Google Scholar] [CrossRef]
  8. Qian, X.; Lin, C.; Chen, Z.; Wang, W. SAM-Induced Pseudo Fully Supervised Learning for Weakly Supervised Object Detection in Remote Sensing Images. Remote Sens. 2024, 16, 1532. [Google Scholar] [CrossRef]
  9. Yao, X.; Feng, X.; Han, J.; Cheng, G.; Guo, L. Automatic weakly supervised object detection from high spatial resolution remote sensing images via dynamic curriculum learning. IEEE Trans. Geosci. Remote Sens. 2020, 59, 675–685. [Google Scholar] [CrossRef]
  10. Fasana, C.; Pasini, S.; Milani, F.; Fraternali, P. Weakly Supervised Object Detection for Remote Sensing Images: A Survey. Remote Sens. 2022, 14, 5362. [Google Scholar] [CrossRef]
  11. Qian, X.; Wu, B.; Cheng, G.; Yao, X.; Wang, W.; Han, J. Building a Bridge of Bounding Box Regression Between Oriented and Horizontal Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5605209. [Google Scholar] [CrossRef]
  12. Shi, W.; Zhang, S.; Zhang, S. CAW-YOLO: Cross-Layer Fusion and Weighted Receptive Field-Based YOLO for Small Object Detection in Remote Sensing. CMES-Comput. Model. Eng. Sci. 2024, 139, 3209–3231. [Google Scholar] [CrossRef]
  13. Zhou, S.; Liu, Z.; Luo, H.; Qi, G.; Liu, Y.; Zuo, H.; Zhang, J.; Wei, Y. GCA2Net: Global-Consolidation and Angle-Adaptive Network for Oriented Object Detection in Aerial Imagery. Remote Sens. 2025, 17, 1077. [Google Scholar] [CrossRef]
  14. Shi, R.; Zhang, L.; Wang, G.; Jia, S.; Zhang, N.; Wang, C. GD-Det: Low-Data Object Detection in Foggy Scenarios for Unmanned Aerial Vehicle Imagery Using Re-Parameterization and Cross-Scale Gather-and-Distribute Mechanisms. Remote Sens. 2025, 17, 783. [Google Scholar] [CrossRef]
  15. Bilen, H.; Vedaldi, A. Weakly supervised deep detection networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2846–2854. [Google Scholar]
  16. Tang, P.; Wang, X.; Bai, X.; Liu, W. Multiple instance detection network with online instance classifier refinement. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2843–2851. [Google Scholar]
  17. Huang, Z.; Zou, Y.; Kumar, B.V.K.V.; Huang, D. Comprehensive Attention Self-Distillation for Weakly-Supervised Object Detection. In Proceedings of the Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Newry, UK, 2020; Volume 33, pp. 16797–16807. [Google Scholar]
  18. Feng, X.; Han, J.; Yao, X.; Cheng, G. Progressive contextual instance refinement for weakly supervised object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8002–8012. [Google Scholar] [CrossRef]
  19. Feng, X.; Han, J.; Yao, X.; Cheng, G. TCANet: Triple Context-Aware Network for Weakly Supervised Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2021, 59, 6946–6955. [Google Scholar] [CrossRef]
  20. Feng, X.; Yao, X.; Cheng, G.; Han, J. Weakly Supervised Rotation-Invariant Aerial Object Detection Network. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 14126–14135. [Google Scholar] [CrossRef]
  21. Wu, Z.; Wen, J.; Xu, Y.; Yang, J.; Zhang, D. Multiple Instance Detection Networks With Adaptive Instance Refinement. IEEE Trans. Multimed. 2023, 25, 267–279. [Google Scholar] [CrossRef]
  22. Huo, Y.; Qian, X.; Li, C.; Wang, W. Multiple Instance Complementary Detection and Difficulty Evaluation for Weakly Supervised Object Detection in Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 6006505. [Google Scholar] [CrossRef]
  23. Qian, X.; Li, C.; Wang, W.; Yao, X.; Cheng, G. Semantic segmentation guided pseudo label mining and instance re-detection for weakly supervised object detection in remote sensing images. Int. J. Appl. Earth Obs. Geoinf. 2023, 119, 103301. [Google Scholar] [CrossRef]
  24. Ren, Z.; Yu, Z.; Yang, X.; Liu, M.Y.; Lee, Y.J.; Schwing, A.G.; Kautz, J. Instance-aware, context-focused, and memory-efficient weakly supervised object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10598–10607. [Google Scholar]
  25. Qian, X.; Wang, C.; Li, C.; Li, Z.; Zeng, L.; Wang, W.; Wu, Q. Multiscale Image Splitting Based Feature Enhancement and Instance Difficulty Aware Training for Weakly Supervised Object Detection in Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 7497–7506. [Google Scholar] [CrossRef]
  26. Seo, J.; Bae, W.; Sutherland, D.J.; Noh, J.; Kim, D. Object Discovery via Contrastive Learning for Weakly Supervised Object Detection. In Proceedings of the Computer Vision—ECCV 2022—17th European Conference, Tel Aviv, Israel, 23–27 October 2022; pp. 312–329. [Google Scholar]
  27. Qian, X.; Wang, C.; Wang, W.; Yao, X.; Cheng, G. Complete and Invariant Instance Classifier Refinement for Weakly Supervised Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5627713. [Google Scholar] [CrossRef]
  28. Chang, S.; Deng, Y.; Zhang, Y.; Zhao, Q.; Wang, R.; Zhang, K. An Advanced Scheme for Range Ambiguity Suppression of Spaceborne SAR Based on Blind Source Separation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5230112. [Google Scholar] [CrossRef]
  29. Qian, X.; Huo, Y.; Cheng, G.; Yao, X.; Li, K.; Ren, H.; Wang, W. Incorporating the Completeness and Difficulty of Proposals Into Weakly Supervised Object Detection in Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 1902–1911. [Google Scholar] [CrossRef]
  30. Uijlings, J.R.; Van De Sande, K.E.; Gevers, T.; Smeulders, A.W. Selective search for object recognition. Int. J. Comput. Vis. 2013, 104, 154–171. [Google Scholar] [CrossRef]
  31. Liu, B.; Gao, Y.; Guo, N.; Ye, X.; Wan, F.; You, H.; Fan, D. Utilizing the Instability in Weakly Supervised Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  32. Chen, Z.; Fu, Z.; Jiang, R.; Chen, Y.; Hua, X.S. SLV: Spatial Likelihood Voting for Weakly Supervised Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  33. Feng, X.; Yao, X.; Shen, H.; Cheng, G.; Xiao, B.; Han, J. Learning an Invariant and Equivariant Network for Weakly Supervised Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 11977–11992. [Google Scholar] [CrossRef]
  34. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  35. Tang, P.; Wang, X.; Bai, S.; Shen, W.; Bai, X.; Liu, W.; Yuille, A. PCL: Proposal Cluster Learning for Weakly Supervised Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 176–191. [Google Scholar] [CrossRef]
  36. Xing, P.; Huang, M.; Wang, C.; Cao, Y. High-Quality Instance Mining and Weight Re-Assigning for Weakly Supervised Object Detection in Remote Sensing Images. Electronics 2024, 13, 4753. [Google Scholar] [CrossRef]
  37. Cheng, G.; Xie, X.; Chen, W.; Feng, X.; Yao, X.; Han, J. Self-Guided Proposal Generation for Weakly Supervised Object Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5625311. [Google Scholar] [CrossRef]
  38. Qian, X.; Huo, Y.; Cheng, G.; Gao, C.; Yao, X.; Wang, W. Mining High-Quality Pseudoinstance Soft Labels for Weakly Supervised Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5607615. [Google Scholar] [CrossRef]
  39. Wang, B.; Zhao, Y.; Li, X. Multiple instance graph learning for weakly supervised remote sensing object detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5613112. [Google Scholar] [CrossRef]
  40. Hosang, J.; Benenson, R.; Schiele, B. Learning non-maximum suppression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4507–4515. [Google Scholar]
  41. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised Contrastive Learning. arXiv 2021, arXiv:2004.11362. [Google Scholar]
  42. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007. [Google Scholar] [CrossRef]
  43. Cheng, G.; Zhou, P.; Han, J. Learning Rotation-Invariant Convolutional Neural Networks for Object Detection in VHR Optical Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2016, 54, 7405–7415. [Google Scholar] [CrossRef]
  44. Li, K.; Cheng, G.; Bu, S.; You, X. Rotation-Insensitive and Context-Augmented Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2018, 56, 2337–2348. [Google Scholar] [CrossRef]
  45. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  46. Deselaers, T.; Alexe, B.; Ferrari, V. Weakly supervised localization and learning with generic knowledge. Int. J. Comput. Vis. 2012, 100, 275–293. [Google Scholar] [CrossRef]
  47. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  48. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the Conference Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105. [Google Scholar]
  49. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
  50. Feng, X.; Yao, X.; Cheng, G.; Han, J.; Han, J. SAENet: Self-Supervised Adversarial and Equivariant Network for Weakly Supervised Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5610411. [Google Scholar] [CrossRef]
  51. Zheng, S.; Wu, Z.; Xu, Y.; Wei, Z. Weakly Supervised Object Detection for Remote Sensing Images via Progressive Image-Level and Instance-Level Feature Refinement. Remote Sens. 2024, 16, 1203. [Google Scholar] [CrossRef]
  52. Guo, C.; Ma, Z.; Zhao, Y.; Cao, C.; Jiang, Z.; Zhang, H. Multi-instance mining with dynamic localization for weakly supervised object detection in remote-sensing images. Int. J. Remote Sens. 2025, 46, 3487–3512. [Google Scholar] [CrossRef]
  53. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  54. Wan, F.; Wei, P.; Jiao, J.; Han, Z.; Ye, Q. Min-entropy latent model for weakly supervised object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1297–1306. [Google Scholar]
Figure 1. The framework of our model.
Figure 2. Parameter analysis of q on the NWPU VHR-10.v2 (a) and DIOR (b) datasets.
Figure 3. Parameter analysis of ε on the NWPU VHR-10.v2 (a) and DIOR (b) datasets.
Figure 4. Parameter analysis of δ on the NWPU VHR-10.v2 (a) and DIOR (b) datasets.
Figure 5. Qualitative comparison of detection results between the Baseline and Baseline + FGSIM on the NWPU VHR-10.v2 dataset.
Figure 6. Qualitative comparison of detection results between the Baseline and Baseline + TAF loss on the NWPU VHR-10.v2 dataset.
Figure 7. Detection results of our model on the NWPU VHR-10.v2 dataset.
Figure 8. Detection results of our model on the DIOR dataset.
Figure 9. Qualitative comparison of detection results between our model and advanced WSOD models on two RSI benchmarks. (a,d) are sampled from the NWPU VHR-10.v2 dataset, while (b,c) are from the DIOR dataset.
Figure 10. Detection results of our model on the PASCAL VOC 2007 dataset.
Table 1. Detailed architecture of our convolutional network.
ConvNet
Conv2d (kernel 3, stride 1, padding 1)-64, ReLU
Conv2d (kernel 3, stride 1, padding 1)-64, ReLU
Max Pooling (kernel 2, stride 1, padding 0)
Conv2d (kernel 3, stride 1, padding 1)-128, ReLU
Conv2d (kernel 3, stride 1, padding 1)-128, ReLU
Max Pooling (kernel 2, stride 1, padding 0)
Conv2d (kernel 3, stride 1, padding 1)-256, ReLU
Conv2d (kernel 3, stride 1, padding 1)-256, ReLU
Conv2d (kernel 3, stride 1, padding 1)-256, ReLU
Max Pooling (kernel 2, stride 1, padding 0)
Conv2d (kernel 3, stride 1, padding 1)-512, ReLU
Conv2d (kernel 3, stride 1, padding 1)-512, ReLU
Conv2d (kernel 3, stride 1, padding 1)-512, ReLU
Max Pooling (kernel 2, stride 1, padding 0)
Conv2d (kernel 3, stride 1, padding 1)-512, ReLU
Conv2d (kernel 3, stride 1, padding 1)-512, ReLU
Conv2d (kernel 3, stride 1, padding 1)-512, ReLU
Region of Interest (RoI) Pooling
FC-4096, ReLU, FC-4096, ReLU
WSDDN branch: FC-(number of classes), Softmax; FC-(number of classes), Softmax
OICR branch: [FC-(number of classes + 1), Softmax] × B
BBR branch: [FC-((number of classes + 1) × 4)] × B
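To make the layer listing in Table 1 easier to follow, the sketch below expresses it as a PyTorch module: a VGG-style convolutional backbone, RoI pooling over proposals, two shared FC layers, and the three heads (WSDDN, OICR refinement, and bounding-box regression, BBR). This is a minimal illustration rather than the released implementation; the class count, the number of refinement/regression branches B, the RoI output size, and the use of standard stride-2 max pooling are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torchvision.ops as ops

class WSODNet(nn.Module):
    def __init__(self, num_classes=10, num_refine=3, roi_size=7):
        super().__init__()
        # Convolutional layers of Table 1 ("M" marks a max-pooling layer;
        # stride-2 pooling is assumed here, as in standard VGG-16).
        cfg = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
               512, 512, 512, "M", 512, 512, 512]
        layers, in_ch = [], 3
        for v in cfg:
            if v == "M":
                layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
            else:
                layers += [nn.Conv2d(in_ch, v, 3, stride=1, padding=1),
                           nn.ReLU(inplace=True)]
                in_ch = v
        self.backbone = nn.Sequential(*layers)
        self.roi_size = roi_size
        # Two shared fully connected layers after RoI pooling.
        self.fc = nn.Sequential(
            nn.Linear(512 * roi_size * roi_size, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True))
        # WSDDN branch: parallel classification and detection streams.
        self.cls_stream = nn.Linear(4096, num_classes)
        self.det_stream = nn.Linear(4096, num_classes)
        # OICR refinement branches and bounding-box regression branches (× B).
        self.refine = nn.ModuleList(nn.Linear(4096, num_classes + 1) for _ in range(num_refine))
        self.bbox = nn.ModuleList(nn.Linear(4096, (num_classes + 1) * 4) for _ in range(num_refine))

    def forward(self, images, rois):
        # rois: list with one (num_proposals, 4) box tensor per image.
        feats = self.backbone(images)
        pooled = ops.roi_pool(feats, rois, output_size=self.roi_size, spatial_scale=1.0 / 16)
        x = self.fc(pooled.flatten(1))
        # WSDDN scores: softmax over classes times softmax over proposals
        # (a single image per batch is assumed for the proposal softmax).
        wsddn = torch.softmax(self.cls_stream(x), dim=1) * torch.softmax(self.det_stream(x), dim=0)
        refine_scores = [torch.softmax(head(x), dim=1) for head in self.refine]
        bbox_deltas = [head(x) for head in self.bbox]
        return wsddn, refine_scores, bbox_deltas
```

The three heads mirror the last three rows of Table 1; only the shared backbone and FC features differ from a plain WSDDN/OICR stack in that each refinement branch is paired with a regression branch.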
Table 2. Ablation study of FGSIM and TAF loss on the NWPU VHR-10.v2 dataset.
Method | mAP | CorLoc
Baseline | 49.0 | 61.5
Baseline + FGSIM | 65.3 | 73.5
Baseline + TAF | 61.7 | 72.9
Baseline + FGSIM + TAF | 72.5 | 78.6
Table 3. Comparisons of different works in terms of mAP (%) on the NWPU VHR-10.v2 dataset.
Method | Airplane | Ship | Storage Tank | Baseball Diamond | Tennis Court | Basketball Court | Ground Track Field | Harbor | Bridge | Vehicle | mAP
Fast R-CNN [34] | 90.9 | 90.6 | 89.3 | 47.3 | 100.0 | 85.9 | 84.9 | 88.2 | 80.3 | 69.8 | 82.7
Faster R-CNN [49] | 90.9 | 86.3 | 90.5 | 98.2 | 89.7 | 69.6 | 100.0 | 80.1 | 61.5 | 78.1 | 84.5
WSDDN [15] | 30.1 | 41.7 | 35.0 | 88.9 | 12.9 | 23.9 | 99.4 | 13.9 | 1.9 | 3.6 | 35.1
OICR [16] | 13.7 | 67.4 | 57.2 | 55.2 | 13.6 | 39.7 | 92.8 | 0.2 | 1.8 | 3.7 | 34.5
MIST [24] | 69.7 | 49.2 | 48.6 | 80.9 | 27.1 | 79.9 | 91.3 | 47.0 | 8.3 | 13.4 | 51.5
DCL [9] | 72.7 | 74.3 | 37.1 | 82.6 | 36.9 | 42.3 | 84.0 | 39.6 | 16.8 | 35.0 | 52.1
PCIR [18] | 90.8 | 78.8 | 36.4 | 90.8 | 22.6 | 52.2 | 88.5 | 42.4 | 11.7 | 35.5 | 55.0
MIG [39] | 88.7 | 71.6 | 75.2 | 94.2 | 37.5 | 47.7 | 100.0 | 27.3 | 8.3 | 9.1 | 56.0
TCA [19] | 89.4 | 78.2 | 78.4 | 90.8 | 35.3 | 50.4 | 90.9 | 42.4 | 4.1 | 28.3 | 58.8
CDN [22] | 82.9 | 79.0 | 46.1 | 90.9 | 35.8 | 77.6 | 100.0 | 45.4 | 2.2 | 21.3 | 58.1
SAE [50] | 82.9 | 74.5 | 50.2 | 96.7 | 55.7 | 72.9 | 100.0 | 36.5 | 6.3 | 31.9 | 60.7
SPG [37] | 90.4 | 81.0 | 59.5 | 92.3 | 35.6 | 51.4 | 99.9 | 58.7 | 17.0 | 43.0 | 62.8
PILM [38] | 87.6 | 81.0 | 57.3 | 94.0 | 36.4 | 80.4 | 100.0 | 56.9 | 9.8 | 35.6 | 63.8
SGPM [23] | 90.7 | 79.9 | 69.3 | 97.5 | 41.6 | 77.5 | 100.0 | 44.4 | 17.2 | 33.5 | 65.2
HQIM [36] | 90.8 | 80.5 | 73.4 | 96.6 | 47.4 | 78.9 | 100.0 | 43.9 | 18.4 | 33.5 | 66.2
ILFR [51] | 90.8 | 81.6 | 56.6 | 91.7 | 51.9 | 69.5 | 100.0 | 53.4 | 16.3 | 40.5 | 65.2
MIDL [52] | 90.1 | 65.3 | 67.4 | 90.9 | 50.2 | 62.8 | 99.8 | 33.2 | 16.7 | 42.3 | 61.9
Ours | 90.8 | 81.4 | 72.9 | 94.7 | 56.4 | 81.8 | 100.0 | 68.7 | 15.3 | 62.7 | 72.5
Table 4. Comparisons of different works in terms of CorLoc (%) on the NWPU VHR-10.v2 dataset.
Method | Airplane | Ship | Storage Tank | Baseball Diamond | Tennis Court | Basketball Court | Ground Track Field | Harbor | Bridge | Vehicle | CorLoc
WSDDN [15] | 22.3 | 36.8 | 40.0 | 92.5 | 18.0 | 24.2 | 99.3 | 14.8 | 1.7 | 2.9 | 35.2
OICR [16] | 29.4 | 83.3 | 20.5 | 81.8 | 40.9 | 32.1 | 86.6 | 7.4 | 3.7 | 14.4 | 40.0
MIST [24] | 90.2 | 82.5 | 80.3 | 98.6 | 48.5 | 87.4 | 98.3 | 66.5 | 14.6 | 35.8 | 70.3
DCL [9] | - | - | - | - | - | - | - | - | - | - | 69.7
PCIR [18] | 100.0 | 93.1 | 64.1 | 99.3 | 64.8 | 79.3 | 89.7 | 63.0 | 13.3 | 52.2 | 71.9
MIG [39] | 97.8 | 90.3 | 87.2 | 98.7 | 54.9 | 64.2 | 100.0 | 74.1 | 13.0 | 21.6 | 70.2
TCA [19] | 96.9 | 91.8 | 95.1 | 88.7 | 66.9 | 62.8 | 96.0 | 54.2 | 19.6 | 55.5 | 72.8
CDN [22] | 96.5 | 81.3 | 67.1 | 95.3 | 66.9 | 86.3 | 100.0 | 74.2 | 10.3 | 46.5 | 72.4
SAE [50] | 97.1 | 91.7 | 87.8 | 98.7 | 40.9 | 81.1 | 100.0 | 70.4 | 14.8 | 52.2 | 73.5
SPG [37] | 98.1 | 92.7 | 70.1 | 99.7 | 51.9 | 80.1 | 96.2 | 72.4 | 13.0 | 60.0 | 73.4
PILM [38] | 94.4 | 86.6 | 68.5 | 97.8 | 69.8 | 87.5 | 100.0 | 68.6 | 16.0 | 56.6 | 74.3
SGPM [23] | 98.2 | 93.8 | 89.3 | 99.1 | 50.2 | 88.9 | 100.0 | 71.0 | 12.3 | 51.2 | 75.4
HQIM [36] | 99.3 | 94.8 | 91.7 | 98.0 | 58.4 | 91.7 | 100.0 | 68.8 | 13.6 | 56.7 | 76.9
ILFR [51] | 98.6 | 94.7 | 76.4 | 83.9 | 61.3 | 82.4 | 100.0 | 78.1 | 19.5 | 57.2 | 75.2
Ours | 92.1 | 92.9 | 87.2 | 94.1 | 68.4 | 90.7 | 100.0 | 78.9 | 18.9 | 62.7 | 78.6
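For readers unfamiliar with the CorLoc metric reported in this table (and in Table 6 below), the sketch shows the standard computation: for each class, the fraction of images containing that class whose single top-scoring detection overlaps a ground-truth box with IoU of at least 0.5. The dictionary-based data layout and function names are illustrative assumptions, not part of the paper's code.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def corloc(top_detections, ground_truths, thresh=0.5):
    """CorLoc (%) for one class.
    top_detections: {image_id: box} with the highest-scoring box per image;
    ground_truths:  {image_id: [boxes]} for images that contain the class."""
    hits = sum(
        img in top_detections and any(iou(top_detections[img], gt) >= thresh for gt in gts)
        for img, gts in ground_truths.items()
    )
    return 100.0 * hits / max(len(ground_truths), 1)
```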
Table 5. Comparisons of different works in terms of mAP (%) on the DIOR dataset.
Method | Airplane | Airport | Baseball Field | Basketball Court | Bridge | Chimney | Dam | Expressway Service Area | Expressway Toll Station | Golf Field
Fast R-CNN [34] | 44.2 | 66.8 | 67.0 | 60.5 | 15.6 | 72.3 | 52.0 | 65.9 | 44.8 | 72.1
Faster R-CNN [49] | 50.3 | 62.6 | 66.0 | 80.9 | 28.8 | 68.2 | 47.3 | 58.5 | 48.1 | 60.4
WSDDN [15] | 9.1 | 39.7 | 37.8 | 20.2 | 0.3 | 12.3 | 0.6 | 0.7 | 11.9 | 4.9
OICR [16] | 8.7 | 28.3 | 44.1 | 18.2 | 1.3 | 20.2 | 0.1 | 0.7 | 29.9 | 13.8
MIST [24] | 32.0 | 39.9 | 62.7 | 29.0 | 7.5 | 12.9 | 0.3 | 5.1 | 17.4 | 51.0
DCL [9] | 20.9 | 22.7 | 54.2 | 11.5 | 6.0 | 61.0 | 0.1 | 1.1 | 31.0 | 30.9
PCIR [18] | 30.4 | 36.1 | 54.2 | 26.6 | 9.1 | 58.6 | 0.2 | 9.7 | 36.2 | 32.6
MIG [39] | 22.2 | 52.6 | 62.8 | 25.8 | 8.5 | 67.4 | 0.7 | 8.9 | 28.7 | 57.3
TCA [19] | 25.1 | 30.8 | 62.9 | 40.0 | 4.1 | 67.8 | 8.1 | 23.8 | 29.9 | 22.3
CDN [22] | 31.1 | 15.8 | 10.7 | 54.5 | 15.7 | 33.9 | 3.7 | 12.3 | 37.2 | 12.0
SAE [50] | 20.6 | 62.4 | 62.7 | 23.5 | 7.6 | 64.6 | 0.2 | 34.5 | 30.6 | 55.4
SPG [37] | 31.3 | 36.7 | 62.8 | 29.1 | 6.1 | 62.7 | 0.3 | 15.0 | 30.1 | 35.0
PILM [38] | 29.1 | 49.8 | 70.9 | 41.4 | 7.2 | 45.5 | 0.2 | 35.4 | 36.8 | 60.8
SGLM [23] | 39.1 | 64.6 | 64.4 | 26.9 | 6.3 | 62.3 | 0.9 | 12.2 | 26.3 | 55.3
HQIM [36] | 42.2 | 65.0 | 66.2 | 25.7 | 6.7 | 60.2 | 1.3 | 13.5 | 25.3 | 57.8
ILFR [51] | 32.9 | 70.5 | 63.2 | 45.7 | 0.2 | 69.7 | 0.2 | 12.4 | 39.4 | 56.4
MIDL [52] | 38.1 | 12.3 | 62.9 | 27.4 | 10.7 | 62.0 | 0.1 | 0.6 | 32.2 | 49.4
Ours | 37.5 | 62.7 | 60.9 | 30.2 | 9.2 | 63.4 | 1.9 | 16.8 | 37.6 | 54.7
Method | Ground Track Field | Harbor | Overpass | Ship | Stadium | Storage Tank | Tennis Court | Train Station | Vehicle | Windmill | mAP
Fast R-CNN [34] | 62.9 | 46.2 | 38.0 | 32.1 | 71.0 | 35.0 | 58.3 | 37.9 | 19.2 | 38.1 | 50.0
Faster R-CNN [49] | 67.0 | 43.9 | 46.9 | 58.5 | 52.4 | 42.4 | 79.5 | 48.0 | 34.8 | 65.4 | 55.5
WSDDN [15] | 42.5 | 4.7 | 1.1 | 0.7 | 63.0 | 4.0 | 6.1 | 0.5 | 4.6 | 1.1 | 13.3
OICR [16] | 57.4 | 10.7 | 11.1 | 9.1 | 59.3 | 7.1 | 0.7 | 0.1 | 9.1 | 0.4 | 16.5
MIST [24] | 49.5 | 5.4 | 12.2 | 29.4 | 35.5 | 25.4 | 0.8 | 4.6 | 22.2 | 0.8 | 22.2
DCL [9] | 56.5 | 5.1 | 2.7 | 9.1 | 63.7 | 9.1 | 10.4 | 0.0 | 7.3 | 0.8 | 20.2
PCIR [18] | 58.5 | 8.6 | 21.6 | 12.1 | 64.3 | 9.1 | 13.6 | 0.3 | 9.1 | 7.5 | 24.9
MIG [39] | 47.7 | 23.8 | 0.8 | 6.4 | 54.1 | 13.2 | 4.1 | 14.8 | 0.2 | 2.4 | 25.1
TCA [19] | 53.9 | 24.8 | 11.1 | 9.1 | 46.4 | 13.7 | 31.0 | 1.5 | 9.1 | 1.0 | 25.8
CDN [22] | 44.2 | 60.9 | 0.7 | 12.3 | 37.2 | 37.5 | 19.2 | 52.6 | 1.1 | 2.9 | 26.7
SAE [50] | 52.7 | 17.6 | 6.9 | 9.1 | 51.6 | 15.4 | 1.7 | 14.4 | 1.4 | 9.2 | 27.1
SPG [37] | 48.0 | 27.1 | 12.0 | 10.0 | 60.0 | 15.1 | 21.0 | 9.9 | 3.2 | 0.1 | 25.8
PILM [38] | 48.5 | 14.0 | 25.1 | 18.5 | 48.9 | 11.7 | 11.9 | 3.5 | 11.3 | 1.7 | 28.6
SGLM [23] | 60.6 | 9.4 | 23.1 | 13.4 | 57.4 | 17.7 | 1.5 | 14.0 | 11.5 | 3.5 | 28.5
HQIM [36] | 61.4 | 10.4 | 20.4 | 14.1 | 58.6 | 18.9 | 2.2 | 13.6 | 10.0 | 4.7 | 28.9
ILFR [51] | 55.3 | 16.6 | 0.6 | 9.1 | 54.8 | 18.1 | 11.0 | 16.1 | 9.1 | 1.1 | 29.1
MIDL [52] | 64.7 | 7.4 | 29.0 | 9.1 | 53.8 | 11.3 | 57.0 | 7.7 | 1.6 | 0.5 | 27.3
Ours | 57.9 | 23.7 | 24.9 | 17.2 | 55.9 | 11.3 | 12.1 | 14.9 | 12.8 | 3.6 | 30.5
Table 6. Comparisons of different works in terms of CorLoc (%) on the DIOR dataset.
Method | Airplane | Airport | Baseball Field | Basketball Court | Bridge | Chimney | Dam | Expressway Service Area | Expressway Toll Station | Golf Field
WSDDN [15] | 5.7 | 59.9 | 94.2 | 55.9 | 4.9 | 23.4 | 1.0 | 6.8 | 44.5 | 12.8
OICR [16] | 16.0 | 51.5 | 94.8 | 55.8 | 3.6 | 23.9 | 0.0 | 4.8 | 56.7 | 22.4
MIST [24] | 91.6 | 53.2 | 93.5 | 66.3 | 10.8 | 30.7 | 1.5 | 14.3 | 35.2 | 47.5
DCL [9] | - | - | - | - | - | - | - | - | - | -
PCIR [18] | 93.1 | 45.6 | 95.5 | 68.3 | 3.6 | 92.1 | 0.2 | 5.4 | 58.4 | 47.5
MIG [39] | 77.0 | 46.9 | 95.4 | 63.6 | 23.0 | 95.1 | 0.2 | 17.0 | 57.9 | 50.8
TCA [19] | 81.6 | 51.3 | 96.2 | 73.5 | 5.0 | 94.7 | 15.9 | 32.8 | 46.0 | 48.6
CDN [22] | 85.6 | 58.7 | 95.6 | 95.6 | 10.5 | 96.2 | 0.5 | 17.6 | 60.7 | 45.9
SAE [50] | 91.2 | 69.4 | 95.5 | 67.5 | 18.9 | 97.8 | 0.2 | 70.5 | 54.3 | 51.4
SPG [37] | 80.5 | 32.0 | 98.7 | 65.0 | 15.2 | 96.1 | 22.5 | 17.0 | 46.1 | 51.0
PILM [38] | 85.5 | 68.9 | 96.8 | 75.8 | 11.6 | 94.7 | 0.8 | 67.5 | 60.5 | 46.5
SGLM [23] | 92.2 | 58.3 | 97.8 | 74.2 | 16.2 | 95.2 | 0.3 | 51.3 | 56.2 | 52.3
HQIM [36] | 94.1 | 59.3 | 98.1 | 70.5 | 17.6 | 94.6 | 4.8 | 52.2 | 54.3 | 53.5
ILFR [51] | 88.3 | 69.1 | 98.8 | 69.4 | 19.9 | 97.8 | 0.3 | 24.7 | 56.2 | 54.4
Ours | 91.9 | 70.4 | 98.5 | 71.6 | 18.2 | 92.4 | 1.9 | 67.6 | 59.4 | 59.7
Method | Ground Track Field | Harbor | Overpass | Ship | Stadium | Storage Tank | Tennis Court | Train Station | Vehicle | Windmill | CorLoc
WSDDN [15] | 89.9 | 5.5 | 10.0 | 23.0 | 98.5 | 79.6 | 15.1 | 3.5 | 11.6 | 3.2 | 32.4
OICR [16] | 91.4 | 18.2 | 18.7 | 31.8 | 98.3 | 81.3 | 7.5 | 1.2 | 15.8 | 2.0 | 34.8
MIST [24] | 87.1 | 38.6 | 23.4 | 50.7 | 80.5 | 89.2 | 22.4 | 11.5 | 22.2 | 2.4 | 43.6
DCL [9] | - | - | - | - | - | - | - | - | - | - | 42.2
PCIR [18] | 88.6 | 15.8 | 5.2 | 39.5 | 98.1 | 85.6 | 13.4 | 56.5 | 9.7 | 0.6 | 46.1
MIG [39] | 89.4 | 42.1 | 19.8 | 37.9 | 97.9 | 80.7 | 13.8 | 10.3 | 10.5 | 6.9 | 46.8
TCA [19] | 85.3 | 38.9 | 20.2 | 30.6 | 84.6 | 91.5 | 56.3 | 3.8 | 10.5 | 1.3 | 48.4
CDN [22] | 90.6 | 51.5 | 17.4 | 45.7 | 89.7 | 72.0 | 11.3 | 10.6 | 17.9 | 6.5 | 47.9
SAE [50] | 88.3 | 48.0 | 2.3 | 33.6 | 14.1 | 83.4 | 65.6 | 19.9 | 16.4 | 2.9 | 49.4
SPG [37] | 89.2 | 49.5 | 22.0 | 35.2 | 98.6 | 90.0 | 32.6 | 12.7 | 10.0 | 2.3 | 48.3
PILM [38] | 75.2 | 50.5 | 28.3 | 39.7 | 92.6 | 77.0 | 55.1 | 10.1 | 20.9 | 5.6 | 53.2
SGLM [23] | 91.7 | 48.6 | 23.0 | 32.7 | 98.8 | 89.3 | 43.5 | 19.5 | 18.3 | 4.0 | 53.2
HQIM [36] | 93.0 | 49.7 | 22.0 | 34.8 | 99.0 | 90.4 | 44.1 | 18.3 | 17.5 | 10.5 | 53.9
ILFR [51] | 89.6 | 49.1 | 19.4 | 34.5 | 96.7 | 84.7 | 63.2 | 15.6 | 11.6 | 3.4 | 52.3
Ours | 91.1 | 52.1 | 23.7 | 44.7 | 98.7 | 86.9 | 45.2 | 20.9 | 18.7 | 8.7 | 56.1
Table 7. Comparisons with popular models on the PASCAL VOC 2007 dataset in terms of mAP (%).
Method | Aeroplane | Bicycle | Bird | Boat | Bottle | Bus | Car | Cat | Chair | Cow
Fast R-CNN [34] | 74.5 | 78.3 | 69.2 | 53.2 | 36.6 | 77.3 | 78.2 | 82.0 | 40.7 | 72.7
Faster R-CNN [49] | 70.0 | 80.6 | 70.1 | 57.3 | 49.9 | 78.2 | 80.4 | 82.0 | 52.2 | 75.3
WSDDN [15] | 39.4 | 50.1 | 31.5 | 16.3 | 12.6 | 64.5 | 42.8 | 42.6 | 10.1 | 35.7
OICR [16] | 58.0 | 62.4 | 31.1 | 19.4 | 13.0 | 65.1 | 62.2 | 28.4 | 24.8 | 44.7
PCL [35] | 54.4 | 69.0 | 39.3 | 19.2 | 15.7 | 62.9 | 64.4 | 30.0 | 25.1 | 52.5
MELM [54] | 55.6 | 66.9 | 34.2 | 29.1 | 16.4 | 68.8 | 68.1 | 43.0 | 25.0 | 65.6
Ours | 69.7 | 70.1 | 54.6 | 39.5 | 33.6 | 70.2 | 78.7 | 72.1 | 34.1 | 70.5
Method | Diningtable | Dog | Horse | Motorbike | Person | Pottedplant | Sheep | Sofa | Train | Tvmonitor | mAP
Fast R-CNN [34] | 67.9 | 79.6 | 79.2 | 73.0 | 69.0 | 30.1 | 65.4 | 70.2 | 75.8 | 65.8 | 66.9
Faster R-CNN [49] | 67.2 | 80.3 | 79.8 | 75.0 | 76.3 | 39.1 | 68.3 | 67.3 | 81.1 | 67.6 | 69.9
WSDDN [15] | 24.9 | 38.2 | 34.4 | 55.6 | 9.4 | 14.7 | 30.2 | 40.7 | 54.7 | 46.9 | 34.8
OICR [16] | 30.6 | 25.3 | 37.8 | 65.5 | 15.7 | 24.1 | 41.7 | 46.9 | 64.3 | 62.6 | 41.2
PCL [35] | 44.4 | 19.6 | 39.3 | 67.7 | 17.8 | 22.9 | 46.6 | 57.5 | 58.6 | 63.0 | 43.5
MELM [54] | 45.3 | 53.2 | 49.6 | 68.6 | 2.0 | 25.4 | 52.5 | 56.8 | 62.1 | 57.1 | 47.3
Ours | 45.9 | 64.5 | 63.1 | 75.7 | 31.8 | 27.9 | 59.2 | 61.5 | 69.8 | 68.9 | 58.1
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
