High-Quality Instance Mining and Dynamic Label Assignment for Weakly Supervised Object Detection in Remote Sensing Images

Abstract: Weakly supervised object detection (WSOD) in remote sensing images (RSIs) has attracted increasing attention because its training relies only on image-level category labels, which significantly reduces the cost of manual annotation. Although WSOD has produced many promising results, most WSOD methods still face two challenges. First, the detection results of WSOD tend to locate the most discriminative region of an object rather than the whole object. Second, the traditional pseudo-instance label assignment strategy cannot adapt to the change in the quality distribution of proposals during training, which hinders the training of a high-performance detector. To tackle the first challenge, a novel high-quality seed instance mining (HSIM) module is designed. Specifically, the proposal comprehensive score (PCS), which consists of the traditional proposal score (PS) and the proposal space contribution score (PSCS), is designed as a novel metric to mine seed instances. The PS indicates the probability that a proposal pertains to a certain category, and the PSCS is calculated from the spatial correlation between top-scoring proposals and evaluates how completely a proposal covers an object. Consequently, a high PCS encourages the WSOD model to mine high-quality seed instances. To tackle the second challenge, a dynamic pseudo-instance label assignment (DPILA) strategy is developed, which dynamically sets the label assignment threshold. The DPILA can therefore better adapt to the distribution change of proposals during training and further promote model performance. The ablation studies verify the validity of the proposed PCS and DPILA. The comparison experiments verify that our method obtains better performance than other advanced WSOD methods on two popular RSI datasets.


Introduction
Object detection in RSIs is a pivotal task in imagery interpretation; its purpose is to identify and locate high-value geographical objects in RSIs. It has wide applications in various fields, such as environmental monitoring [1,2], urban planning [3], agriculture [4,5], and anomaly detection [6,7]. With the progress of machine learning [8][9][10][11][12][13][14], object detection has achieved satisfactory performance, and the most advanced results are obtained by fully supervised object detection (FSOD) methods [15][16][17][18][19]. However, FSOD methods need both category and location labels for every instance to drive model training, and manually annotating the location of each instance in each RSI is laborious. To alleviate this annotation cost, weakly supervised object detection (WSOD) methods [20,21], which only require image-level category labels to drive model training, have gradually attracted the attention of researchers.
At present, most WSOD models are trained based on the paradigm of multiple instance learning (MIL) [22][23][24][25]. Specifically, the training image is treated as a bag of latent instances, and the latent instances are then utilized to train the instance detector under the MIL constraints. Among these models, the pioneering weakly supervised deep detection network (WSDDN) [26] was the first to introduce MIL into the WSOD model. On the basis of WSDDN, the online instance classifier refinement (OICR) model [27] adds K instance classifier refinement (ICR) branches, which further improves the performance of the WSOD model. Subsequently, further works have enhanced the performance of WSOD by employing spatial correlation [28], initialization models [29], collaborative learning [30], etc.
Although classical WSOD has made significant progress, two main challenges remain. The first challenge is that most WSOD methods [27,31] merely employ the proposal score (PS) to mine seed instances; however, a proposal with a high PS usually covers only the most discriminative region of an object rather than the whole object. As a result, these methods perform worse on RSIs with noisy backgrounds. The second challenge is that the traditional pseudo-instance label assignment (PILA) strategy [27,31] cannot adapt to the change in the quality distribution of proposals during training. Specifically, the traditional PILA strategy sets a fixed label assignment threshold to determine the attribute (i.e., positive or negative instance) of each instance. However, as training proceeds, the fixed threshold no longer matches the evolving model, which is not conducive to training high-quality instances.
To tackle the first challenge, a novel high-quality seed instance mining (HSIM) module is designed to mine high-quality seed instances, as shown in Figure 1. Specifically, the proposal comprehensive score (PCS) is designed, which is composed of the traditional proposal score (PS) and the proposal space contribution score (PSCS). The PS indicates the probability that a proposal pertains to one category; the PSCS is calculated by considering the spatial relationships between top-scoring proposals and measures the extent to which a proposal covers an object. Consequently, seed instances mined by the PCS can locate an object better than those mined by the traditional strategy, which merely utilizes the PS.
To tackle the second challenge, a dynamic pseudo-instance label assignment (DPILA) strategy is developed to better adapt to the change in the quality distribution of proposals during training and, meanwhile, to raise the number of positive instances in the initial training stage. Specifically, the label assignment threshold is calculated dynamically by a function that increases with the number of iterations. Consequently, the DPILA strategy can dynamically assign a pseudo-instance label to each instance, which further improves the performance of WSOD.
Our contributions can be summarized as follows. First, a novel HSIM module is designed to mine high-quality seed instances. Specifically, a PCS is designed, which is composed of the traditional PS and the proposed PSCS, where the PSCS is calculated by considering the spatial relationships between top-scoring proposals to estimate how completely a proposal covers an object. The seed instances mined by the PCS can locate an object more completely than those mined by traditional strategies, which merely utilize the PS. Second, a DPILA strategy is proposed to better adapt to the change in the quality distribution of proposals during training. Specifically, a dynamic label assignment threshold is defined by a function that increases with the number of iterations. The proposed DPILA strategy can dynamically assign a pseudo-instance label to each instance, which is conducive to model training. Third, the ablation studies verify the validity of the PCS and DPILA, and the comparison experiments show that our method obtains higher performance than other advanced WSOD methods on two popular RSI datasets. Specifically, our method surpasses the WSDDN, OICR, PCL, and MELM methods by 12.2% (8.3%), 12.8% (5.1%), 7.9% (3.4%), and 5.0% (2.9%), respectively, in terms of mAP on the NWPU VHR-10.v2 (DIOR) dataset, and by 23.2% (11.9%), 18.4% (9.5%), 13.3% (2.8%), and 8.5% (1.0%), respectively, in terms of CorLoc on the NWPU VHR-10.v2 (DIOR) dataset.

Related Work

State-of-the-Art Weakly Supervised Object Detection Methods
Fully supervised object detection (FSOD) methods have achieved satisfactory performance. However, they need category and location labels to drive model training, and annotating these precise labels is time-consuming. WSOD methods, which only require image-level labels to drive model training, have therefore gradually attracted the attention of researchers. For example, Feng et al. [32] proposed a progressive contextual instance refinement strategy that can highlight more object parts and relieve the part domination problem. Yao et al. [33] proposed a dynamic curriculum learning strategy to robustly improve the performance. Feng et al. [34] proposed a triple context-aware network that can learn complementary and discriminative features and improve the performance of WSOD. Chen et al. [30] introduced a collaborative learning strategy into the WSOD model to improve its performance. Feng et al. [35] proposed a self-supervised adversarial and equivariant network that learns complementary and consistent instance features and promotes the performance of WSOD. Chen et al. [36] proposed a full-coverage collaborative network, which enhances the multiscale feature extraction ability of the WSOD detector.

Pseudo Instance Labels Mining
No instance-level labels are available to drive model training in WSOD. Therefore, mining a pseudo-instance label for each instance is a challenge. The current mainstream pseudo-instance label mining strategy can be divided into two steps, namely, seed instance mining and pseudo-instance label assignment. The details of the two steps are as follows.

Seed Instances Mining
Most seed instance mining strategies [27,37,38] select the proposal with the highest score in category c as the seed instance. However, this strategy ignores the plain fact that RSIs usually contain multiple instances of the same category, so it is unreasonable to select only the highest-scoring proposal as the seed instance in category c. Therefore, some improvements have been proposed. For instance, Tang et al. [39] use the k-means method to split the proposals into several clusters according to the proposal score, select the proposal with the highest score in each cluster, and then utilize a graph-based method to choose multiple seed instances of the same category. Lin et al. [40] consider that instances of the same category should have similar appearance features. Specifically, the highest-scoring proposal is selected as a seed instance in category c, and the similarity between this seed instance and the other instances is then calculated; if the similarity of a certain proposal is greater than the pre-set threshold, the proposal is selected as another seed instance. Cheng et al. [41] proposed a self-guided proposal generation strategy to directly generate high-quality seed instances. Qian et al. [42] proposed a novel seed instance mining strategy that employs supplemental segmentation information. Ren et al. [31] sort all of the proposals from high to low according to the PS of the categories present in an image and then select the proposals with the top p% scores as candidate seed instances. Finally, an operation similar to non-maximum suppression (NMS) [43] is utilized to choose the final seed instances.

Pseudo-Instance Labels Assignment
Most WSOD methods [27,31,39,44] assign pseudo-instance labels according to a fixed label assignment threshold. Concretely, suppose an image contains category label $c$; the seed instance $R_{si}$ belonging to category $c$ can be mined according to the abovementioned methods. The seed instance $R_{si}$ is then labeled with category $c$, i.e., $y^{k}_{cR_{si}} = 1$ and $y^{k}_{c'R_{si}} = 0$ for $c' \neq c$, where $k$ indicates the $k$-th ICR branch. The assignment is inspired by the observation that proposals with high spatial coverage of the seed instance should be assigned the same label. Specifically, if the maximum intersection over union (IoU) between a certain proposal and the seed instances is greater than the fixed label assignment threshold of 0.5, the proposal is treated as a neighbor positive instance, denoted $R_{npi}$, and is also labeled with category $c$, namely, $y^{k}_{cR_{npi}} = 1$ and $y^{k}_{c'R_{npi}} = 0$ for $c' \neq c$. Otherwise, the proposal is labeled as a background instance, denoted $R_{bi}$, namely, $y^{k}_{(C+1)R_{bi}} = 1$ and $y^{k}_{cR_{bi}} = 0$ for $c \neq C + 1$. However, the aforementioned methods merely employ the PS to mine seed instances, so the mined instances tend to locate discriminative regions of objects rather than whole objects. In addition, the fixed label assignment strategy cannot adapt to the change in the quality distribution of proposals, which is not conducive to training high-quality instances. These are the problems to be solved in this paper.

Materials and Methods
As shown in Figure 1, the OICR framework [27] is employed as the baseline framework of the proposed method. On the basis of OICR, a novel high-quality seed instance mining (HSIM) module is designed to mine high-quality seed instances. Specifically, the PCS is designed, which is composed of the traditional PS and the PSCS. The PS indicates the probability that a proposal pertains to a certain category; the PSCS is calculated by considering the spatial relationships between top-scoring proposals and measures the extent to which a proposal covers an object. In addition, a novel dynamic pseudo-instance label assignment (DPILA) strategy is proposed to better adapt to the change in the quality distribution of proposals during training and, meanwhile, to raise the number of positive instances in the initial training stage. Specifically, the label assignment threshold is calculated dynamically by a function that increases with the number of iterations.

Basic Weakly Supervised Object Detection Network
Bilen et al. [26] put forward the path-breaking weakly supervised deep detection network (WSDDN), which is the cornerstone of WSOD. The details of the WSDDN are as follows. Firstly, consider an image $I$ with image-level category labels $Y = [y_1, \ldots, y_c, \ldots, y_C]$, where $y_c \in \{1, 0\}$ denotes whether object category $c$ is present or absent in the image, and $C$ is the number of object categories. For each image, a set of proposals $R = \{r_1, r_2, \ldots, r_{|R|}\}$ is produced by the edge boxes (EB) [45] or selective search (SS) [46] algorithm, where $|R|$ is the number of proposals. Secondly, the feature maps $F \in \mathbb{R}^{W \times H \times C}$ are obtained by sending the image $I$ into the convolutional network (ConvNet), where $C$, $H$, and $W$ here indicate the channels, height, and width of the feature maps $F$. Thirdly, the feature maps $F$ and the proposals $R$ are sent into the region of interest (RoI) pooling layer to obtain the proposal feature maps $F_R$ with a fixed size. Fourthly, the proposal feature vectors are obtained via two fully connected (FC) layers. These proposal feature vectors are then sent into two parallel branches, i.e., the classification branch and the detection branch, to produce two matrices $x^{c}, x^{d} \in \mathbb{R}^{C \times |R|}$ through their respective FC layers. The classification score and detection score of each proposal are obtained by performing a softmax operation on the two matrices along different directions:

$$[\sigma(x^{c})]_{cr} = \frac{e^{x^{c}_{cr}}}{\sum_{c'=1}^{C} e^{x^{c}_{c'r}}}, \qquad [\sigma(x^{d})]_{cr} = \frac{e^{x^{d}_{cr}}}{\sum_{r'=1}^{|R|} e^{x^{d}_{cr'}}},$$

where $[\sigma(x^{c})]_{cr}$ indicates the probability that proposal $r$ pertains to category $c$, and $[\sigma(x^{d})]_{cr}$ represents the dedication of proposal $r$ to category $c$. The 'dedication' indicates the contribution of proposal $r$ to the image being classified as category $c$. Therefore, $[\sigma(x^{d})]_{cr}$ can also be regarded as a probability to a certain extent; namely, the higher the value of $[\sigma(x^{d})]_{cr}$, the greater the probability that the proposal is a positive instance. The proposal score is calculated as the element-wise product of $\sigma(x^{c})$ and $\sigma(x^{d})$:

$$x = \sigma(x^{c}) \odot \sigma(x^{d}),$$

where $x \in \mathbb{R}^{C \times |R|}$ represents the proposal score. Furthermore, the image-level prediction score $\phi_c$ of category $c$ is acquired by summing over all proposals:

$$\phi_c = \sum_{r=1}^{|R|} x_{cr}.$$

Finally, the loss function $L_{WSDDN}$ of WSDDN is defined as follows:

$$L_{WSDDN} = -\sum_{c=1}^{C} \left[ y_c \log \phi_c + (1 - y_c) \log (1 - \phi_c) \right],$$

where $y_c \in \{1, 0\}$ is the image-level category label, which indicates whether object category $c$ is present or absent in the image.
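The two-stream scoring described above can be illustrated with a minimal, dependency-free sketch. It assumes the usual WSDDN formulation (softmax over categories for the classification stream, softmax over proposals for the detection stream, and a binary cross-entropy image-level loss); the function names and the nested-list layout (rows are categories, columns are proposals) are illustrative choices, not the authors' code.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def wsddn_scores(x_cls, x_det):
    """x_cls, x_det: C x |R| nested lists (rows = categories, columns = proposals).

    Returns (x, phi): the C x |R| proposal scores and the C image-level scores.
    """
    num_classes, num_props = len(x_cls), len(x_cls[0])
    # Classification stream: softmax over categories, one proposal (column) at a time.
    cls_cols = [softmax([x_cls[c][r] for c in range(num_classes)])
                for r in range(num_props)]
    # Detection stream: softmax over proposals, one category (row) at a time.
    det_rows = [softmax(row) for row in x_det]
    # Proposal score: element-wise product of the two streams.
    x = [[cls_cols[r][c] * det_rows[c][r] for r in range(num_props)]
         for c in range(num_classes)]
    # Image-level score: sum of the proposal scores of each category.
    phi = [sum(row) for row in x]
    return x, phi


def wsddn_loss(phi, labels, eps=1e-6):
    """Binary cross-entropy between image-level scores and image-level labels."""
    total = 0.0
    for p, y in zip(phi, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep the logarithms finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total
```

Because each detection-stream row sums to one and every classification probability is below one, each image-level score stays strictly between 0 and 1, which is what allows it to be used directly in the cross-entropy loss.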
To further promote the performance of the WSOD model, Tang et al. [27] introduced multi-stage instance classifier refinement (ICR) branches to improve the WSOD network. Specifically, $K$ parallel ICR branches are added to the WSDDN. Each ICR branch consists of an FC layer and a softmax layer and outputs a $(C + 1)$-dimensional score matrix $x^{k} \in \mathbb{R}^{(C+1) \times |R|}$, where $k \in \{1, 2, \ldots, K\}$ and the $(C + 1)$-th dimension denotes the background. The $k$-th ICR branch is supervised by the previous $(k - 1)$-th branch, except for the first ICR branch, which is supervised by the WSDDN (i.e., $x$). Finally, the $K$ ICR branches are trained with a weighted cross-entropy loss, which is formulated as follows:

$$L^{k}_{ICR} = -\frac{1}{|R|} \sum_{r=1}^{|R|} \sum_{c=1}^{C+1} w^{k}_{r}\, y^{k}_{cr} \log x^{k}_{cr},$$

where $w^{k}_{r}$ denotes the loss weight and $y^{k}_{cr} \in \{1, 0\}$ indicates the pseudo-instance label. For more details, please refer to [27].
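As a small illustration of the refinement loss, the sketch below computes a weighted cross-entropy for one ICR branch, assuming the standard OICR form in which each proposal's loss is scaled by its weight and averaged over proposals. For convenience the scores are stored proposal-major (one list of C + 1 class probabilities per proposal); the function name and layout are illustrative assumptions.

```python
import math


def icr_loss(scores, labels, weights, eps=1e-6):
    """Weighted cross-entropy for one ICR branch.

    scores[r][c]:  softmax output for proposal r and class c (last class = background).
    labels[r][c]:  one-hot pseudo-instance label of proposal r.
    weights[r]:    loss weight of proposal r.
    """
    total = 0.0
    for s_r, y_r, w_r in zip(scores, labels, weights):
        for s, y in zip(s_r, y_r):
            if y:  # only the labeled class contributes to the cross-entropy
                total -= w_r * math.log(max(s, eps))
    return total / len(scores)


# Two proposals, two object classes plus background.
example_scores = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
example_labels = [[1, 0, 0], [0, 0, 1]]
example_loss = icr_loss(example_scores, example_labels, weights=[0.9, 0.5])
```

The per-proposal weight lets unreliable pseudo labels contribute less to the gradient, which matters early in training when the seed instances are still noisy.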
However, most existing methods [27,31,39] merely employ the proposal score (PS) to mine seed instances, where the PS indicates the probability that a proposal pertains to one category. Specifically, the proposal with the highest PS in a certain category is selected as the seed instance. However, the proposal with the highest PS usually covers the most discriminative region of an object rather than the whole object. Therefore, existing methods are not able to mine high-quality seed instances.

High-Quality Seed Instance Mining Guided by Proposal Comprehensive Score
To overcome the above challenge, the proposal comprehensive score (PCS) is designed, which jointly considers the traditional proposal score (PS) and the proposed proposal space contribution score (PSCS). The PSCS is calculated by considering the spatial relationships between top-scoring proposals and measures the extent to which a proposal covers an object. Consequently, seed instances mined by the PCS can locate an object more completely than those mined by traditional strategies, which merely utilize the PS. The details of the PCS are as follows.
Firstly, the proposals are sorted from high to low based on their PS in each category present in the image. Secondly, the proposals with the top p% PS in category $c$ are selected as top-scoring proposals and collected in the set $R_c = \{r_1, \ldots, r_n, \ldots, r_N\}$, where $N$ is the number of top-scoring proposals in category $c$. Thirdly, the PSCS of each top-scoring proposal is calculated from the spatial relationships between the top-scoring proposals. Fourthly, the PCS of each top-scoring proposal is calculated by combining its PS and PSCS, where $\mathrm{PS}_{cn}$ indicates the proposal score of the $n$-th proposal $r_n$ in category $c$, $\mathrm{PSCS}_{cn}$ denotes the proposal space contribution score of $r_n$ in category $c$, and $\alpha$ is the hyper-parameter that balances the contributions of the PS and the PSCS. The details of the PSCS are as follows.
An undirected weighted graph $G^{s}_{c} = (V^{s}_{c}, E^{s}_{c})$ is first constructed according to the spatial correlation of $R_c$, where the vertexes $V^{s}_{c}$ denote the top-scoring proposals and each edge in $E^{s}_{c} = \{\sigma^{c}_{nn'}\}$ denotes the spatial correlation between two vertexes. As shown in Figure 2, the weight $\sigma^{c}_{nn'}$ of each edge is obtained from the IoU between the two vertexes, $\mathrm{IoU}(r_n, r_{n'})$ with $n \neq n'$, together with a threshold hyper-parameter $T$. Based on these edge weights, $\mathrm{PSCS}_{cn}$ is calculated using a normalization operator $N(\cdot)$. Finally, following the mining strategy of [31], the PCS is utilized to mine high-quality seed instances, which are collected in the set $R^{s}_{c} = \{r^{s}_{1}, \ldots, r^{s}_{m}, \ldots, r^{s}_{M}\}$, where $M$ denotes the number of seed instances in category $c$.
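To make the role of the PSCS concrete, the following sketch scores a handful of boxes. The exact formulas are not reproduced in this text, so the sketch rests on three loudly stated assumptions: an edge weight equals the IoU when the IoU exceeds T and is zero otherwise, N(·) is min-max normalization of the summed edge weights, and the PCS is the convex combination α·PS + (1 − α)·PSCS. All function names are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def pscs(boxes, t=0.7):
    """ASSUMED form: sum of thresholded IoU edge weights, min-max normalized."""
    n = len(boxes)
    raw = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            if i != j:
                weight = iou(boxes[i], boxes[j])
                if weight > t:  # ASSUMPTION: edges at or below T get zero weight
                    total += weight
        raw.append(total)
    lo, hi = min(raw), max(raw)
    if hi == lo:
        return [0.0] * n
    return [(v - lo) / (hi - lo) for v in raw]


def pcs(ps, boxes, alpha=0.5, t=0.7):
    """ASSUMED form: alpha * PS + (1 - alpha) * PSCS for each top-scoring proposal."""
    space = pscs(boxes, t)
    return [alpha * p + (1.0 - alpha) * s for p, s in zip(ps, space)]


# Three boxes cover (almost) the whole object; the last one covers only a part
# but has the highest proposal score.
boxes = [(0, 0, 10, 10), (0, 0, 10, 9), (1, 0, 10, 10), (0, 0, 4, 4)]
ps = [0.6, 0.5, 0.5, 0.9]
scores = pcs(ps, boxes)
best_by_ps = ps.index(max(ps))
best_by_pcs = scores.index(max(scores))
```

In this toy case the part-level box wins on PS alone, while the full-object box wins on PCS because it is supported by several strongly overlapping neighbors, which is the behavior the section attributes to the PSCS.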

Dynamic Pseudo Instance Label Assignment for Each Instance
Most WSOD methods set a fixed instance label assignment threshold (i.e., an IoU value) to determine whether a certain proposal is a positive or negative instance. If the IoU between the proposal $r$ and its nearest seed instance $r^{s}_{m}$ is greater than or equal to the default threshold $T_{IoU}$, the proposal is labeled as a positive instance; otherwise, the proposal is labeled as a negative instance. Specifically, the label is defined as follows:

$$y^{k}_{cr} = \begin{cases} 1, & \mathrm{IoU}(r, r^{s}_{m}) \geq T_{IoU}, \\ 0, & \text{otherwise}, \end{cases}$$

where $r \notin R^{s}_{c}$ indicates a certain proposal, and $T_{IoU}$ is a fixed value that is usually set to 0.5 and therefore cannot adapt to the change in the quality distribution of proposals. In addition, setting a high $T_{IoU}$ may lead to the loss of some potential positive instances at the early stage of model training.
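The fixed-threshold rule can be sketched in a few lines. The helper below labels each proposal by its best IoU with any seed instance of one category; the function names are illustrative, and the "nearest seed" is interpreted here as the seed with the maximum IoU, which is an assumption.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def assign_labels(proposals, seeds, t_iou=0.5):
    """Return 1 (positive) or 0 (negative) for each proposal of one category."""
    labels = []
    for box in proposals:
        # ASSUMPTION: the nearest seed is the one with the largest IoU.
        best = max((iou(box, seed) for seed in seeds), default=0.0)
        labels.append(1 if best >= t_iou else 0)
    return labels


seeds = [(0, 0, 10, 10)]
proposals = [(0, 0, 10, 8), (0, 0, 10, 4), (20, 20, 30, 30)]
strict = assign_labels(proposals, seeds, t_iou=0.5)
loose = assign_labels(proposals, seeds, t_iou=0.3)
```

Comparing `strict` and `loose` shows the trade-off discussed in the text: lowering the threshold turns the half-covering proposal into a positive, while the disjoint proposal remains negative in both cases.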
To overcome this issue, a dynamic pseudo-instance label assignment (DPILA) strategy is proposed. Here, dynamic means that the label assignment threshold changes as training progresses. Specifically, a growth function is designed to gradually raise the IoU threshold as training goes on. The dynamic IoU threshold $T^{d}_{IoU}$ is defined by this growth function, where $l$ and $m$ denote hyper-parameters and $t$ indicates the current iteration number; its variation curve is shown in Figure 3. Therefore, the label is redefined as follows:

$$y^{k}_{cr} = \begin{cases} 1, & \mathrm{IoU}(r, r^{s}_{m}) \geq T^{d}_{IoU}, \\ 0, & \text{otherwise}. \end{cases}$$

During testing, the DPILA strategy is discarded (i.e., all experiment results are from the mean output of the 3 ICR branches), and the threshold is a fixed value (i.e., 0.5), following the WSOD criterion [27,31,39].

Extensive experiments are implemented to measure the validity of the proposed methods on the NWPU VHR-10.v2 dataset [47,48] and the DIOR dataset [49]. The NWPU VHR-10.v2 dataset comprises 1172 images of 400 × 400 pixels, split into 879 trainval images and 293 test images, and includes 10 object categories and 2775 instances. The DIOR dataset is more difficult and includes 23,463 images of 800 × 800 pixels. It is partitioned into a trainval set of 11,725 images and a testing set of 11,738 images, and includes 20 object categories and 192,472 instances.

Evaluation Metric
We employed two standard metrics to evaluate the performance of our method, which are widely used and accepted evaluation metrics in WSOD, namely, mean average precision (mAP) and correct localization (CorLoc) [50], where mAP evaluates the accuracy of detection on the testing set and CorLoc assesses the accuracy of localization on the trainval set.The two evaluation metrics comply with the PASCAL protocol.
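As a reminder of what CorLoc measures, the sketch below computes it for a single category: an image counts as correctly localized when the top-scoring predicted box for that category overlaps some ground-truth box of the category with an IoU of at least 0.5. The data layout and function names are illustrative assumptions, and the inclusive comparison follows common implementations rather than anything stated in this text.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def corloc(top_boxes, gt_boxes, iou_thr=0.5):
    """Fraction of images whose top-scoring box hits a ground-truth box.

    top_boxes[i]: top-scoring predicted box of the category in image i.
    gt_boxes[i]:  list of ground-truth boxes of the category in image i.
    Only trainval images that contain the category should be passed in.
    """
    hits = 0
    for pred, gts in zip(top_boxes, gt_boxes):
        if any(iou(pred, gt) >= iou_thr for gt in gts):
            hits += 1
    return hits / len(top_boxes)


# One image localized correctly, one missed.
example = corloc(
    top_boxes=[(0, 0, 10, 10), (0, 0, 10, 10)],
    gt_boxes=[[(0, 0, 10, 9)], [(20, 20, 30, 30)]],
)
```

The dataset-level CorLoc is then the mean of this per-category value over all categories, in the same way that mAP averages the per-category AP.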

Implementation Details
The OICR network serves as the baseline framework for the proposed method. Similar to refs. [27,39,51], VGG-16 [52], pre-trained on the large-scale ImageNet dataset [8] in accordance with standard practice, is utilized as the backbone network. The number of ICR branches is set to 3. Following the standard of WSOD, only the image-level category labels of the trainval set are employed to train our model. We utilized stochastic gradient descent (SGD) to optimize our WSOD model, with the momentum and weight decay set to 0.9 and 0.0001, respectively. The initial learning rate and batch size are set to 0.01 and 8, respectively. We conducted a total of 20K and 60K training iterations on the NWPU VHR-10.v2 and DIOR datasets, respectively. The learning rate is decayed by a factor of 0.1 after 18K and 50K iterations on the NWPU VHR-10.v2 and DIOR datasets, respectively. The hyper-parameters l, m, and p are set to 0.0002, 1, and 15, respectively. For data augmentation, all training images are augmented by rotating them by 90° and 180° and by horizontal flipping [32,33]. In addition, following the mainstream methods [27,39], the images are resized to five distinct scales {480, 576, 688, 864, 1200} for training and testing. Inference results are post-processed with NMS, whose threshold is set to 0.3 [32,39,53,54].
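To give a feel for how a growing IoU threshold interacts with the reported iteration budgets, the sketch below uses a hypothetical saturating growth function. The paper's exact growth function is not reproduced in this text, so the functional form, the start value `t_start`, and the final value `t_final` are all assumptions; only the values l = 0.0002 and m = 1 are taken from the settings above.

```python
import math


def dynamic_iou_threshold(t, l=0.0002, m=1.0, t_start=0.3, t_final=0.5):
    """HYPOTHETICAL growth function, not the paper's formula.

    Rises monotonically from t_start at iteration 0 toward t_final as the
    iteration count t grows; l controls the speed and m the shape.
    """
    return t_start + (t_final - t_start) * (1.0 - math.exp(-l * t)) ** m


def label_at_iteration(iou_value, t):
    """Positive (1) if the IoU with the nearest seed reaches the current threshold."""
    return 1 if iou_value >= dynamic_iou_threshold(t) else 0


# Threshold at a few checkpoints of a 20K-iteration schedule.
checkpoints = [0, 5000, 10000, 20000]
thresholds = [dynamic_iou_threshold(t) for t in checkpoints]

# A proposal with IoU 0.4 is kept as a positive early on and rejected later.
early_label = label_at_iteration(0.4, 0)
late_label = label_at_iteration(0.4, 20000)
```

Under these assumptions the threshold is already close to the conventional 0.5 by the end of the 20K-iteration NWPU VHR-10.v2 schedule, so the extra positives are admitted mainly in the early iterations, which matches the stated motivation for the DPILA strategy.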
The training details are also listed in Table 1. The region proposals are generated by an image segmentation algorithm (i.e., the selective search algorithm [46]). Specifically, the algorithm consists of the following three steps: (1) Initial segmentation: the image is segmented into small regions based on pixel intensity and texture similarity. (2) Similarity measure: all adjacent region pairs are assigned a similarity score based on color, texture, size, and shape differences. (3) Proposal generation: the most similar regions are merged repeatedly until the desired number of proposals is obtained. Following the paradigm of WSOD, about 2000 region proposals are generated via the selective search algorithm. The scale of image segmentation is not fixed; it is determined by the merging of similar regions in step (3).
All experiments are implemented on 8 TITAN RTX GPUs with the PyTorch framework.

Parameter Analysis of α
As previously discussed, the parameter α plays a critical role in determining the relative contributions of the PS and the PSCS. To objectively assess this relationship, we conducted a quantitative analysis on the DIOR dataset. As demonstrated in Figure 4, our approach achieved the highest mAP when α is 0.5. Based on these results, we adopted α = 0.5 in this paper.

Parameter Analysis of T
As mentioned before, T is the threshold that determines the value of the edge weight σ; it is analyzed quantitatively on the DIOR dataset. As demonstrated in Figure 5, our approach achieved the highest mAP when T is 0.7. Based on these results, we adopted T = 0.7 in this paper.

Ablation Studies
Ablation studies are conducted to verify the validity of the PCS and DPILA. Specifically, as shown in Table 2, the baseline, baseline+PCS, baseline+DPILA, and baseline+PCS+DPILA experiments are implemented on the DIOR dataset.

Influence of PCS
The baseline+PCS experiment is conducted to validate the influence of the proposed PCS. As shown in Table 2, the baseline+PCS method obtains 20.3% mAP and 42.2% CorLoc on the DIOR dataset, which surpasses the baseline method by 3.8% mAP and 7.4% CorLoc. Therefore, the validity of the PCS is verified. The major reason for the performance enhancement is that the proposed PCS can effectively guide the WSOD model to mine high-quality seed instances, which further encourages the model to locate objects more completely.

Influence of DPILA
The baseline+DPILA experiment is conducted to validate the influence of the proposed DPILA. As shown in Table 2, the baseline+DPILA method obtains 18.9% mAP and 41.0% CorLoc, which outperforms the baseline method by 2.4% mAP and 6.2% CorLoc on the DIOR dataset. Therefore, the validity of the DPILA is verified. The major reason for the performance enhancement is that the proposed DPILA strategy can adapt to the change in the quality distribution of proposals during training and mine some potential positive instances at the early stage of model training. Consequently, the DPILA strategy can dynamically assign a pseudo-instance label to each instance, which further improves the performance of WSOD.
The baseline+PCS+DPILA experiment is conducted to verify the influence of combining the PCS and DPILA. As shown in Table 2, the baseline+PCS+DPILA method obtains 21.6% mAP and 44.3% CorLoc on the DIOR dataset, which outperforms the other three methods. Therefore, the validity of combining the PCS and DPILA is verified.

Comparison in Terms of mAP
Tables 3 and 4 present the comparison in terms of mAP between our approach and other advanced WSOD methods. Specifically, as shown in Table 3, our approach obtains 47.3% mAP on the NWPU VHR-10.v2 dataset, exceeding the WSDDN, OICR, PCL, and MELM methods by 12.2%, 12.8%, 7.9%, and 5.0% in terms of mAP, respectively. As shown in Table 4, our method obtains 21.6% mAP on the DIOR dataset, exceeding the WSDDN, OICR, PCL, MELM, DCL, FCC-Net, and CLN-RSOD methods by 8.3%, 5.1%, 3.4%, 2.9%, 1.4%, 3.3%, and 3.3% in terms of mAP, respectively. Our approach thus further narrows the performance gap between FSOD and WSOD methods.

Subjective Comparison
In addition, to further evaluate our method, four advanced WSOD methods that provide source code are subjectively compared with our method on the two RSI datasets in Figures 6 and 7. Figure 6 shows the visual comparison results on the NWPU VHR-10.v2 dataset, where objects of different categories are enclosed by bounding boxes of different colors. Figure 7 displays the visual comparison results on the DIOR dataset, where the objects are enclosed by green bounding boxes. Moreover, the category of each object is attached to its bounding box. As shown in Figures 6 and 7, the detection results of our approach locate objects completely and identify them correctly.

Runtime Analysis
To assess the practicality of the proposed approach in real-world scenarios, we further report the runtime of the proposed method for training and inference. As shown in Table 7, during training, incorporating the HSIM into the baseline method increases the computational time from 24.8 to 30.4 h; this additional cost is introduced by the HSIM module. Incorporating the DPILA into the baseline method increases the computational time from 24.8 to 25.0 h, which is caused by the calculation of the DPILA. During inference, the HSIM module and the DPILA calculation are discarded; namely, all experiment results are from the mean output of the 3 ICR branches (as shown in the lower right of Figure 1). Therefore, all methods have the same complexity and the same inference time (i.e., 2.2 h). Although the training time of the baseline method is less than ours (24.8 versus 30.7 h), its mAP is 5.1% lower than ours.

Discussion
The first challenge is that the detection results of WSOD tend to locate the most discriminative regions of an object rather than the whole object. To tackle it, the PCS, which consists of the traditional PS and the PSCS, is designed as a novel metric to mine high-quality seed instances. The second challenge is that traditional pseudo-instance label assignment strategies cannot adapt to the change in the quality distribution of proposals during training, which is not conducive to training a high-performance detector. To tackle it, a DPILA strategy is developed, which dynamically sets the label assignment threshold. Consequently, combining the proposed PCS with the DPILA achieves better performance than other advanced WSOD methods on two popular RSI datasets. Specifically, our method surpasses the WSDDN, OICR, PCL, and MELM methods by 12.2% (8.3%), 12.8% (5.1%), 7.9% (3.4%), and 5.0% (2.9%), respectively, in terms of mAP on the NWPU VHR-10.v2 (DIOR) dataset, and by 23.2% (11.9%), 18.4% (9.5%), 13.3% (2.8%), and 8.5% (1.0%), respectively, in terms of CorLoc on the NWPU VHR-10.v2 (DIOR) dataset.

Conclusions
In this paper, a novel HSIM module is designed to tackle the challenge that the detection results of a WSOD detector tend to locate the most discriminative regions of an object rather than the whole object. Specifically, the PCS is designed, which is composed of the traditional PS and the proposed PSCS. The PSCS evaluates how completely a proposal covers an object. Consequently, a high PCS encourages the WSOD model to mine high-quality seed instances. A DPILA strategy is developed to tackle the challenge that traditional pseudo-instance label assignment strategies cannot adapt to the change in the quality distribution of proposals during training. Specifically, a dynamic label assignment threshold is defined by a function that increases with the number of iterations. Consequently, the DPILA strategy can dynamically assign a pseudo-instance label to each instance, which further improves the performance of WSOD. The ablation studies verify the validity of the proposed PCS and DPILA. The comparison experiments verify that our approach obtains better performance than other advanced WSOD detectors on two popular RSI datasets. The subjective comparison directly demonstrates that our method can locate objects completely and identify them correctly.
A shortcoming of the proposed model is that it performs poorly on individual classes such as Dam and Windmill. The possible reason is that our model is susceptible to interference from complex backgrounds. For instance, a Dam is disturbed by the adjacent large reservoir, so the reservoir is often mistakenly identified as a Dam. Similarly, a Windmill is disturbed by its own shadow, so the shadow is often mistakenly identified as a Windmill. To improve the anti-interference ability of our model, we plan to design a novel feature enhancement module to enhance the feature extraction ability of WSOD. High-quality features are conducive to correctly identifying objects and enhance the robustness of the WSOD model.

Figure 1. The overall framework of our method, which is established on the OICR network [27] by introducing two proposed modules: the high-quality seed instance mining (HSIM) module and the dynamic pseudo-instance label assignment (DPILA) strategy. Here, the HSIM is designed to mine high-quality seed instances. The DPILA strategy is proposed to better adapt to the quality distribution change of proposals during training.

Figure 2. The details of the weighted graph. Here, the graph is undirected and weighted. Specifically, the vertexes of the graph denote top-scoring proposals, and each edge denotes the spatial correlation (i.e., IoU) between vertexes.

Figure 3. The variation curve of the dynamic IoU threshold. The horizontal axis represents the number of iterations, and the vertical axis represents the IoU threshold.

Figure 4. Parameter analysis of α on the DIOR dataset. The horizontal axis represents different α values, and the vertical axis represents the mAP values.

Figure 5. Parameter analysis of T on the DIOR dataset. The horizontal axis represents different T values, and the vertical axis represents the mAP values.

Figure 6. Four advanced WSOD methods that provide source code are subjectively compared with our method on the NWPU VHR-10.v2 dataset.

Table 1. The training details of our method, including the training settings and parameter settings.

Table 2. Ablation studies of our method on the DIOR dataset.

Table 3. Comparisons with other advanced methods in terms of AP (%) and mAP (%) on the NWPU VHR-10.v2 dataset.

Table 4. Comparisons with other advanced methods in terms of AP (%) and mAP (%) on the DIOR dataset.

Table 6. Comparisons with other advanced methods in terms of CorLoc (%) on the DIOR dataset. '-' denotes that the CorLoc value has not been reported in the corresponding study. Bold entries denote the best results.

Table 7. Complexity analysis of our method on the DIOR dataset. All experiments are implemented on Ubuntu 16.04 and NVIDIA TITAN RTX GPU.