FCC-Net: A Full-Coverage Collaborative Network for Weakly Supervised Remote Sensing Object Detection

Abstract: With the ever-increasing resolution of optical remote-sensing images, how to extract information from these images efficiently and effectively has gradually become a challenging problem. As it is prohibitively expensive to manually label every object in these high-resolution images, only a small number of high-resolution images with detailed object labels are available, which is highly insufficient for common machine learning-based object detection algorithms. Another challenge is the huge range of object sizes: it is difficult to locate large objects, such as buildings, and small objects, such as vehicles, simultaneously. To tackle these problems, we propose a novel neural network-based remote sensing object detector called the full-coverage collaborative network (FCC-Net). The detector employs various tailored designs, such as hybrid dilated convolutions and multi-level pooling, to enhance multiscale feature extraction and improve its robustness in dealing with objects of different sizes. Moreover, by utilizing asynchronous iterative training alternating between strongly supervised and weakly supervised detectors, the proposed method only requires image-level ground-truth labels for training. To evaluate the approach, we compare it against several state-of-the-art techniques on two large-scale remote-sensing image benchmarks. The experimental results show that FCC-Net significantly outperforms other weakly supervised methods in detection accuracy. Through a comprehensive ablation study, we also demonstrate the efficacy of the proposed dilated convolutions and multi-level pooling in increasing the scale invariance of an object detector.


Introduction
Remote sensing is an interdisciplinary subject involving technologies [1] such as aviation, optical instruments, electronic sensors and computer science. Over the past few decades, the development of these individual disciplines has armed remote sensing with unprecedented tools that enable extremely high image resolution and fidelity. Ever-more-capable remote-sensing imaging has become an essential part of many important applications, including high-precision map surveying [2], weather forecasting [3], land planning [4] and disaster prevention [5]. With the fast-growing size of remote-sensing image data, a pressing problem is to efficiently extract useful information from these high-resolution images. In general, information extraction for remote-sensing images comprises a few components: scene classification, object recognition, object detection and semantic segmentation. Based on the idea above, we propose a full-coverage collaborative network for weakly supervised RSOD. The evaluation results on two large-scale remote sensing datasets, TGRS-HRRSD (High-resolution Remote Sensing Detection) [20] and DIOR [21], show that the proposed model achieves promising accuracy even without any bounding-box annotations. To make the proposed technique more effective against the previously discussed challenges, we also develop several novel techniques and make improvements upon existing approaches.
Our main contributions are summarized as follows: (1) We propose a novel end-to-end remote sensing object detection network (FCC-Net) that combines a weakly supervised detector and a strongly supervised detector to address the challenge of insufficiently labeled remote sensing data, which significantly improves performance when training with only image-level labels; (2) We design a scale-robust module on top of the backbone using hybrid dilated convolutions and introduce a cascade multi-level pooling module for multiple feature fusion at the backend of the backbone, which suppresses the sensitivity of the network to scale changes and further enhances its feature learning ability; (3) We define a focal-based classification and distance-based regression multitask collaborative loss function that jointly optimizes region classification and regression in the RPN phase; (4) Our proposed method yields significant improvements over state-of-the-art methods on the TGRS-HRRSD and DIOR datasets.
The remainder of this study is organized as follows: Section 2 discusses related works in the field of RSOD, including common problems and difficulties to be solved in multiobject detection, as well as existing solutions. In Section 3, we present the proposed method and its main components in detail. Section 4 introduces experimental datasets and evaluations, and Section 5 discusses the experimental results. Finally, we conclude and discuss future work in Section 6.


Related Work
CNNs have achieved great success in object detection for natural scene images, with the emergence of two-stage detection algorithms represented by R-CNN [22], fast R-CNN [23] and faster R-CNN [18], and single-stage detection algorithms represented by YOLO [24], RetinaNet [25] and CornerNet [26]. Due to the particularities of their sensors and shooting angles, remote-sensing images differ from natural scene images in many ways, such as small objects, high resolution, complex backgrounds, imbalanced classes and insufficient training examples. Therefore, improving the accuracy of RSOD has become a very active research subject. In this section, we review the deep learning components related to the proposed network for RSOD tasks.

Small Objects in Remote Sensing Images
Due to the large size of remote-sensing images, it is easy to miss or falsely detect small objects of only tens or hundreds of pixels. As the number of network layers increases, the feature information of small objects gradually weakens or even disappears. To address this problem, a large number of detection algorithms have been developed, and these methods tend to fall into two main categories. The first category comprises feature learning-based detection algorithms, which enhance the feature expression of small objects by extracting or fusing multiscale features: for instance, using unconventional convolutions to reduce the loss during feature extraction [27][28][29][30], constructing additional branches to fuse more contextual features [31][32][33], or introducing attention mechanisms to strengthen the relationship between global and local pixels [34,35]. The second category comprises sample postprocessing-based detection algorithms, such as intersection over union (IoU) variants [36,37] and non-maximum suppression (NMS) [38,39].
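As an illustration of the postprocessing family, the following is a minimal NumPy sketch of greedy NMS (a generic version, not the specific variants of [38,39]); the box format `[x1, y1, x2, y2]` and the default `iou_thresh` are assumptions:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one [x1, y1, x2, y2] box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop
    remaining boxes that overlap it by more than iou_thresh."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) < iou_thresh]
    return keep
```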

Insufficient Training Examples of Remote Sensing Images
Since only image-level labels are required for training, weakly supervised learning has attracted great attention in the field of object detection. Most existing works adopt the idea of multiple-instance learning [40][41][42][43][44][45] to transform weakly supervised object detection into a multilabel classification problem. Among them, the most representative model is the two-stream weakly supervised deep detection network (WSDDN) [40], which multiplies the scores of the classification and detection streams to select high-confidence positive samples. Many subsequent CNN-based works [41][42][43][44][45] are built on WSDDN. Due to the lack of accurate bounding-box labels, the localization performance of weakly supervised detectors is far worse than that of strongly supervised detectors. Therefore, some recent works [41,42,[46][47][48] attempt to utilize a multiphase learning manner for weakly supervised object detection, improving detection accuracy to a certain extent. However, these approaches cannot be directly applied to remote-sensing images. Thus, several efforts [49][50][51][52][53][54] have been made to address the particularities of RSOD under weakly supervised learning. Although these methods achieve promising results, there is still much room for improvement.
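The WSDDN-style two-stream scoring described above can be sketched as follows (a simplified NumPy illustration, not the exact implementation of [40]; logit shapes of `(num_proposals, num_classes)` are assumed):

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def wsddn_image_scores(cls_logits, det_logits):
    """WSDDN-style aggregation. The classification stream is normalized over
    classes (what is in each proposal); the detection stream is normalized
    over proposals (which proposal best shows each class). Their elementwise
    product, summed over proposals, gives image-level class scores in [0, 1]."""
    cls = softmax(cls_logits, axis=1)   # per-proposal class distribution
    det = softmax(det_logits, axis=0)   # per-class proposal distribution
    proposal_scores = cls * det
    return proposal_scores.sum(axis=0)  # shape: (num_classes,)
```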

Foreground-Background Class Imbalance
In remote-sensing images, objects occupy only a small proportion of the large-scale image, against complex backgrounds. Therefore, when generating proposals, a large number of proposals are background, which dominates the gradient descent during training and degrades the performance of the detector. Solutions to the foreground-background class imbalance fall into two groups: (1) hard sampling methods and (2) easy sampling methods. For example, previous works [30,55,56] add online hard example mining (OHEM) to detection networks to focus on extracting features of hard samples, giving the whole network a more robust classification capability. Focal loss can also be introduced [57] or modified [58] to make the model focus on hard positive samples by down-weighting easy negative samples during training, improving the accuracy of vehicle detection.
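The hard-sampling idea can be sketched in a few lines (an OHEM-style selection of the highest-loss proposals; the function name and interface here are hypothetical):

```python
import numpy as np

def ohem_mask(losses, num_hard):
    """Online hard example mining: keep only the num_hard proposals with the
    largest per-proposal loss; the boolean mask marks which proposals
    contribute to the backward pass."""
    keep = np.argsort(losses)[::-1][:num_hard]
    mask = np.zeros_like(losses, dtype=bool)
    mask[keep] = True
    return mask
```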

Proposed Method
In this section, we present the proposed network and the related components that affect its performance in detail. This study focuses on making full use of image-level supervision information through a collaboratively trained network for multiple-object remote sensing detection; the overall architecture of FCC-Net is illustrated in Figure 2. As can be seen, it consists of three modules: (1) the full-coverage residual network (FCRN), (2) the cascade multi-level pooling module (CMPM) and (3) the collaborative detection subnetwork. Moreover, we introduce focal loss and distance intersection over union (DIoU) simultaneously to define a focal-based classification and distance-based regression multitask collaborative loss function for generating proposals in the RPN phase. The strongly and weakly supervised detectors alternately optimize their own loss functions and cooperate with each other to further improve detection performance under a weakly supervised learning procedure. Specifically, training FCC-Net involves the following three stages: (1) fine-tuning FCRN to extract more edge and detail information and utilizing CMPM to fuse multiscale features for better correlation between local and global information; (2) training a weakly supervised detector (WSD) using image-level labels and adopting the fine-tuning branch to refine the proposal results of the WSD, obtaining the final pseudo-ground-truths; (3) training a strongly supervised detector (SSD) with the pseudo-ground-truths generated by the previous steps and minimizing the overall loss function in a stage-wise fashion to optimize the training process.
In the following sections, we will describe the implementation details of the three stages.

Full-Coverage Residual Network
Objects in high-resolution remote-sensing images usually appear over large resolution spans. Many researchers introduce dilated convolution to expand the receptive field and thus extract features with a larger receptive field [59,60]. However, dilated convolution has an inherent problem called gridding artifacts: the lack of interdependence between neighboring pixels leads to the loss of local information. Given this problem, we design a novel backbone for RSOD by modifying ResNet-50.
We insert three 3 × 3 dilated convolutions with dilation rates of 2, 3 and 5 after the normal 3 × 3 convolution in the original bottleneck to form a contiguous dilated convolution combination with dilation rates of 1, 2, 3 and 5, thus constructing a new bottleneck block, namely the full-coverage bottleneck (FC bottleneck), as shown in Figure 3a. Moreover, we add two shortcuts across the two contiguous dilated convolutions, respectively, which reuse the low-level features that contribute greatly to object localization. Compared with the dilated bottleneck in the dilated residual network [55], the proposed FC bottleneck can acquire different levels of context information from a broader range of pixels without causing gridding artifacts. It also reduces the probability of gradient vanishing or explosion that often occurs in gradient propagation as the depth changes.
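The gridding issue and the choice of rates can be illustrated with a 1-D analogue: in a stack of 3-tap dilated convolutions, each layer with dilation r shifts the receptive centre by -r, 0 or +r, and the absence of gridding artifacts means the reachable input offsets form a gap-free range. This toy check (our own simplification, not the paper's code) confirms that rates 1, 2, 3 and 5 give full coverage, while repeated rate-2 convolutions leave gaps:

```python
from itertools import product

def reachable_offsets(rates):
    """1-D analogue of stacked 3-tap dilated convolutions: the set of input
    offsets an output unit can see is the set of all signed sums of the
    dilation rates, with each rate contributing -r, 0 or +r."""
    return {sum(s * r for s, r in zip(signs, rates))
            for signs in product((-1, 0, 1), repeat=len(rates))}

def full_coverage(rates):
    """True if the sampled offsets form a contiguous range, i.e., the stack
    causes no gridding artifacts in this 1-D model."""
    offs = reachable_offsets(rates)
    return offs == set(range(min(offs), max(offs) + 1))
```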

We retain the first three stages of ResNet-50, then stack six and three FC bottlenecks in stages 4 and 5, respectively, replacing the original stages 4 and 5 to effectively enhance the correlation between long-range information. The whole structure of FCRN is shown in Figure 3b. The downsampling operation is removed in stages 4 and 5 so that the resolution of the output features is maintained at 1/8 of the original image. Moreover, stages 4 and 5 keep the same input channels as stage 3, i.e., 256, since contiguous dilated convolutions are time-consuming. Our experiments demonstrate that the proposed FCRN can increase the final receptive field to cover the entire area and avoid voids or loss of edge information, which directly improves the robustness of the detector to multiscale objects in remote-sensing images. Table 1 shows the architectures of ResNet-50 and our modified network.

Cascade Multi-Level Pooling Module
To further strengthen the performance of the backbone in the feature learning stage, we introduce a cascade multi-level pooling module (CMPM) at the backend of FCRN. This module can freely fuse multiscale feature maps, adapting to the detection of various-sized objects, especially strip objects, in remote-sensing images. Figure 4 shows the structure of the proposed CMPM. First, this module adopts a multi-level pooling layer with six different-sized adaptive pooling kernels to reduce the dimension of the output feature maps of FCRN, $F_{stage5} \in \mathbb{R}^{H_1 \times W_1 \times C_1}$, where $H$, $W$ and $C$ denote the length, width and channels of the feature map, respectively. We thereby obtain contextual features at six different fixed spatial scales, $P_i = \{P_1, \ldots, P_6\}$. Among them, the fifth and sixth pooling layers are specifically designed for difficult-to-detect objects in remote-sensing images, such as bridges and ships: two small rectangular pooled features in the vertical and horizontal directions, $P_5$ and $P_6$, are added to further strengthen the feature representation of strip objects. The first pooling level uses average pooling, while the other five levels use max pooling.
Second, we use a 1 × 1 convolution to compress the dimension of each $P_i$ to 1/8 of the input feature channels, limiting the weight of global features in the subsequent fusion stage, and obtain the intermediate features $C_i = \{C_1, C_2, C_3, C_4, C_5, C_6\}$. The $C_i$ are gradually upsampled through a 3 × 3 deconvolution from top to bottom and concatenated along the channel dimension layer by layer to obtain the fused features $F_i = \{F_1, F_2, F_3, F_4, F_5, F_6, F_7\}$, which prevents the information loss that occurs when directly upsampling from the minimum size to the maximum size.
Finally, we downsample the output feature maps of stage 2 in FCRN, $F_{stage2} \in \mathbb{R}^{H_2 \times W_2 \times C_2}$, to the scale of $F_7$ obtained from the previous operation and, after concatenation, perform three convolution operations to get the final output of the backbone, $F_{Out} \in \mathbb{R}^{H_3 \times W_3 \times C_3}$. The convolution kernel sizes are 3 × 3, 3 × 3 and 1 × 1, respectively.
In the formulation of the module, $p_{avg}(*)$ and $p_{max}(*)$ denote the average-pooling and max-pooling operations, $f_{conv}(*)$ denotes convolution followed by an activation function, $f_{dconv}(*)$ denotes deconvolution followed by an activation function, and $\oplus$ denotes concatenation of feature maps along the channel dimension. The proposed module is similar to FPN, but we only utilize the fused features from stages 2 and 5 for detection, instead of making predictions on the features of each level. The experimental results show that CMPM enables abundant coarse- and fine-grained features to be shared and reused. Moreover, compared with the multiscale feature prediction and fusion in FPN, CMPM maintains accurate feature expressions, especially for strip objects in remote-sensing images.
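The multi-level pooling step can be sketched with a NumPy adaptive-pooling routine (a simplified single-channel illustration; the paper does not list the exact level sizes here, so the square and strip shapes below are assumptions):

```python
import numpy as np

def adaptive_pool(x, out_h, out_w, mode="max"):
    """Adaptively pool a single-channel 2-D map to (out_h, out_w): the input
    is split into out_h x out_w near-equal bins (floor/ceil boundaries, as in
    PyTorch-style adaptive pooling) and each bin is reduced by max or mean."""
    h, w = x.shape
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            hs, he = i * h // out_h, -((i + 1) * h // -out_h)   # floor, ceil
            ws, we = j * w // out_w, -((j + 1) * w // -out_w)
            patch = x[hs:he, ws:we]
            out[i, j] = patch.max() if mode == "max" else patch.mean()
    return out

def cmpm_levels(feature):
    """A CMPM-style pyramid: one global average level, square max levels and
    two strip levels (1xN and Nx1) for elongated objects; sizes illustrative."""
    specs = [(1, 1, "avg"), (2, 2, "max"), (3, 3, "max"),
             (6, 6, "max"), (1, 6, "max"), (6, 1, "max")]
    return [adaptive_pool(feature, oh, ow, m) for oh, ow, m in specs]
```

The two strip levels mirror the vertical and horizontal rectangular pooled features $P_5$ and $P_6$ described above.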

Collaborative Detection Subnetwork for Weakly Supervised RSOD
As mentioned above, since there are no bounding-box labels, it is a great challenge for weakly supervised methods to accurately predict the positions of the objects in remote-sensing images. Therefore, we attempt to balance this contradiction by using a two-phase training procedure, i.e., a multiple instance learning detector followed by a strongly supervised detector with bounding-box regression. In this work, we design a two-phase collaborative detection subnetwork with both multiple instance learning and bounding-box regression branches that share the same backbone, and we introduce it to the RSOD task. Specifically, we choose WSDDN as the baseline to generate pseudo-ground-truths and integrate faster R-CNN for more accurate bounding-box regression.

Weakly Supervised Detector (WSD)
As shown in Figure 3, selective search windows (SSW) [61] are first utilized to generate $J_w$ region proposals per remote-sensing image in WSDDN. The images are then fed into our proposed backbone, described above, and an ROI pooling layer, which handles features of different scales and improves the robustness of the network to scale changes in the input images. After two FC layers, the output of WSDDN is split into two branches: one branch predicts the probability $p^{cls}_{jc}$ that the object in proposal $j$ belongs to class $c$; the other predicts the probability $p^{loc}_{jc}$ that proposal $j$ contains the object, according to its position. Next, the predictions of the two branches are multiplied by element-wise product to get the class score $\hat{p}_{jc}$ of proposal $j$. Finally, we sum the class scores over all proposals in a remote-sensing image to obtain its image-level multiclass prediction $\hat{y}_c$. The binary cross-entropy (BCE) loss is adopted to train the initial instance classifier with predicted labels $\hat{y}_c$ and ground-truths $y_c$, as shown in Equation (2). Moreover, we introduce the online instance classifier refinement (OICR) algorithm [41] into the WSD to further mitigate the local-optimum problem of WSDDN in the regression phase. Specifically, we add three identical fine-tuning branches ($k = 3$) to the subnetwork, parallel to the two branches in WSDDN. Each branch applies a max-out strategy to select the highest-confidence proposal, together with the proposals that overlap it highly, as reference, thereby filtering out most redundant predictions, as given in Equation (3). The loss function for each fine-tuning branch is Equation (4), where $f^k_{jc}$ denotes the output of the $k$-th fine-tuning branch, $\hat{y}^k_{jc} \in \{0, 1\}$ is the image-level pseudo-ground-truth for each proposal and $w^k_j$ is the weight for each proposal.
Unlike the original OICR, the weight $w^k_j$ in our WSD module is modified by imposing a spatial restriction on negative labeling, alleviating the problem that multiple objects of the same class are easily mislabeled in remote-sensing images, as in Equation (5), where $i_t$ is a threshold and we set $i_t = 0.1$. For more details, please refer to [62]. After that, the final loss of the WSD is defined as Equation (6).
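The max-out supervision transfer of Equation (3) can be sketched as follows (a simplified NumPy illustration with hypothetical names; classes are score-column indices and -1 marks background):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one [x1, y1, x2, y2] box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def maxout_pseudo_labels(boxes, scores, present_classes, iou_thresh=0.5):
    """OICR-style max-out: for each class present at image level, take the
    highest-scoring proposal of the previous branch as a seed and assign its
    class to every proposal overlapping it above iou_thresh; the rest stay
    background (-1)."""
    labels = np.full(len(boxes), -1, dtype=int)
    for c in present_classes:
        seed = int(scores[:, c].argmax())
        labels[iou(boxes[seed], boxes) > iou_thresh] = c
    return labels
```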

Strongly Supervised Detector (SSD)
The network is initialized from the training of the WSD; when its loss drops below a threshold, we assume that the high-scoring region proposals produced by the network are reliable enough. These proposals are then provided as pseudo-ground-truths to train an SSD, i.e., faster R-CNN. The purpose of this branch is to exploit faster R-CNN's strength in bounding-box regression, since the pseudo-ground-truths generated by the previous two branches are rough. To reduce the number of parameters and accelerate convergence, the SSD and the WSD share the same backbone and the weights of some of the FC layers.
Faster R-CNN can be regarded as composed of RPN and fast R-CNN, of which RPN is the core part. However, directly using RPN for RSOD tasks has two limitations: (1) RPN counteracts the impact of class imbalance on network performance by fixing the ratio of positive to negative samples, but this also reduces the diversity of proposals during training; (2) RPN suffers from slow convergence and inaccurate regression in the bounding-box regression process, especially for multiscale objects in remote-sensing images. Motivated by these observations, we optimize the loss function of RPN by introducing focal loss [25] and DIoU [63] simultaneously. Specifically, we design a focal-based classification and distance-based regression multitask collaborative loss function to replace the original RPN loss, yielding more accurate bounding boxes and positive proposals in remote-sensing images. The flow chart of this loss function is shown in Figure 5. The formulation of $L^{RPN}_{focal\_DIoU}$ is given as follows, where $p_{ic}$ denotes the predicted probability of the $i$-th proposal for class $c$ and $p^*_{ic}$ the corresponding ground-truth label; $t_{ic}$ and $t^*_{ic}$ are the coordinate vectors of the predicted proposal and the ground truth, respectively; and $\lambda$ is the weight balancing the classification and bounding-box regression losses of RPN. The focal loss $L_{focal}$ and DIoU loss $L_{DIoU}$ are expressed as follows, where $\alpha$ and $\gamma$ are hyperparameters, $d_c$ denotes the diagonal length of the smallest enclosing box covering the predicted anchor box and the ground truth, and $d_\rho$ denotes the distance between the central points of the predicted anchor box and the ground truth. The region proposals selected by NMS are then used to train the subsequent fast R-CNN. Generally, training a fast R-CNN involves a classification loss and a bounding-box regression loss.
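The two ingredients of the RPN loss can be sketched in NumPy from the standard definitions of focal loss [25] and DIoU [63] (single-box and binary versions for clarity; this is an illustrative sketch, not the paper's implementation):

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss: the (1 - p_t)^gamma factor down-weights easy
    examples so abundant easy background anchors do not dominate training."""
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return float(-(alpha_t * (1 - p_t) ** gamma * np.log(p_t + 1e-12)).mean())

def diou_loss(box_p, box_g):
    """DIoU loss = 1 - IoU + rho^2 / c^2, where rho is the distance between
    box centres and c the diagonal of the smallest enclosing box; boxes are
    [x1, y1, x2, y2]."""
    ix1, iy1 = max(box_p[0], box_g[0]), max(box_p[1], box_g[1])
    ix2, iy2 = min(box_p[2], box_g[2]), min(box_p[3], box_g[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    iou = inter / (area_p + area_g - inter)
    # squared centre distance rho^2 and enclosing-box diagonal c^2
    rho2 = (((box_p[0] + box_p[2]) - (box_g[0] + box_g[2])) / 2) ** 2 \
         + (((box_p[1] + box_p[3]) - (box_g[1] + box_g[3])) / 2) ** 2
    c2 = (max(box_p[2], box_g[2]) - min(box_p[0], box_g[0])) ** 2 \
       + (max(box_p[3], box_g[3]) - min(box_p[1], box_g[1])) ** 2
    return 1 - iou + rho2 / c2
```

Unlike plain IoU loss, the centre-distance term keeps a useful gradient even when predicted and ground-truth boxes do not overlap, which is why it converges faster for the multiscale objects discussed above.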
Due to the lack of refined bounding-box labels, the actual supervision information in our SSD branch is the pseudo-ground-truths $(p_{jc}, t_{jc})$ generated by the first two weakly supervised branches. Since both the WSD and the SSD predict object-bounding boxes, we adopt a prediction consistency loss to constrain the training of the two detectors. This loss consists of the loss between the SSD and the WSD and the internal loss of the SSD, and is defined as:

$I_{ij} (\beta L_{cls\_inter}(p_{ic}, p_{jc}) + (1 - \beta) L_{cls}(p_{ic}) + p_{jc} L_{reg}(t_{ic}, t_{jc}))$ (10)

where $L_{cls\_inter}$ measures the consistency of the class predictions of the two detectors using multiclass cross-entropy, $L_{reg}$ is the smooth $L_1$ loss and $I_{ij}$ indicates the overlap of the regions proposed by the two detectors: $I_{ij}$ is set to 1 if their IoU is greater than 0.5 and to 0 otherwise. $\beta$ is a hyperparameter between 0 and 1 that balances the consistency of the predictions of the two detectors; the larger $\beta$ is, the more the SSD trusts the object-bounding boxes predicted by the WSD.


Overall Loss Function
After introducing our collaborative detection subnetwork, we can formulate the total loss function for FCC-Net. Our FCC-Net is trained by optimizing the following composite loss functions from the four components using stochastic gradient descent: Empirically, we set the hyperparameters α = 0.25, β = 0.8, γ = 2 and λ = 1 of the individual loss functions in the following experiments.
To make the overall training strategy of FCC-Net clearer, we summarize the process in Algorithm 1.

Algorithm 1 FCC-Net Algorithm
Inputs: remote-sensing image I and image-level labels y_c
Output: detection results
Step 1: Feature Extraction and Fusion
(1) Extract features of I by the backbone FCRN
…
Step 2: Collaborative Training
while the iteration count is less than the maximum or the loss is larger than the threshold do
  (1) Propose region proposals of I by SSW
  …
  (4) Max-out on the predictions generated from the previous branch by Equation (3)
  …
  if the total loss (Equation (11)) falls below the threshold, stop training
end while
Datasets
We evaluate the superiority and generalization of our method on two large public multiclass remote-sensing image datasets. Figure 6 illustrates the number of objects of each class in the two datasets.

The TGRS-HRRSD dataset contains a total of 21,761 high-altitude images from Google Earth and Baidu Map. The spatial resolution of these images ranges from 0.15 m to 1.2 m, and this wide range of resolutions makes the object detection task difficult. The whole dataset contains a total of 55,740 object instances in 13 categories. It is divided into three subsets: the training set contains 5401 images, the validation set contains 5417 images and the test set contains 10,943 images, so the trainval set and the test set each account for about 50% of the dataset.

The DIOR dataset contains 23,463 high-altitude remote-sensing images and 192,472 object instances covering 20 categories. The images are 800 × 800 pixels in size and their spatial resolutions range from 0.5 m to 30 m. These images were carefully selected from Google Earth, and the dataset has the largest scale in both the number of images and the number of object categories. The collectors fully considered different weather, seasons, imaging conditions and image quality, which makes the background variations of the dataset rich and diverse. Moreover, since there are many categories of target objects, the dataset has high interclass similarity and intraclass diversity, making it much more challenging for the object detection task. It is divided into three subsets: the training set contains 5862 images, the validation set contains 5863 images and the test set contains 11,738 images, so the trainval set and the test set each account for about 50% of the dataset.

Evaluation Metrics
To quantitatively evaluate the performance of the proposed method in this work, we adopt the widely used precision-recall curve (PRC), average precision (AP), mean average precision (mAP) and correct location (CorLoc).

Precision-Recall Curve
The PRC plots precision on the Y-axis against recall on the X-axis, so before generating the PRC we need to calculate precision and recall. Precision is the proportion of correctly detected objects among all detected objects, and recall is the proportion of correctly detected objects among all ground-truth positive examples.

Average Precision and Mean Average Precision
AP is a common metric in the fields of object detection and information retrieval. As normally defined, the average precision is the mean precision over recall values in the interval from 0 to 1, which is also the area under the precision-recall curve. A higher average precision indicates better model performance. Moreover, mAP is the mean of the AP over all classes.
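As an illustration of how AP follows from the precision and recall defined above, a minimal NumPy sketch (all-point interpolation; the function and argument names are ours, not from the paper):

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """Area under the precision-recall curve for one class.

    scores: confidence of each detection
    is_tp:  1 if the detection matches a ground-truth object, else 0
    num_gt: number of ground-truth objects of this class
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(fp)
    recall = tp_cum / num_gt
    precision = tp_cum / (tp_cum + fp_cum)
    # Pad both ends, then make precision monotonically
    # non-increasing when scanned from the right
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Integrate precision over recall
    return float(np.sum(np.diff(mrec) * mpre[1:]))
```

mAP is then simply the mean of this value over all classes.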

Correct Location
CorLoc is another common evaluation metric for weakly supervised object detection. It measures localization accuracy as the proportion of positive images in which the detector correctly localizes at least one object of the target class (IoU greater than 0.5 with a ground-truth box). CorLoc is evaluated on the union of the training set and validation set, while AP is measured on the test set.
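The CorLoc criterion above can be sketched as a short function (a hedged illustration under the IoU > 0.5 convention; the helper names are ours):

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def corloc(images):
    """images: one (top_predicted_box, ground_truth_boxes) pair per
    positive image; returns the fraction of images in which the
    prediction overlaps some ground truth with IoU > 0.5."""
    correct = sum(
        any(iou(pred, gt) > 0.5 for gt in gts) for pred, gts in images
    )
    return correct / len(images)
```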

Implementation Details
The network proposed in this study can be trained end to end. The whole framework is implemented on Ubuntu 16.04 and Python 2.7 with PyTorch 0.2. During training, we adopt stochastic gradient descent (SGD) with a batch size of 4 for optimization. The momentum and weight decay are set to 0.9 and 0.0001, respectively. The proposed FCC-Net is trained for 20 K iterations; the initial learning rate is set to 0.001 for the first 10 K iterations and then decays to 0.0001 for the next 10 K. All backbones adopt ImageNet pre-trained weights when possible and are then fine-tuned on the two benchmarks used in this work. We conducted all experiments on a computer with an Intel Xeon E5-2650 v2 CPU and a single NVIDIA GTX TITAN-V GPU for acceleration. The whole training converges in 47 h on the TGRS-HRRSD dataset and 44 h on the DIOR dataset. Our method achieves ~0.8 fps with inputs of ~1000 × 1000 pixels (i.e., the TGRS-HRRSD dataset) and ~1.1 fps with inputs of 800 × 800 pixels (i.e., the DIOR dataset).

Evaluation on TGRS-HRRSD Dataset
Table 2 presents the quantitative comparison results of the eleven different methods in terms of AP and mAP on the TGRS-HRRSD dataset. To save space, we denote the class names by C1 to C13 according to the order in Figure 6a. The double underscore in the table separates strongly supervised from weakly supervised methods for clarity. Our proposed FCC-Net significantly outperforms the three traditional approaches and the two weakly supervised methods in terms of mAP, by a margin of at least 10.3%. For the airplane, ship, ground track field and storage tank classes, our approach achieves better AP values than WSDDN and OICR and is also competitive with the four strongly supervised methods.
Based on our observations, we believe this is because objects of these categories have moderate size, highly recognizable shapes and a relatively low probability of co-occurrence with objects of other classes, so they are easy to distinguish. However, for the crossroad, T junction and parking lot classes, our method is far less accurate than the strongly supervised methods. From the experimental results of faster R-CNN, we can find some commonalities with the BoW method; for example, the AP values of basketball court and parking lot are low for both models. Moreover, in addition to the two-stage models, we perform experiments on the single-stage model YOLOv2. Its mAP reaches 65.8 and its AP value for each category is similar to that of faster R-CNN. Whether single-stage or two-stage, the deep learning-based methods all show similar performance, which we attribute to the similarity of their backbones in feature extraction. Meanwhile, Table 3 reports the performance in terms of CorLoc. Compared with WSDDN and OICR, our method obtains gains of 19.1% and 13.3%, respectively. A possible explanation is that the spatial restriction introduced in the fine-tuning branches allows more objects of the same class to be retained and improves the ability of multi-instance mining. In general, although there is still a gap between our method and strongly supervised methods such as YOLO and faster R-CNN, we have made remarkable progress in narrowing the gap between weakly and strongly supervised methods. The above results show that a variety of factors can affect the performance of RSOD. First, the complexity of the backbone network and robustness to multiscale objects are the primary problems in remote-sensing image feature extraction. Different object sizes and spatial image resolutions limit general image classification networks in the RSOD task, so a new backbone needs to be designed.
Second, objects in remote-sensing images are generally small. When the backbone extracts high-level features of small objects, the feature-map resolution is low, so some feature information may be lost, whereas low-level features have high resolution and contain more positional and detailed information. Therefore, fusing multiscale features can enhance the relevance of context information and improve small object detection. Lastly, the optimization strategy in the RPN can also, to a certain extent, enhance the robustness of the detection subnet to objects of different shapes and angles in remote-sensing images.

Evaluation on DIOR Dataset
To further test our method on remote sensing datasets with more categories, we also conduct comparative experiments on the DIOR dataset. We test faster R-CNN on this dataset, and its mAP reaches 54.1%, a performance degradation of more than 25% compared to the TGRS-HRRSD dataset, which is consistent with the description of the dataset in Section 4.1.
Due to the complexity of this dataset, it is more difficult for the backbone to extract robust features, and the object detection performance decreases accordingly. In addition, the varying spatial resolutions of the data make it hard to learn multiscale features with a relatively shallow backbone. To this end, we need to further modify the backbone, that is, expand both its depth and width. To verify this idea, we also carry out evaluations on RetinaNet, a single-stage detector, with ResNet-50 and ResNet-101 as the backbones, respectively. In Table 4, the experimental results show that a deeper and more complex backbone brings performance improvements, increasing the mAP from 65.7% to 66.1%. This weak improvement of 0.4% may be because RetinaNet itself has reached a saturation state on this benchmark. Furthermore, although more complex backbones can indeed extract more robust features, further improvement may be difficult due to the limitations of the network itself. In the evaluation of our method, we set Res50, FCRN and FCRN-CMPM as the backbones, respectively, and carry out more comparisons. The results are consistent with the distribution on the TGRS-HRRSD dataset in Table 2, and the mAPs are slightly improved. However, the optimal mAP on this dataset is nearly 30% lower than that on the TGRS-HRRSD dataset. The gap between these models is not large, which confirms our hypothesis about the dataset and the effectiveness of this work. In Table 5, we list the AP values of each class in the DIOR dataset. To save space, we denote the class names by C1 to C20 according to the order in Figure 6b. The double underscore in the table separates strongly supervised from weakly supervised methods for clarity.
It can be seen clearly that the DIOR dataset is much more challenging than the TGRS-HRRSD dataset: the detection performance of the two strongly supervised methods is much better than that of all three weakly supervised methods. However, compared with the other two state-of-the-art weakly supervised methods, our method outperforms WSDDN and OICR by 5% and 1.8% in terms of mAP, respectively. For the ship and bridge classes, which are difficult to detect, our method achieves the best performance among the three weakly supervised methods. Moreover, for the baseball field, ground track field and stadium classes, our method remains very competitive with the two strongly supervised methods without any bounding-box supervision. At the same time, Table 6 shows that our method improves on WSDDN and OICR by 9.3% and 6.9% in terms of CorLoc, respectively.
Combining the two evaluation metrics, we find that our proposed network does not degrade in performance on this more difficult, large-scale benchmark. There are three main reasons: (1) the combination of FCRN and CMPM fully improves the utilization and sharing of abundant multiscale features, thereby increasing the network's ability to represent the features of hard instances; (2) the addition of three fine-tuning branches enables the WSD to generate more accurate pseudo-ground-truths, which are provided to the SSD for further bounding-box regression; (3) the multitask collaborative loss mitigates the adverse effects of complex backgrounds and prevents the network from overfitting negative samples in the classification stage.
For a more intuitive understanding, we plot the PRCs of the proposed network on the two benchmarks. As shown in Figure 7, at the same precision, the recall of FCC-Net is higher than that of any comparison method, which means that our method detects more actual objects; at the same recall, the precision of FCC-Net is higher than that of any comparison method, which means that the false alarm rate of our method is lower. These observations demonstrate that our method is highly competitive with other state-of-the-art weakly supervised object detection methods.

Ablation Experiments
To evaluate the effectiveness of our proposed FCC-Net, we conducted ablation experiments on the TGRS-HRRSD dataset to further analyze the contributions of the three key components of this work: FCRN, CMPM and the multitask collaborative loss. In the ablation experiments, we gradually add the components used in this work and carry out comparative experiments. In Table 7, FCC-FCRN denotes the collaborative subnetwork based on FCRN, FCC-FCRN-CMPM denotes the addition of CMPM on this basis, and FCC-FCRN-CMPM-MCL denotes the further addition of the multitask collaborative loss.
(1) Impact of the full-coverage residual network. With VGG16 as the backbone of the collaborative detection subnetwork, the mAP reaches 46.6%, which shows that collaborative training has advantages over a purely weakly supervised model. We then tried two deeper and more complex backbones, ResNet-50 and ResNet-101, and the results show that complex networks have advantages in feature representation and bring a certain degree of improvement. Our proposed FCRN outperforms ResNet-101 by 0.4% in terms of mAP, which confirms that the FC bottleneck greatly reduces the loss of information during feature extraction and improves the detection performance of the entire network.
(2) Impact of the cascade multi-level pooling module. Using FCRN and CMPM together, we find that the mAP is further improved by 0.6%. For the bridge and ship classes, FCC-FCRN-CMPM achieves better AP results than the previous weakly supervised methods. CMPM successfully deals with the low feature recognizability of strip-like objects in remote-sensing images, which further confirms that the multifeature fusion strategy is effective for mining objects under weakly supervised settings, especially objects of small size or special shape. The combination of FCRN and CMPM significantly improves the scale robustness of the network.
(3) Impact of the multitask collaborative loss. Combining all the components proposed in this study, the mAP reaches 48.3%; the accuracy of the model fluctuates only slightly, by 0.1%. The main reason is that the predicted bounding boxes extracted by the RPN in our framework are very dense, so the IoU values between the predicted bounding boxes and the ground-truth boxes already fall within a favorable range. Thus, the improvement brought by the multitask collaborative loss is relatively limited, but the accuracy is still much higher than that of WSDDN.

Qualitative Results
We present some of the detection results on the TGRS-HRRSD and DIOR datasets in Figures 8 and 9, respectively. Our method successfully detects most classes of objects and produces precise, tight bounding boxes on both datasets. When a remote-sensing image contains multiple objects to be detected, our method also performs well. However, we observe that our method tends to treat densely arranged small objects, such as ships and vehicles, as a whole and cannot detect every single object. This is a direction we will focus on in future work.

Conclusions
Traditional strongly supervised object detection methods require a large amount of finely labeled image data, which is costly to obtain and generalizes poorly. Considering the gap between remote-sensing images and general images, we propose a novel collaborative network (FCC-Net) in this study. We jointly train a WSD and an SSD in an end-to-end manner and feed the predictions of the WSD into the SSD as pseudo-ground-truths, making full use of their respective advantages to realize the collaborative optimization of region classification and regression. Moreover, a scale-robust backbone is designed to enhance the feature learning of multiscale objects in remote-sensing images. The quantitative evaluations on the TGRS-HRRSD and DIOR datasets demonstrate the effectiveness of the proposed method. Specifically, compared with two previous state-of-the-art weakly supervised methods and four traditional methods, FCC-Net achieves substantial improvements in terms of mAP. Moreover, the detection performance of FCC-Net in several classes is highly competitive even compared with some state-of-the-art strongly supervised methods. Through experiments, we also reveal an important observation: a deeper and more complex backbone has advantages over a shallow one and is better suited to diversified RSOD tasks.