Instance-Level Contrastive Learning for Weakly Supervised Object Detection

Weakly supervised object detection (WSOD) has received increasing attention in the object detection field because it requires only image-level annotations indicating the presence or absence of target objects, which greatly reduces labeling costs. Existing methods usually focus on the current individual image to learn object instance representations, ignoring instance correlations between different images. To address this problem, we propose an instance-level contrastive learning (ICL) framework that mines reliable instance representations from all learned images and uses a contrastive loss to guide instance representation learning for the current image. Because instances are diverse, with different appearances, sizes, and shapes, we propose an instance-diverse memory updating (IMU) algorithm to mine different instance representations and store them in a memory bank with multiple representation vectors per class; background information is also considered to enhance the foreground representations. With the help of the memory bank, we further propose a memory-aware instance mining (MIM) algorithm that combines proposal confidence and instance similarity across images to mine more reliable object instances. In addition, we propose a memory-aware proposal sampling (MPS) algorithm that samples more positive proposals and removes some negative proposals to balance the learning of positive and negative samples. We conduct extensive experiments on the PASCAL VOC2007 and VOC2012 datasets, which are widely used in WSOD, to demonstrate the effectiveness of our method. Compared to our baseline, our method brings 14.2% mAP and 13.4% CorLoc gains on the PASCAL VOC2007 dataset, and 12.2% mAP and 8.3% CorLoc gains on the PASCAL VOC2012 dataset.

Due to the lack of bounding-box supervision, most current WSOD methods [14][15][16][17][18][19][20][21][22][23][24][25][26][27] use multiple instance learning (MIL) [28] to mine object instances from pre-generated proposals and treat them as pseudo instance-level annotations to train weakly supervised detectors. However, these methods focus only on a single image to learn object representations, without considering the internal relevance of object instances across images. When object appearance varies in complex, diverse image scenes, false detections easily occur. For example, when the horse in Figure 1 is occluded, the detector focuses on local feature representations, which are insufficient to represent the whole object, so the learned instance covers only the head of the horse. To deal with this problem, we propose an instance-level contrastive learning (ICL) framework to store reliable instance representations from all learned images, and utilize the contrastive learning [29] mechanism to explicitly establish semantic correlations with instances in other images. It enhances the discriminativeness and robustness of object instance representations in the current input image, pulling each representation close to same-class instance representations from all training images and pushing it away from instance representations of different classes. As shown in Figure 1b, owing to instance correlations with other images, our method can effectively learn the instance representation for the whole horse.
To sufficiently represent the diverse instances in all training data, we next propose an instance-diverse memory updating (IMU) algorithm. It mines reliable instance representations from proposal features and builds a memory bank with multiple representation vectors for each class, storing them based on similarity; background information is also considered to enhance the foreground representations. Based on the memory bank, we further propose a memory-aware instance mining (MIM) algorithm. Unlike most methods [14,15,20,21,22,26,27] that mine object instances based only on proposal confidence, we also compute the similarity with the stored diverse instances to evaluate the completeness of proposals and mine more reliable object instances. Instead of selecting only the top-scoring proposal as an instance, we also consider the multi-instance case to mine more instances per image. For the training of weakly supervised object detectors, we propose a memory-aware proposal sampling (MPS) algorithm to alleviate the imbalance between positive and negative samples. According to the similarity with the instance representations, we select more positive proposals to increase the number of positive samples; according to the similarity with the background information, we remove some negative proposals with low similarity to reduce the number of negative samples.
To verify the effectiveness of our method, we conduct extensive experiments on the PASCAL VOC2007 and VOC2012 datasets, which are widely used for weakly supervised object detection. In this paper, we adopt the typical WSOD method OICR [15] as our baseline; it can be easily embedded into our ICL framework to further improve performance. On the PASCAL VOC2007 dataset, our method improves detection performance and localization accuracy by 14.2% and 13.4% in terms of mAP and CorLoc, respectively. On the PASCAL VOC2012 dataset, our method improves performance by 12.2% mAP and 8.3% CorLoc.
The contributions of this paper are summarized as follows:
• We propose an instance-level contrastive learning (ICL) framework to guide the weakly supervised detector to learn instance representations. To the best of our knowledge, we are the first to explore contrastive learning in weakly supervised object detection.
• We propose an instance-diverse memory updating (IMU) algorithm to store reliable instance representations in a memory bank, where multiple representation vectors are used for each class to maintain the diversity of instance representations.
• With the help of the memory bank, we further propose a memory-aware instance mining (MIM) algorithm to efficiently mine object instances by combining proposal confidence and instance similarity.
• With the help of the memory bank, we also propose a memory-aware proposal sampling (MPS) algorithm to alleviate the imbalance between positive and negative samples by finding more positive proposals and removing some unreliable negative proposals.

Related Work
In this section, we review the two topics most relevant to this paper: weakly supervised object detection and contrastive learning.

Weakly Supervised Object Detection
Since Hakan Bilen and Andrea Vedaldi proposed the weakly supervised deep detection network (WSDDN) [20], which combines MIL and a CNN into an end-to-end network, most WSOD methods [14][15][16][17][18][19]21,22,26,27] have followed this pipeline to train weakly supervised detectors. In MIL, an image is treated as a bag of proposals. If the image contains an object class, the bag is labelled as a positive bag, i.e., it contains at least one object instance of this class; otherwise it is labelled as a negative bag. Due to the lack of instance-level annotations, MIL tends to get stuck in a local optimum, locating only the most representative part of target objects. Subsequently, researchers have proposed promising approaches to alleviate this problem. For instance, Kantorov et al. [14] proposed additive and contrastive context-aware guidance models to improve localization by using the surrounding contextual region of proposals. Tang et al. [15] proposed an online instance classifier refinement (OICR) method that uses spatial correlations between proposals to refine mined instances. Wan et al. [21] proposed continuation multiple instance learning (C-MIL) to alleviate the local-optimum problem of MIL, using smoothed loss functions to approximate the original non-convex loss function. Lin et al. [22] proposed an object instance mining (OIM) framework that builds spatial and appearance graphs of proposals to mine all possible object instances. Furthermore, some methods [16][17][18][19] introduce segmentation information to assist instance mining. Shen et al. [16] proposed a recurrent guidance strategy for weakly supervised detection and segmentation, where the detection module generates seeds for semantic segmentation and the segmentation module provides prior information for object detection. Yang et al. [17] proposed an objectness-consistent representation method that exploits the segmentation map to mine more high-quality proposals. Wei et al. [18] used segmentation context information around proposals to discover tight object bounding boxes. Li et al. [19] leveraged the segmentation map to reweight proposal scores. However, these methods only consider information from a single image, which makes it difficult to deal with diverse object instances. In this paper, our method explores semantic correlations beyond the input image to assist object instance mining.

Contrastive Learning
Contrastive learning [29] has been widely used in unsupervised representation learning (e.g., SimCLR [30] and MoCo [31]); it contrasts positive and negative pairs to pull together different view representations of the same image and push apart view representations of different images. In addition, contrastive learning-based methods [32][33][34][35] have also achieved promising performance in other vision tasks. For instance, Yan et al. [33] proposed a semantics-guided contrastive network that introduces contrastive learning into zero-shot object detection to transfer available semantic information to unseen classes. Wu et al. [34] proposed a contrastive learning-based robust object detection algorithm to detect objects under smoky conditions, applying contrastive learning to maximize the consistency between different augmented views of the same smoke image. Li et al. [35] introduced contrastive learning into remote sensing image semantic segmentation to learn global and local image representations. However, these methods are difficult to apply directly to WSOD, which learns the detector from image-level annotations. In this paper, we introduce contrastive learning into weakly supervised object detection and propose an instance-level contrastive learning framework. To the best of our knowledge, we are the first to explore contrastive learning for weakly supervised object detection.

Method
In this section, we first describe our instance-level contrastive learning (ICL) framework in detail. Then, we present the instance-diverse memory updating (IMU) algorithm, memory-aware instance mining (MIM) algorithm and memory-aware proposal sampling (MPS) algorithm.

Instance-Level Contrastive Learning
In Figure 2, we present the pipeline of the instance-level contrastive learning (ICL) framework. Given an input image I and the corresponding proposals R generated by proposal generation methods [36][37][38], we first extract image features F_I using a convolutional neural network. Based on the pre-generated proposals R, we convert the image features F_I into proposal features F_R through an RoI-pooling layer, and use two fully connected (FC) layers to obtain proposal vector representations F_V. After mining reliable instance representations from F_V, we perform contrastive learning (CL) and store them in the memory bank M. In addition, F_V is fed into several parallel detection heads, where a base head is supervised by the image label Y = [y_1, y_2, ..., y_C]^T ∈ R^{C×1} and K refined heads are supervised by the outputs of the previous heads, where C is the number of classes. In this paper, we set K = 3, the same as our baseline method [15].
Unsupervised representation learning [30,31] performs contrastive learning by augmenting an image into different views, where views of the same image are pulled closer and views of different images are pushed apart. In this paper, we introduce contrastive learning of object instance representations to guide the detector to learn the entire representation of each instance. Specifically, we first denote all outputs of the refined heads as ({ϕ_1, ϕ_2, ..., ϕ_K}, {t_1, t_2, ..., t_K}). To mine more reliable instances, we average these outputs to obtain the proposal scores ϕ = (1/K) Σ_{k=1}^{K} ϕ_k and the bounding-box coordinate offsets t = (1/K) Σ_{k=1}^{K} t_k. Applying the coordinate offsets t to transform R, we obtain the transformed proposals P. Then, we exploit non-maximum suppression (NMS) to mine as many object instances as possible. For a positive class c, we use NMS to gradually select object instances from the transformed proposals P_c according to the proposal scores ϕ_c from high to low, and remove redundant proposals. Then, we set a score threshold T_1 to obtain more reliable object instances D, and extract the corresponding instance feature representations F_D from F_V. The detailed procedure can be seen in Algorithm 1. For each mined instance representation q ∈ F_D, we utilize the memory instance representations M accumulated over all training data to assist instance learning in the current image. Let a positive key k^+ ∈ M be a memory representation of the same class as q, and a negative key k^- ∈ M be a memory representation of a different class. Then, we use the contrastive loss [39] of Equation (1) to pull q close to k^+ of the same class while pushing it away from the negative keys k^- of other classes, thus enhancing the discrimination and generalization of the current instance representation:

L_con = -ϕ_q · log( exp(q · k^+ / τ) / ( exp(q · k^+ / τ) + Σ_{k^-} exp(q · k^- / τ) ) ),   (1)

where we take the proposal score ϕ_q as the loss weight and τ is the temperature hyperparameter. Figure 2 also shows the heads: there is one base head (Base-H) and three refined heads (R-H1, R-H2, R-H3).
The base head is supervised by image labels, while each refined head is supervised by the previous parallel head. Dashed arrows indicate the supervision information. The detailed network structure of these heads can be found in the lower boxes, where each box corresponds to a submodule in the pipeline. Three refined heads have the same network structure but do not share parameters. The processes of memory-aware instance mining (MIM) algorithm, memory-aware proposal sampling (MPS) algorithm and contrastive learning (CL) are also shown in boxes.
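For concreteness, the weighted contrastive objective described above (an InfoNCE-style loss [39]) can be sketched in a few lines of NumPy. The array shapes, the averaging over multiple positive keys, and the default temperature value are our own illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def contrastive_loss(q, pos_keys, neg_keys, weight=1.0, tau=0.07):
    """Weighted InfoNCE-style contrastive loss (sketch of Equation (1)).

    q: (L,) mined instance representation.
    pos_keys: (P, L) memory vectors of the same class as q.
    neg_keys: (Q, L) memory vectors of other classes.
    weight: proposal confidence phi_q used to scale the loss.
    tau: temperature hyperparameter (0.07 is an assumed default).
    """
    q = q / np.linalg.norm(q)
    pos = pos_keys / np.linalg.norm(pos_keys, axis=1, keepdims=True)
    neg = neg_keys / np.linalg.norm(neg_keys, axis=1, keepdims=True)
    l_pos = np.exp(pos @ q / tau)        # similarities to same-class keys
    l_neg = np.exp(neg @ q / tau).sum()  # summed similarities to other-class keys
    # average the per-positive-key InfoNCE terms, scaled by the confidence
    return float(weight * np.mean(-np.log(l_pos / (l_pos + l_neg))))
```

The loss shrinks as q aligns with same-class memory keys and grows as it drifts toward other-class keys, which is exactly the pull/push behavior the framework relies on.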
Subsequently, we describe the training of the heads. The base head has two parallel branches. One branch uses an FC layer to generate a matrix x^C ∈ R^{C×|R|} (|R| is the number of proposals), which is then input to a class-wise softmax layer: [σ_cls(x^C)]_{cr} = exp(x^C_{cr}) / Σ_{c'=1}^{C} exp(x^C_{c'r}). The other branch uses an FC layer and a proposal-wise softmax layer to generate another matrix: [σ_det(x^D)]_{cr} = exp(x^D_{cr}) / Σ_{r'=1}^{|R|} exp(x^D_{cr'}). Then, element-wise matrix multiplication of these two matrices yields the proposal scores x^R = σ_cls(x^C) ⊙ σ_det(x^D). Finally, the image class score is calculated by summing over all proposal scores: φ_c = Σ_{r=1}^{|R|} x^R_{cr}. According to the image label Y = [y_1, y_2, ..., y_C]^T, the loss of the base head is the binary cross-entropy of Equation (2):

L_base = -Σ_{c=1}^{C} [ y_c log φ_c + (1 - y_c) log(1 - φ_c) ].
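The two-branch base head and its image-level loss can be sketched as follows in NumPy; the toy shapes and the clipping constant are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def base_head(x_cls, x_det, y):
    """Two-branch base head and image-level BCE loss (sketch of Equation (2)).

    x_cls, x_det: (C, |R|) raw FC outputs of the two branches.
    y: (C,) binary image labels.
    """
    x_c = softmax(x_cls, axis=0)   # class-wise softmax over the C classes
    x_d = softmax(x_det, axis=1)   # proposal-wise softmax over the |R| proposals
    x_r = x_c * x_d                # element-wise product -> proposal scores
    phi = x_r.sum(axis=1)          # image class scores, one per class, in (0, 1)
    phi = np.clip(phi, 1e-6, 1 - 1e-6)  # guard the logs
    loss = -(y * np.log(phi) + (1 - y) * np.log(1 - phi)).sum()
    return phi, loss
```

Because each proposal-wise softmax row sums to one, every image class score lands in (0, 1), so the binary cross-entropy is well defined without any extra normalization.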
For the K refined heads, the training process is identical. Specifically, for the k-th head, there is a classifier and a regressor. In the classifier, an FC layer and a class-wise softmax are used to generate proposal scores ϕ^k ∈ R^{(C+1)×|R|}, where C + 1 means the background class is included. In the regressor, an FC layer is used to produce the coordinate offsets of proposals t^k ∈ R^{4C×|R|}, where 4 is the dimension of the coordinate offsets (x_1, y_1, x_2, y_2). To generate their supervision, we first use the memory-aware instance mining (MIM) algorithm to mine multiple representative object instances B^k based on the score and offset outputs of the previous head (ϕ^{k-1}, t^{k-1}) and the memory bank M; the details can be seen in Section 3.3. Then, we use the memory-aware proposal sampling (MPS) algorithm of Section 3.4 to sample positive and negative proposals (R_pos, R_neg) from the proposals R and assign labels to them. For a positive class c, if a proposal p is selected as a positive sample, p is labeled as class c, i.e., y^k_{c,p} = 1. All negative proposals R_neg are labeled as the background class C + 1. In this way, the classifier can be trained by the weighted cross-entropy loss of Equation (3):

L^k_cls = -(1/|R|) Σ_{p ∈ R_pos ∪ R_neg} ϕ^k_p Σ_{c=1}^{C+1} y^k_{c,p} log ϕ^k_{c,p},

where we also use the confidence ϕ^k_p of the instance assigned to p as the loss weight. For the regressor, only the positive proposals R_pos are used to calculate the smooth-L1 loss [4] of Equation (4):

L^k_reg = (1/|R_pos|) Σ_{p ∈ R_pos} smooth_L1(t^k_p - T^k_p),

where T^k is the supervision of the coordinate offsets. In summary, our ICL framework is trained end-to-end with the total loss of Equation (5):

L = L_base + Σ_{k=1}^{K} (L^k_cls + L^k_reg) + L_con.
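Assuming the weighted cross-entropy and smooth-L1 forms above, the refined-head losses can be sketched in NumPy; the interface, shapes, and per-proposal weighting are our own simplifications.

```python
import numpy as np

def smooth_l1(diff):
    """Element-wise smooth-L1: 0.5*d^2 for |d| < 1, |d| - 0.5 otherwise."""
    a = np.abs(diff)
    return np.where(a < 1.0, 0.5 * a * a, a - 0.5)

def refined_head_losses(cls_prob, labels, weights, t_pred, t_gt, pos_mask):
    """Weighted classification and regression losses of a refined head.

    cls_prob: (|R|, C+1) softmax probabilities (background is the last class).
    labels:   (|R|,) assigned class index per proposal.
    weights:  (|R|,) confidence of the mined instance assigned to each proposal.
    t_pred, t_gt: (|R|, 4) predicted offsets and regression targets.
    pos_mask: (|R|,) boolean mask of positive proposals.
    """
    n = len(labels)
    p = np.clip(cls_prob[np.arange(n), labels], 1e-6, None)
    l_cls = -(weights * np.log(p)).mean()           # weighted cross-entropy
    if pos_mask.any():                              # regress positives only
        l_reg = smooth_l1(t_pred[pos_mask] - t_gt[pos_mask]).sum(axis=1).mean()
    else:
        l_reg = 0.0
    return float(l_cls), float(l_reg)
```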

Instance-Diverse Memory Updating Algorithm
In order to enable the network to memorize instance representations from previously seen training images, we first initialize a memory bank M ∈ R^{(C+1)×N×L}, where N is the number of stored instance representations per class and L is the length of the feature vectors F_V. Since instances of the same class differ in size, shape, and appearance, we use multiple feature vectors per class instead of a single vector to store richer instance representations. We first use Algorithm 1 to obtain reliable instance representations F_D from F_V. For each instance representation f_{d,c} ∈ F_D, we calculate its similarity to M_c = {f_{c,1}, f_{c,2}, ..., f_{c,N}} in Equation (6):

S_d = (f_{d,c} / ||f_{d,c}||) × (M_c / ||M_c||)^T,

where || · ||, × and T denote L2 normalization, matrix multiplication, and transpose, respectively. Then, we select the most similar feature f_{c,j} from M_c in Equation (7), j = argmax_{n ∈ {1,...,N}} S_{d,n}, so that the stored representation best assists the current instance. Finally, we update the feature vector f_{c,j} according to Equation (8) for the instance contrastive learning of subsequent images:

f_{c,j} ← r · f_{c,j} + (1 - r) · ϕ_{f_{d,c}} · f_{d,c},

where r is the momentum coefficient [31] and ϕ_{f_{d,c}} is the confidence of the instance representation; together they control the balance between the previous instance representation and the current one. The whole process can be seen in Algorithm 2.
Algorithm 2 Instance-diverse memory updating algorithm.
Input: the pre-generated proposals R, the pre-defined score threshold T_1, the image label Y, the memory bank M, and the outputs of the instance refined heads.
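The similarity, selection, and momentum update of Equations (6)-(8) can be sketched as follows in NumPy; the exact weighting in the update is our reading of Equation (8) and may differ from the authors' implementation.

```python
import numpy as np

def imu_update(memory_c, f, conf, r=0.99):
    """Instance-diverse memory update for one class (sketch of Eqs. (6)-(8)).

    memory_c: (N, L) stored vectors for class c (updated in place).
    f:        (L,) mined instance representation f_{d,c}.
    conf:     its confidence phi_{f_{d,c}}.
    r:        momentum coefficient [31] (0.99 is an assumed value).
    """
    m = memory_c / np.linalg.norm(memory_c, axis=1, keepdims=True)
    s = m @ (f / np.linalg.norm(f))   # Eq. (6): cosine similarity to each slot
    j = int(np.argmax(s))             # Eq. (7): pick the most similar slot
    # Eq. (8): confidence-weighted momentum update of that slot
    memory_c[j] = r * memory_c[j] + (1 - r) * conf * f
    return j, memory_c
```

High-confidence instances therefore move their nearest memory slot toward themselves, while the other N - 1 slots keep representing differently shaped or sized instances of the class.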

Memory-Aware Instance Mining Algorithm
With the help of the memory bank M, we propose a memory-aware instance mining (MIM) algorithm to effectively mine reliable object instances. Different from our baseline [15], which only selects the top-scoring proposal as the pseudo instance annotation, we comprehensively consider the confidence of proposals and the similarity between proposal features and the memory bank covering previous training data. Specifically, we first calculate the similarity S between F_V and M according to Equation (9):

S = (F_V / ||F_V||) × (M / ||M||)^T.

Then, we select the highest similarity along the N feature vectors of each class and apply a class-wise softmax to generate the memory-based confidence ϕ^M through Equation (10):

ϕ^M_{c,p} = exp(max_n S_{p,c,n}) / Σ_{c'} exp(max_n S_{p,c',n}).

For the k-th branch of the instance refinement heads, we further calculate the combination confidence ψ^k in Equation (11):

ψ^k = (1 - µ) ϕ^{k-1} + µ ϕ^M,

where µ is the combination coefficient. Next, we use the NMS algorithm to remove redundant proposals and set a score threshold T_2 to remove unreliable proposals. In this way, we obtain the reliable instances B^k. More details can be found in Algorithm 3.

Algorithm 3 Memory-aware instance mining algorithm.

Input: the pre-generated proposals R, the pre-defined score threshold T_2, the image label Y, the memory bank M, the outputs of the k-th instance refined head (ϕ^k, t^k), and the proposal feature vectors F_V.
(I) Obtain the transformed proposals R^k_t by adding t^k to R.
(II) Calculate the memory-based confidence ϕ^M by Equation (10).
(III) Compute the combination confidence ψ^k with Equation (11).
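A minimal NumPy sketch of the memory-aware confidence of Equations (9)-(11); the (1 - µ)/µ combination form is an assumption consistent with µ being the combination coefficient.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def combined_confidence(phi_prev, feats, memory, mu=0.1):
    """Memory-aware combination confidence (sketch of Eqs. (9)-(11)).

    phi_prev: (C+1, |R|) score output of the previous head.
    feats:    (|R|, L) proposal feature vectors F_V.
    memory:   (C+1, N, L) memory bank M.
    mu:       combination coefficient (0.1 as in the paper).
    """
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    m = memory / np.linalg.norm(memory, axis=2, keepdims=True)
    s = np.einsum('rl,cnl->rcn', f, m)        # Eq. (9): proposal-memory similarity
    phi_m = softmax(s.max(axis=2).T, axis=0)  # Eq. (10): max over N, class softmax
    return (1 - mu) * phi_prev + mu * phi_m   # Eq. (11): combined confidence
```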

Memory-Aware Proposal Sampling Algorithm
After mining object instances, we further propose a memory-aware proposal sampling (MPS) algorithm to effectively sample positive and negative proposals from R. Some methods [15][16][17][18][19]26] simply divide R into two parts by computing the IoU with B^k: highly overlapped proposals are taken as positive samples and the rest as negative samples, ignoring the imbalance between positive and negative samples caused by the overwhelming number of negative proposals. To alleviate this problem, we leverage the memory bank to select more positive proposals and remove some unreliable negative proposals. We first calculate the IoU between B^k and R to separate R into two parts R^k_1 and R^k_2 in Equation (12): proposals whose IoU with B^k exceeds a pre-defined threshold form R^k_1, and the rest form R^k_2.
Then, we extract the feature representations F_{R^k_2} of R^k_2 from F_V. For each positive class c, we compute the similarity between F_{R^k_2} and M_c, and use Equation (13), p_c = argmax_{p ∈ R^k_2} max_n sim(f_p, f_{c,n}), to choose the most similar proposal p_c and add it to R^k_1 as a positive sample.
For the remaining proposals in R^k_2, we compute the similarity S_{R^k_2} between R^k_2 and the background representations M_{C+1}, and sort S_{R^k_2} from high to low. Finally, we remove the last 1/λ low-similarity proposals from R^k_2 to obtain the negative samples. More details can be found in Algorithm 4.

Algorithm 4 Memory-aware proposal sampling algorithm.

Input: the pre-generated proposals R, the image label Y, the memory bank M, the mined object instances (B^k, ψ^k_{B^k}), and the proposal feature vectors F_V.
(I) Initialize the positive samples R_pos = ∅ and the negative samples R_neg = ∅.
(II) Calculate IoU(B^k, R).
(III) Separate R into two parts R^k_1 and R^k_2 using Equation (12).
(IV) For each positive class c, select the most similar proposal p_c from R^k_2 by Equation (13).
(V) Sort the remaining proposals in R^k_2 by their similarity to the background from high to low.
(VI) Obtain R_neg by removing the last 1/λ low-similarity proposals from R^k_2.
Output: R_pos, R_neg.
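The IoU split and negative pruning of MPS can be sketched as follows in NumPy; the IoU threshold of 0.5 and the function interface are assumptions for illustration.

```python
import numpy as np

def iou(a, b):
    """IoU between one box a and an array of boxes b, (x1, y1, x2, y2) format."""
    x1 = np.maximum(a[0], b[:, 0]); y1 = np.maximum(a[1], b[:, 1])
    x2 = np.minimum(a[2], b[:, 2]); y2 = np.minimum(a[3], b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter)

def mps_sample(instance_box, proposals, bg_sim, iou_thr=0.5, lam_inv=0.25):
    """Memory-aware proposal sampling (sketch of the Eq. (12) split and pruning).

    instance_box: (4,) one mined instance box from B^k.
    proposals:    (|R|, 4) pre-generated proposal boxes.
    bg_sim:       (|R|,) similarity of each proposal to the background M_{C+1}.
    lam_inv:      removal coefficient 1/lambda (1/4 as in the paper).
    """
    overlaps = iou(instance_box, proposals)
    pos = np.where(overlaps >= iou_thr)[0]  # Eq. (12): R_1, positive samples
    neg = np.where(overlaps < iou_thr)[0]   # Eq. (12): R_2, negative candidates
    # sort negatives by background similarity (high -> low) and drop the
    # trailing 1/lambda fraction that looks least like background
    order = neg[np.argsort(-bg_sim[neg])]
    keep = order[: max(1, int(len(order) * (1 - lam_inv)))]
    return pos, keep
```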

Test
After training, only the instance refined heads are used for testing. We perform the same operations as our baseline method [15]. We average the outputs of all refined heads to generate the final detection results.

Experiments
In this section, we first introduce experimental data and evaluation criteria, and elaborate on experimental details. Then we validate the advantages of our method by comparing with some recent methods. Finally, we conduct extensive ablation experiments to demonstrate the effectiveness of our method.

Datasets and Evaluation Measures
We conduct experiments on the PASCAL VOC2007 [40] and VOC2012 [41] datasets, which are widely used in the weakly supervised object detection setting [14][15][16][17][18][19][20][21][22][25][26][27]. The PASCAL VOC2007 dataset contains 9962 images belonging to 20 categories, divided into three sets: train, val, and test. Following the widely used WSOD setting, the trainval set (5011 images) is used for training. The PASCAL VOC2012 dataset has 22531 images split into train, val, and test sets; its trainval set has 11540 images for training. It is important to note that all experiments use only image-level labels for training. For evaluation, there are two measures: mean average precision (mAP [40]) and correct localization (CorLoc [42]). mAP follows the standard PASCAL VOC protocol, which first computes the average precision (AP) for each class and then averages over all classes; the AP for each class is the area under the precision-recall curve. The mAP is used to evaluate performance on the test set. The second metric, CorLoc, measures the localization accuracy on the trainval set: for each class, CorLoc is the ratio of images in which at least one object is correctly localized. Both mAP and CorLoc are based on the PASCAL criterion: an object is considered successfully detected when the intersection over union (IoU) between the ground-truth and predicted boxes is greater than 0.5.
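For concreteness, the per-class AP (the area under the precision-recall curve) can be computed as in this simplified sketch; it follows the VOC-style all-point interpolation but is not the official evaluation code.

```python
import numpy as np

def average_precision(recalls, precisions):
    """Area under the precision-recall curve (VOC-style all-point AP).

    recalls, precisions: arrays over detections sorted by descending score.
    A simplified sketch of the PASCAL protocol, not the official devkit.
    """
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # make the precision envelope monotonically decreasing, right to left
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # integrate precision over the recall steps where recall changes
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```

A detector that recovers every ground-truth box with no false positives scores an AP of 1.0; mAP then averages this quantity over the 20 VOC classes.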

Experimental Details
All experiments are performed on the Detectron2 (https://github.com/facebookresearch/detectron2) deep learning framework with 4 NVIDIA GTX 1080ti GPUs. Following our baseline method OICR [15], we use the VGG16 [2] model pre-trained on the ImageNet dataset [43] as our backbone. For pre-generated proposals, we use the multiscale combinatorial grouping (MCG) method [38] to generate approximately 2000 proposals per image. During the training phase, we set the learning rate to 0.001 for the first 28 epochs and divide it by 10 for the next 12 epochs. In addition, we set the momentum and weight decay to 0.9 and 0.0005, respectively. The mini-batch size is set to 4, i.e., one image per GPU. For data augmentation, we use 5 scales {480, 576, 688, 864, 1200} to randomly resize the shortest side of the image, keeping the longest side no larger than 2000; random horizontal flips are also used. During the test stage, we average the outputs over all augmented data to generate the final detection results. For the hyperparameters of our method, we set K = 3, N = 5, µ = 0.1, 1/λ = 1/4, and T_1 = T_2 = 0.5. All settings are the same for the PASCAL VOC2007 and VOC2012 datasets.
For the PASCAL VOC2007 dataset, we present the detection performance (mAP) and localization accuracy (CorLoc) in Tables 1 and 2, respectively. In terms of mAP, our method ICL achieves a detection performance of 55.4%, a significant improvement (about 14.2%) over our baseline OICR [15] (41.2%). Our method also outperforms the methods [16][17][18][19] that exploit segmentation information to learn instance representations. For example, compared with [17], our method has an advantage of about 4.8%. Furthermore, our method also shows improvements (about 1.9%) over the recent methods SLV [27] and D-MIL [25]. In terms of CorLoc, our method ICL achieves 74.0% localization accuracy. Compared with our baseline (60.6%), our method improves the performance by about 13.4%. Compared to the segmentation-assisted methods WS-JDS [16] and SDCN [19], our method improves the performance by more than 7.2%. In addition, our method also shows clear advantages (more than 3%) over the recent methods SLV [27] and D-MIL [25]. For the PASCAL VOC2012 dataset, we show both detection performance (mAP) and localization accuracy (CorLoc) in Table 3. Our method achieves 50.1% mAP and 70.4% CorLoc, improvements of 12.2% and 8.3% over the baseline, respectively. Compared to some recent methods [25,27], our method also brings gains: in terms of mAP, it outperforms [27] and [25] by 0.9% and 0.5%, respectively; in terms of CorLoc, it brings gains of 1.2% and 0.3%, respectively. These results further demonstrate the effectiveness of our method.

Ablation Study
In this part, we conduct extensive experiments to further discuss the effects of main components of our method. Without loss of generality, all experiments are performed on the PASCAL VOC2007 dataset.
The effect of the IMU algorithm. We first analyze the effect of the IMU algorithm on our method ICL. In Table 4, after removing IMU, our method achieves 53.6% mAP and 72.9% CorLoc. These are reductions of 1.8% in mAP and 1.1% in CorLoc, which proves the effectiveness of the instance-diverse memory updating algorithm. Furthermore, we analyze the effect of the number of feature vectors N on the IMU algorithm in Figure 3. Both mAP and CorLoc first increase and then decrease as N increases. When N is too small, the memory bank can hardly capture the diversity of instance representations; when N is too large, large internal differences arise within each class during instance representation learning. In this paper, we recommend setting N = 5 to balance the number of stored instance vectors. The effect of the MIM algorithm. As shown in Table 4, MIM brings 5.7% and 4.4% gains to ICL in mAP and CorLoc, which shows the effectiveness of the memory-aware instance mining algorithm. In addition, we analyze the effect of memory on MIM by setting different combination coefficients µ in Figure 4. As µ varies from 0 to 1, both the detection performance and the localization accuracy are highest at µ = 0.1, which demonstrates that introducing the similarity between the memory features of previous training data and the proposal features is useful for mining effective instances. When µ becomes larger, the memory from previous images may hinder the learning of new instances from the current image, resulting in performance degradation. Therefore, we set µ = 0.1 in this paper. The effect of the MPS algorithm. Removing MPS from ICL, our method achieves 52.9% detection performance (mAP) and 71.4% localization accuracy (CorLoc).
These are reductions of 2.5% in mAP and 2.6% in CorLoc, which shows the effectiveness of the memory-aware proposal sampling algorithm. In addition, we analyze the effect of the removal coefficient 1/λ on MPS in Table 5. We achieve the best performance at 1/λ = 1/4. Increasing 1/λ further may remove too many negative samples and harm the training of the detector, while a too-small 1/λ cannot balance the positive and negative samples. Performance on the COCO metrics. In Table 4, we also analyze the contribution of each component under the COCO metrics [44]. The performance of each component on AP, AP_50, and AP_75 is similar to that under the PASCAL metrics. For AP_S, AP_M, and AP_L, objects are divided into small, medium, and large sizes for evaluation. Our method ICL achieves the best performance on large objects, while removing the MPS algorithm yields better performance on small and medium objects. Compared with small and medium-sized objects, which are harder to perceive, the MPS algorithm is more conducive to sampling region proposals of large objects.
The analysis of the training process. In Figure 5, we provide training loss curves to verify the rationality of our method. The loss curves of the refined heads first rise and then decrease to convergence. The rising phase of the loss is due to the loss weight, which is the confidence of the mined object instances. At the beginning of training, the low discrimination of the model makes the confidence of object instances very low (almost close to 0). As the model's capability increases, the confidence starts to increase and so does the loss. Since the confidence range is [0, 1], the loss reaches a maximum value and then starts to decrease as the model's generalization improves, until convergence. In Figure 6, we also show the performance of the model during training. Both mAP and CorLoc continue to increase, which further demonstrates the effectiveness of our method. Qualitative results. In Figure 7, we provide qualitative results to compare the proposed ICL with our baseline more intuitively. On the trainval set of the PASCAL VOC2007 dataset, we compare the learned object instances in Figure 7a. For the simple image in the first column, the baseline method can learn effective information about the car. For the horse in the second column and the cow in the third column, when the foreground and background are similar or the objects are occluded, the instances learned by the baseline method may contain more background. Our method ICL can learn more reliable object instances guided by instance correlations. On the test set, we compare the detection results in Figure 7b. Since the baseline method is more easily disturbed by background information during training, its detection results also contain more background, as in the first two columns. Our method better locates object boundaries. When objects interact, as in the third column, our method also provides better detection results.
For a more comprehensive analysis of our method, we present failure cases in the last column. For example, for the small aeroplane in Figure 7a, it is difficult for our method to learn its instance representation. Our method also fails to detect the highly overlapping sheep and the distant small sheep in Figure 7b.

Conclusions
In this paper, we propose an instance-level contrastive learning (ICL) framework that guides the weakly supervised detector to learn entire instance representations by constructing instance correlations with other images. To store diverse object instance representations in a memory bank, we propose an instance-diverse memory updating (IMU) algorithm. With the help of the memory bank, we further propose a memory-aware instance mining (MIM) algorithm to effectively mine object instances. To alleviate the imbalance between positive and negative proposals, we propose a memory-aware proposal sampling (MPS) algorithm. We conduct extensive experiments on the PASCAL VOC2007 and VOC2012 datasets to verify the effectiveness of our method.
Our proposed method mines object instance representations from other images and stores them in a memory bank to guide instance learning on the current image. If the memory contains noisy representations, it will make the learned object instances inaccurate. The performance of weakly supervised detectors is also limited by the quality of the stored representations. In order to mine more reliable instance representations, our future studies will explore contextual information of region proposals or segmentation information of images to perceive object boundaries and locate object instances accurately.