In this section, we first state the WSIS problem and then describe the proposed WS-RCNN framework. We then detail its key components, namely the Attention-Guided Pseudo Labeling and the Entropic Open-Set Loss. Finally, we present some remarks.
  3.1. Problem Statement
Given a set of classes of interest and a training set of images, each paired with a multi-class label vector, the task of weakly supervised instance segmentation (WSIS) in our work can be stated as segmenting, for an input test image, all the object instances belonging to the classes of interest. This problem setting differs intrinsically from general instance segmentation [2,4] in that only image-level labels, and no pixel-wise instance annotations, are available for training, which makes the task very challenging.
Like general instance segmentation, WSIS can also follow a proposal-based paradigm [8,13,33], which can be epitomized as the three-step pipeline mentioned earlier. These approaches can be viewed as retrieving true object instances from a pool of proposals according to the assigned scores. Central to them is proposal scoring, i.e., how to appropriately assign classification scores to proposals. One commonly used strategy for proposal scoring is to exploit the well-established localization ability of CNNs [8,16,18]. Specifically, the training set with image-level labels is first used to train an image-level CNN classifier, from which a collection of class-specific attention maps is derived to assign classification scores to the proposals. For this purpose, the attention maps should preserve object shapes, which is, however, a difficult perceptual grouping task. In addition, the hand-crafted scoring rules adopted by existing methods are also limited. These facts motivate us to propose the WS-RCNN framework.
  3.2. The Proposed WS-RCNN Framework
The basic idea of WS-RCNN is to deploy a deep network that learns to score proposals under weak supervision, instead of relying on heuristic proposal scoring strategies. One major obstacle to this goal is the absence of the proposal-level labels needed for training. To overcome it, we develop an effective strategy, called Attention-Guided Pseudo Labeling (AGPL), which takes advantage of the attention maps associated with the image-level CNN classifier to infer proposal-level pseudo labels. Furthermore, we introduce an Entropic Open-Set Loss (EOSL) to handle the background issue in training and further improve the robustness of our framework. In the following, we first present an overview of WS-RCNN and then detail the AGPL strategy and the EOSL loss.
Network Architecture: The overall network architecture of WS-RCNN is shown in Figure 2. The input image is first fed into a proposal generator (using the off-the-shelf method [51,52] in our implementation) to obtain a set of segment proposals, each of which is a binary mask of the same size as the image representing a segment of arbitrary shape (rather than a regular bounding box). The image then goes through a backbone CNN for feature extraction, yielding M feature maps. Afterwards, the network bifurcates into two branches, i.e., the proposal scoring branch and the pseudo labeling branch. Note that these two branches share the same backbone CNN.
In the proposal scoring branch, the features corresponding to each individual proposal are extracted. A standard operation for this task is RoIAlign [2], widely used in two-stage object detectors; however, it cannot be directly applied to our case since it is designed for bounding-box proposals. Therefore, we modify RoIAlign to adapt it to segment proposals, resulting in the SegAlign operation (see details below). For each proposal, the corresponding features are extracted from the feature maps and aligned via SegAlign to a canonical grid of h × w cells with M channels (we use h = w = 7; M = 512 in this paper). This is followed by three fully-connected layers (FCs, with 4096, 4096 and C nodes, respectively) and a softmax layer to obtain the proposal-level classification score.
The pseudo labeling branch is executed for training only, where the feature maps are followed by an image-level classifier. A set of class-specific attention maps is then extracted from this classifier, where the map for class c reflects the spatial probability of occurrence of object instances belonging to that class. Among the possible choices of attention maps, we adopt the Class Peak Responses [8] in our implementation due to their excellent localization ability. These attention maps (as well as the image-level label) are then used by AGPL to infer the proposal-level pseudo class labels, each of which is a one-hot vector that can also indicate the background class.
Training Strategy: We adopt a two-phase training strategy to train the WS-RCNN model. In the first phase, we train the image-level classifier in the pseudo labeling branch, which is initialized with the model pre-trained on ImageNet. Proposal-level pseudo labels are then inferred from the trained image-level classifier using AGPL. In the second phase, we train the proposal scoring branch, where the backbone CNN is reinitialized with the model pre-trained on ImageNet. We validate the effectiveness of this two-phase training strategy by comparative experiments in Section 4.3. Note that, since there usually exist significantly more background proposals than target-class ones after pseudo labeling, we always make their numbers identical by uniformly sampling background proposals.
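The balancing step at the end of the training strategy can be illustrated with a minimal sketch. It assumes, purely for illustration, that pseudo labels are stored as integers with 0 denoting the background class; the function name `balance_proposals` and its parameters are hypothetical and not part of the original implementation.

```python
import random

def balance_proposals(labels, background=0, seed=None):
    """Keep every target-class proposal and an equal number of
    uniformly sampled background proposals.

    labels: list of integer pseudo labels, one per proposal
            (assumption: `background` marks background proposals).
    Returns the sorted indices of the proposals kept for training.
    """
    rng = random.Random(seed)
    target = [i for i, y in enumerate(labels) if y != background]
    backgr = [i for i, y in enumerate(labels) if y == background]
    # Never ask for more background proposals than are available.
    k = min(len(target), len(backgr))
    kept_background = rng.sample(backgr, k)  # uniform, without replacement
    return sorted(target + kept_background)

labels = [0, 0, 0, 1, 0, 2, 0, 0]
kept = balance_proposals(labels, seed=0)
```

In this toy example the two target-class proposals (indices 3 and 5) are always kept, together with two background proposals drawn at random.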
Training Loss: For the training of the image-level classifier (the first-phase training), we use the given image-level labels and the conventional cross-entropy loss for multi-label classification to establish the training loss.
For the training of the proposal scoring branch (the second-phase training), consider an image in the training set, its image-level label and the proposals obtained from it. For each proposal, we have the classification score predicted by the proposal scoring branch and the pseudo class label inferred by the pseudo labeling branch using AGPL. Given all these, the training loss for the proposal scoring branch can be established by
        
where the per-proposal loss term is the proposed Entropic Open-Set Loss (see details in Section 3.4).
SegAlign: As shown in Figure 3, a segment mask in the image has a corresponding receptive field on the feature maps. SegAlign extracts the features corresponding to this receptive field and maps them to canonical feature maps; it is essentially RoIAlign modified to handle segment masks. Concretely, suppose the receptive field is bounded by a rectangle, a bilinear transform maps the spatial coordinates of the canonical feature maps to this rectangle, and g is the bilinear interpolation function over the feature maps. The SegAlign operation can then be defined by
        
Note that we drop the channel dimension of the feature maps above for notational simplicity.
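As a rough illustration of the idea behind SegAlign, the sketch below resamples a single-channel feature map onto a canonical grid using bilinear interpolation inside the bounding rectangle of the segment. It is a sketch under several assumptions that are not taken from the text: the mask is already given at feature-map resolution, one sample is taken per output cell, cell centers are mapped linearly into the rectangle, and samples falling outside the segment are set to zero. The names `bilinear` and `seg_align` are hypothetical.

```python
import math

def bilinear(grid, y, x):
    """Bilinearly interpolate a 2D list `grid` at the real-valued point (y, x)."""
    H, W = len(grid), len(grid[0])
    y = min(max(y, 0.0), H - 1.0)
    x = min(max(x, 0.0), W - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    dy, dx = y - y0, x - x0
    top = grid[y0][x0] * (1 - dx) + grid[y0][x1] * dx
    bottom = grid[y1][x0] * (1 - dx) + grid[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def seg_align(feat, mask, out_h=7, out_w=7):
    """Resample `feat` inside the bounding rectangle of `mask` onto an
    out_h x out_w grid (assumption: `mask` is a 0/1 grid of the same size
    as `feat`, and features outside the segment are zeroed)."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:  # empty segment: nothing to align
        return [[0.0] * out_w for _ in range(out_h)]
    top, bottom, left, right = rows[0], rows[-1], cols[0], cols[-1]
    out = []
    for i in range(out_h):
        # Map the center of output cell i into the bounding rectangle.
        y = top + (i + 0.5) * (bottom - top + 1) / out_h - 0.5
        out_row = []
        for j in range(out_w):
            x = left + (j + 0.5) * (right - left + 1) / out_w - 0.5
            inside = bilinear(mask, y, x) >= 0.5
            out_row.append(bilinear(feat, y, x) if inside else 0.0)
        out.append(out_row)
    return out

feat = [[1, 2], [3, 4]]
mask = [[1, 1], [1, 0]]  # L-shaped segment
aligned = seg_align(feat, mask, out_h=2, out_w=2)
```

For the L-shaped segment above, the three cells covered by the segment keep their feature values, while the uncovered corner is zeroed. A multi-channel version would apply the same sampling to each of the M feature maps.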
  3.3. Attention-Guided Pseudo Labeling
AGPL leverages the localization ability of CNNs and the spatial relationship among proposals to achieve pseudo labeling. As shown in Figure 4, given the class label vector of an image, its segment proposals and its class-specific attention maps, AGPL can be outlined as follows:
(1) For each target class c (i.e., each class marked as present in the label vector), all the local maxima (peaks) are identified from the corresponding attention map, each given by its pixel coordinates. For each peak, we pick all the proposals that spatially contain this point, which are then averaged and thresholded to obtain a support mask as follows
        
where the average is taken over the proposals picked for the peak, p and q are pixel indices, and the threshold is a parameter (its value in our implementation is studied in Section 4.4). The resulting peaks and associated support masks are then used to admit proposals belonging to the class c.
(2) Sort all the peaks in descending order of their values in the attention maps. Then, for each peak in this order and its associated support mask, the proposals that overlap sufficiently with the support mask are labeled as the class c of the peak, i.e., a proposal is assigned to c if
        
where IoU stands for the Intersection-over-Union operation. Note that each proposal can be assigned to one class only during the ordered labeling. For clarity, we summarize the AGPL algorithm above in Algorithm 1.
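The two steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: masks are represented as sets of (row, column) pixels, the peaks are assumed to have been detected beforehand, class labels are integers with 0 as background, and the two thresholds `avg_thresh` and `iou_thresh` (with non-strict comparisons) are hypothetical parameters whose actual values are not given here.

```python
def iou(a, b):
    """Intersection-over-Union of two masks given as sets of pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def support_mask(peak, proposals, avg_thresh):
    """Average the proposals containing `peak` and threshold the result."""
    picked = [p for p in proposals if peak in p]
    if not picked:
        return set()
    counts = {}
    for p in picked:
        for pixel in p:
            counts[pixel] = counts.get(pixel, 0) + 1
    return {px for px, n in counts.items() if n / len(picked) >= avg_thresh}

def agpl(peaks, proposals, avg_thresh=0.5, iou_thresh=0.5, background=0):
    """peaks: list of (attention_value, class_id, (row, col)) tuples.
    proposals: list of pixel sets. Returns one integer label per proposal."""
    labels = [background] * len(proposals)
    unlabeled = set(range(len(proposals)))
    # Visit peaks from the strongest to the weakest attention value.
    for _, cls, peak in sorted(peaks, key=lambda t: t[0], reverse=True):
        support = support_mask(peak, proposals, avg_thresh)
        for k in sorted(unlabeled):
            if iou(proposals[k], support) >= iou_thresh:
                labels[k] = cls
                unlabeled.discard(k)  # a proposal is labeled at most once
    return labels

proposals = [
    {(0, 0), (0, 1)},
    {(0, 0), (0, 1), (1, 0), (1, 1)},
    {(3, 3)},
    {(5, 5), (5, 6)},
]
peaks = [(0.9, 1, (0, 0)), (0.4, 2, (3, 3)), (0.3, 2, (0, 1))]
labels = agpl(peaks, proposals, avg_thresh=0.6, iou_thresh=0.6)
```

In this toy example the first proposal is claimed by the strongest peak (class 1) and is therefore not relabeled by the weaker class-2 peak at (0, 1); proposals that match no support mask keep the background label.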
  3.4. Entropic Open-Set Loss
Since our WS-RCNN is proposal-based, the proposals after pseudo labeling will unavoidably contain some background proposals, i.e., those labeled as none of the target classes. It is necessary to handle these background proposals in model training; otherwise, a model trained only with samples from target classes will be distracted by the unseen background proposals in testing, which degrades its robustness. A natural solution is to add a dummy class to the model to accommodate background proposals. However, since the background class is a class of “stuff”, its variance is so large that it is hard to model with any single class.
        
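Algorithm 1 Attention-Guided Pseudo Labeling (AGPL)
Input: The label vector, the segment proposals and the class-specific attention maps (associated with the image); the threshold parameter.
Output: The one-hot pseudo class labels for the proposals.
1: Initialize the pseudo labels and the peak set.
2: for all c marked as a target class in the label vector do
3:   Find the local maxima in the attention map of class c;
4:   for each local maximum do
5:     Calculate the support mask using Equations (3) and (4);
6:     Add the peak, with its support mask, to the peak set;
7:   end for
8: end for
9: Sort the peak set in the descending order of the attention values at the peaks, yielding the sorted set.
10: Initialize the set of unlabeled proposals;
11: for all peaks in the sorted set do
12:   Find the unlabeled proposals satisfying Equation (5);
13:   Set their pseudo labels to the class of the peak;
14:   Remove them from the set of unlabeled proposals;
15: end for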
To address this issue, we observe that the task of background handling here is by nature an open-set recognition (OSR) problem, which has been well studied in robust pattern recognition. Hence, we propose to introduce the Entropic Open-Set Loss (EOSL), which is a representative method for OSR [25], to address our background handling problem. The basic idea of EOSL is to treat the samples from target and background classes separately in establishing the training loss. For target classes, the standard cross-entropy loss is used, while for the background class, an entropic loss is used to encourage uniformly distributed classification scores. Since the C scores sum to 1 (they are output by the softmax layer), encouraging a uniform distribution for background proposals makes each of their scores small, so that these proposals are suppressed during the Non-Maximum Suppression (NMS) procedure. Formally, given the predicted score vector of a proposal and the corresponding one-hot pseudo label vector, the EOSL is defined by
        
To our knowledge, this is the first work which addresses the background handling problem in object detection/instance segmentation from the perspective of open-set recognition.
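To make the behavior of the loss concrete, the following minimal sketch follows the verbal description in this section: cross-entropy for a target-class proposal, and the negative mean log-score over the C classes for a background proposal, which is smallest when the scores are uniform. The constant `BACKGROUND`, the integer label encoding and the clamping constant `eps` are assumptions made for illustration; the exact normalization used in the paper's equation is not reproduced here.

```python
import math

BACKGROUND = -1  # hypothetical marker for background pseudo labels

def entropic_open_set_loss(scores, label, eps=1e-12):
    """scores: softmax outputs over the C target classes (they sum to 1).
    label: index of the target class, or BACKGROUND."""
    if label == BACKGROUND:
        # Entropic term: minimized when all C scores equal 1/C.
        return -sum(math.log(max(s, eps)) for s in scores) / len(scores)
    # Standard cross-entropy for a target-class proposal.
    return -math.log(max(scores[label], eps))

uniform = entropic_open_set_loss([0.25, 0.25, 0.25, 0.25], BACKGROUND)
peaked = entropic_open_set_loss([0.97, 0.01, 0.01, 0.01], BACKGROUND)
target = entropic_open_set_loss([0.5, 0.5], 0)
```

For a background proposal, the uniform score vector yields a lower loss than the peaked one, so training pushes background proposals toward low maximum scores, which is what allows them to be discarded at inference time.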