Unsupervised Object Segmentation Based on Bi-Partitioning Image Model Integrated with Classiﬁcation

The development of convolutional neural networks for deep learning has significantly contributed to the fields of image classification and segmentation. High performance in supervised image segmentation requires a large amount of ground-truth data. However, producing these data is costly, so unsupervised approaches are actively being studied. The Mumford–Shah and Chan–Vese models are well-known unsupervised image segmentation models. However, because they are based on pixel intensities, the Mumford–Shah model and the Chan–Vese model cannot separate the foreground and background of an image. In this paper, we propose a weakly supervised model for image segmentation based on these segmentation models (the Mumford–Shah model and the Chan–Vese model) and classification. The segmentation model (i.e., the Mumford–Shah model or the Chan–Vese model) finds a base image mask for classification, and the classification network uses the mask from the segmentation model. With the classification network, the output mask of the segmentation model changes in the direction that increases the performance of the classification network. In addition, the mask naturally comes to distinguish the foreground and background of images. Our experiments show that our segmentation model, integrated with a classifier, can segment the input image into the foreground and the background using only the image's class label, which is an image-level label.


Introduction
Presently, automatic image segmentation is required to obtain accurate information about each region of an image, because the number of images continues to surge. With the development of the convolutional neural network (CNN) [1] in deep learning, there are many works on image segmentation. However, to achieve high-performance segmentation results in a supervised manner [2], a CNN requires a large amount of ground-truth data that marks the area of the objects at the pixel level. Creating each of these annotations is cumbersome and requires much time and other resources. Some works therefore solve image segmentation problems in unsupervised manners [3]. The Mumford-Shah functional [4] and the Chan-Vese algorithm [5][6][7] are well-known models for classical unsupervised image segmentation problems.
However, since the Mumford-Shah functional and the Chan-Vese algorithm rely on pixel intensities and their objective functions are non-convex, these models cannot distinguish between the foreground and background of images, and which region ends up as foreground or background depends on the initialization of the network's weights. To solve this problem, we propose a segmentation model that is integrated with a classifier for image segmentation. In our work, the segmentation model, which solves the Chan-Vese algorithm or the Mumford-Shah functional, is used for region proposals. Since these two algorithms minimize their objective functions with a curve in the level-set method [8], which evolves an initial surface defined as a function by minimizing the energy function, they can detect the edges of the images. With this segmentation model, we can obtain more precise boundaries. However, as mentioned above, the Chan-Vese algorithm and the Mumford-Shah functional alone cannot distinguish the foreground from the background of the image. Therefore, the regions divided by these segmentation models alone have no meaning. To solve this problem, we integrate the segmentation model with the classifier. With the classifier, the meaningless regions detected by the output mask of the segmentation model become foreground regions. Furthermore, we propose a simple loss that can distinguish the background more precisely, with which we obtain more meaningful foreground regions. One thing to note is that with our classification loss alone, the classifier can extract the foreground regions, but the regions' boundaries are not precise. Therefore, the segmentation model and the classifier complement each other.
For our experiment, we used the dog-and-cat dataset from Kaggle [9] and PASCAL VOC 2012 [10] as datasets. The contributions of this work are as follows:
1. We integrate the segmentation model with a classifier to solve the image segmentation problem.
2. We use the Mumford-Shah functional and the Chan-Vese algorithm for the segmentation model. Furthermore, with the segmentation model, we can achieve more accurate boundaries.
3. For the classifier, we propose a loss function that can meaningfully distinguish between the background and the foreground.
In the remainder of this paper, we briefly present related works on image segmentation in Section 2 and then explain our network's structure with loss in Section 3. Next, we show our segmentation results in Section 4. Finally, we summarize and conclude our paper in Section 5.

Image Segmentation
Image segmentation refers to assigning each pixel of an image to a particular class, and it aims to separate a given image into several meaningful regions for more manageable analysis. It has been approached in classical manners such as those of Osher et al. [8] and Lloyd et al. [11]. However, with the development of convolutional neural networks (CNNs) [1] in deep learning, many studies solve the image segmentation problem with CNNs. In this paper, we detect the foreground and the background of the input images by generating binarized segmentation masks.

Classical Image Segmentation
The classical image segmentation methods are mostly based on mathematical or statistical techniques. Tobias et al. [12] and Arifin et al. [13] use the characteristic histogram, while Ma et al. [14] approach the segmentation problem with edge and boundary detection. Furthermore, classical variational methods solve the problem with clustering, minimizing objective functions such as the Mumford-Shah functional [4].

Supervised Image Segmentation
To achieve high image segmentation performance in a supervised manner [2] with CNN-based models, UNet [15] uses skip connections between its contracting and expansive parts. Furthermore, FCN [16] replaces the fully connected layer with a fully convolutional network to preserve location information. However, even though these methods have achieved high performance, supervised learning has a considerable limitation: it must have ground truth for all training data.

Unsupervised Image Segmentation
To overcome the limitations of supervised methods, some works solve the image segmentation problem with unsupervised CNN-based methods [3]. These unsupervised methods segment images without any ground-truth data. Usually, they use the objective functions of the classical variational approach [4][5][6][7]. Similar to our work, Kim et al. [17] also minimize the piecewise-constant Mumford-Shah functional. However, the main difference is that [17] uses pixel-level labels, whereas we use image-level labels, which makes our method weakly supervised.

Weakly Supervised Image Segmentation
Although deep learning methods based on CNNs can optimize the objective functions of the classical variational methods, the results are not accurate because most of these objective functions are non-convex. Thus, many weakly supervised methods solve the image segmentation problem with much simpler objective functions and use the given ground-truth class labels for segmentation. Zhou et al. [18] use global average pooling layers instead of fully connected layers with a classifier to visualize the most prominent part of the input image. Selvaraju et al. [19] use gradients to overcome the limitation of [18] that it cannot be applied to models without global average pooling layers. Lee et al. [20] extract the class activation map from a randomly selected feature map of the classifier for their seed loss and combine it with a boundary loss, which consists of a conditional random field. More simply, Huang et al. [21] use only one class activation map for their seed loss. Wang et al. [22] construct a pixel correlation module to solve the problem that the class activation map is not consistent when the input resolution changes. References [20][21][22] use the class activation map of [18] for segmentation, but our work does not use the class activation map. Instead, we directly extract the foreground regions by generating a binary mask from the input image, not from the feature maps of the classifier, which can enhance the classification confidence of the classifier. Similarly, Zolna et al. [23] generate a binary mask for the input image, using classifiers to find all parts of the image that any classifier could use. Furthermore, Araslanov et al. [24] use the classification loss to generate a mask and refine it with a local mask refinement module called PAMR. Unlike these works, which elaborate the boundaries of the detected segmentation areas, we use the Chan-Vese model or the Mumford-Shah model.

Mumford-Shah Functional and Chan-Vese Algorithm
The Mumford-Shah functional [4] and the Chan-Vese algorithm [5][6][7] are well-known classical approaches to the image segmentation problem. These methods find optimal segmentation results by minimizing specific energy functions.

Chan-Vese Algorithm
The edge gradient-based energy functions proposed by [25][26][27] cannot segment images well when the edges are too smooth or the noise is too strong. The Chan-Vese model [5] defines a region-based energy function based on the Mumford-Shah functional [4]. The original Mumford-Shah functional is (1).
The energy function that the Chan-Vese model [5] proposes is (2):

F(c1, c2, C) = µ · Length(C) + ν · Area(inside(C)) + λ1 ∫inside(C) |u0(x, y) − c1|² dx dy + λ2 ∫outside(C) |u0(x, y) − c2|² dx dy, (2)

where C is the curve in the level-set method [8], and µ ≥ 0, ν ≥ 0, λ1, λ2 > 0 are fixed parameters. Length(C) is the length of the curve C. Area(inside(C)) is the area of the region inside C, a term added to the original Mumford-Shah functional [4]; the equation of the original Mumford-Shah functional for segmentation is (1). c1 and c2 are constants depending on C: c1 is the average value inside C and c2 is the average value outside C. Furthermore, u0 is the given input image. In this energy function, the first term controls the length of C. The second term controls the area inside C to control its size. The third and fourth terms control the difference between the piecewise-constant model's result and the input image u0. To solve (2) with the level-set method [8], Reference [5] uses the Heaviside function H and the one-dimensional Dirac measure δ0, which are defined in (3).
With H and δ0, the energy function (2) can be reformulated as (4),
where Ω is a bounded open subset of R² and φ : Ω → R is the Lipschitz function whose zero level set represents the curve C. Furthermore, c1 and c2 are calculated with (5).
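As a concrete illustration, the discretized energy (2) can be sketched in NumPy for a binary mask, approximating Length(C) by the total variation of the mask and Area(inside(C)) by its sum. This is a minimal sketch, not the paper's implementation; the function name and the default weights are hypothetical.

```python
import numpy as np

def chan_vese_energy(u0, phi, mu=0.1, nu=0.0, lam1=1.0, lam2=1.0):
    """Discrete sketch of the Chan-Vese energy (2) for a binary mask phi.

    phi == 1 marks the region inside the curve C, phi == 0 the outside.
    Length(C) is approximated by the total variation of phi and
    Area(inside(C)) by the sum of phi. All weights are hypothetical defaults.
    """
    c1 = u0[phi == 1].mean() if (phi == 1).any() else 0.0  # average inside C
    c2 = u0[phi == 0].mean() if (phi == 0).any() else 0.0  # average outside C
    # forward differences approximate |grad phi|; their sum approximates Length(C)
    length = (np.abs(np.diff(phi.astype(float), axis=0)).sum()
              + np.abs(np.diff(phi.astype(float), axis=1)).sum())
    area = phi.sum()
    fit1 = ((u0 - c1) ** 2 * phi).sum()        # fidelity inside C
    fit2 = ((u0 - c2) ** 2 * (1 - phi)).sum()  # fidelity outside C
    return mu * length + nu * area + lam1 * fit1 + lam2 * fit2
```

On a two-region image, a mask aligned with the intensity boundary yields a lower energy than a misaligned one, which is exactly the behavior the minimization exploits.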

Mumford-Shah Functional
Unlike the Chan-Vese algorithm [5], the piecewise-smooth Mumford-Shah functional [4] treats c1 and c2 as functions rather than constants. In the piecewise-smooth Mumford-Shah functional, the c1 and c2 in (4) and (5) are rewritten as R1(x, y) and R2(x, y), where R1, R2 : Ω → R.

Region of Interest Detection
Detecting the regions of the input image that stimulate a deep learning network is called region-of-interest detection. Li et al. [28] make soft masks for input images with a classification loss. Fong et al. [29] make spatial perturbation masks that maximally affect a model's output. More simply, Singh et al. [30] use randomly generated hidden image patches for each image at every iteration. Fong et al. [31] optimize an objective function with binarized masks that resize the area of the input images. Furthermore, Jaderberg et al. [32] transform the input image with an affine transformation matrix via the classification loss.

Method
In this section, we describe two network structures and the two corresponding loss functions for each network, which segment the input image into the foreground and the background. Each network consists of a segmentation stage and a classification stage that classifies the segmentation results of the segmentation stage. As mentioned, the segmentation results of the Mumford-Shah model and the Chan-Vese model can detect the edges of objects very precisely, but the results do not correspond to the foreground and the background. Since these two models segment the input images by pixel intensity, the segmented regions consist of pixels with similar intensity. In our work, we use a classifier, as in [33][34][35][36], to give each segmented region, which is the result of the segmentation stage, the meaning of foreground or background. As shown in [18,19], some areas of the image, not the entire area, have a significant impact on the classifier. Therefore, when we train the classifier on the multiplication of the outputs of the segmentation stage and the input image, the meaningless regions become the foreground and the background. Notice that the outputs of the segmentation stage are binarized to 0 and 1 to detect the foreground area of the input image. When we denote the mask by φ and the input image by I, φ ⊙ I means the foreground and (1 − φ) ⊙ I means the background, where ⊙ denotes element-wise multiplication. However, we use (1 − φ) ⊙ I as the foreground regions and φ ⊙ I as the background regions, because this shows better results experimentally. In our experiment, the classifier is trained to classify (1 − φ) ⊙ I as the foreground (i.e., the image-level label) and φ ⊙ I as the background. For the foreground, we use the cross-entropy loss between the ground-truth image class label and the output of the classifier. Furthermore, for the background, we construct a loss, which we explain in (13).
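The masking convention above can be sketched in a few lines of NumPy. This is only an illustration of the foreground/background split described in the text; the function name is hypothetical.

```python
import numpy as np

def split_foreground_background(image, mask):
    """Split an image with a binary mask, following the paper's convention:
    (1 - phi) * I is used as the foreground and phi * I as the background.

    image: (H, W, C) float array; mask: (H, W) array of {0, 1} values.
    """
    phi = mask[..., None].astype(image.dtype)  # broadcast mask over channels
    foreground = (1.0 - phi) * image           # kept where the mask is 0
    background = phi * image                   # kept where the mask is 1
    return foreground, background
```

By construction, the two outputs sum back to the input image, so every pixel is assigned to exactly one of the two regions.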

Network Structure
For our experiment, we exploit the network structure of [23]. With this structure, we do not need an additional encoder because all the decoders directly use the classifier's feature maps. Furthermore, for the classifier, we use the ResNet-18 [37] structure. We experiment with two network structures. The first network, Figure 1, is composed with the Chan-Vese energy function, and the second network, shown in Figure 2, is composed with the piecewise-smooth Mumford-Shah energy function. In Figures 1 and 2, ô, f̂, and b̂ are the classification scores of the original input image, the foreground image, and the background image, respectively. Notice that because we use the same classifier for the original image, the foreground image, and the background image, like Siamese networks [38], which share their weights, the classifier's weights are updated by ô, f̂, and b̂. We denote the classifier as classifier A in Figures 1 and 2. φ is the generated binary mask and 1 − φ is the reversed generated mask. Furthermore, in Figure 2, F is the foreground matrix and B is the background matrix for the calculation of the piecewise-smooth Mumford-Shah functional loss (10).

Network with Chan-Vese Energy Function
The mask φ is a one-channel image with the same height and width as the input image. Furthermore, since we use the sigmoid function instead of the Heaviside function (3), the mask takes only the values 0 and 1 when the regularization term |1 − φ| is used together. The upsampled feature maps from classifier A, which is used for the original input image, are concatenated and passed through convolutional layers. The output of the last convolutional layer is a mask that shows the background of the image. To get the foreground of the image, we reverse the mask by subtracting it from 1. The foreground image and the background image pass through the same classifier used for the original image. Classifier A classifies the original image, the foreground, and the background of the input image. φ and 1 − φ are used for the Chan-Vese loss (7) and (8). b̂ and f̂ are used for the classification loss (14).

Network with Piecewise-Smooth Mumford-Shah Energy Function
As in Figure 1, we use ResNet-18 for classifier A, and the mask φ is a one-channel image with the same height and width as the input image. The only difference is that there are three decoders. The first is for the mask φ, the second is for a foreground matrix F, and the last is for a background matrix B. The foreground matrix F and the background matrix B are three-channel images with the same height and width as the input image. φ, 1 − φ, F, and B are used for the Mumford-Shah loss (10). b̂ and f̂ are used for the classification loss (14).

Loss Function
Our total loss (6) can be divided into two parts. The first part is a segmentation loss for the segmentation stage, and the second part is a classification loss for the classification stage.
L_total_loss = αL_segmentation + L_classification. (6)

L_segmentation uses either the energy function of the Chan-Vese algorithm (4) or the energy function of the Mumford-Shah functional, neither of which calculates the mean values of each region, with some modification for our deep learning network. L_classification is composed of classification losses for the foreground region and the background region.
L_segmentation

• Chan-Vese Energy Function for Segmentation: To apply the Chan-Vese algorithm to our network, we modify (4) into (7), where I is the input image, φ is the mask, and ∇φ is the derivative of the mask computed with forward differences.
Furthermore, c1 and c2 are calculated with (8): c1 is the mean value of the first region, and c2 is the mean value of the second region.
In (7), the first term and the second term divide the input image into two regions of similar pixel intensity. With 1 − φ(x, y), which only takes the values 0 and 1, and the constant mean value c1, the first term considers only the first region regardless of the second region, and the first region is grouped into pixels of similar value. Similarly, with φ(x, y) and the constant mean value c2, the second term considers only the second region regardless of the other region, and the pixel values in the second region are similar. The third term controls the size of the mask's regions, and the fourth term controls the noise of the mask. We call these loss terms foreground fidelity, background fidelity, mask region regularization, and mask smooth regularization, respectively. For a given mini-batch of the training set {(I_1, y_1), ..., (I_N, y_N)}, where y_i is the ground-truth image-level label (i.e., image class label) of input image I_i, we set (7) as (9). E(I, φ, c1, c2) is the sum of the foreground fidelity term and the background fidelity term, R(φ) is the regularization of φ, and M is the size of the mini-batch.
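The mean values in (8) can be sketched for a soft mask φ ∈ [0, 1] as weighted averages. This is a minimal NumPy sketch; the function name and the eps guard against empty regions are assumptions, not part of the paper.

```python
import numpy as np

def region_means(image, phi, eps=1e-8):
    """Sketch of (8): region means for a soft mask phi in [0, 1].

    c1 averages the pixels weighted by (1 - phi), the first region,
    and c2 averages the pixels weighted by phi, the second region.
    eps is a hypothetical guard against division by an empty region.
    """
    w1 = 1.0 - phi
    c1 = (w1 * image).sum() / (w1.sum() + eps)
    c2 = (phi * image).sum() / (phi.sum() + eps)
    return c1, c2
```

With a mask that covers one intensity region exactly, c1 and c2 recover the two region means, which is what makes the fidelity terms in (7) vanish for a well-aligned mask.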
• Piecewise-smooth Mumford-Shah Energy Function for Segmentation: As with the relationship between (4) and the traditional piecewise-smooth Mumford-Shah energy function, the difference between the loss functions of our two networks is that the constant values c1 and c2 of (7) are changed to F(x, y) (i.e., the foreground matrix) and B(x, y) (i.e., the background matrix). The loss function is (10).
The terms in L_segmentation_MumfordShah are, in order, foreground fidelity, background fidelity, mask region regularization, mask smooth regularization, foreground smooth regularization, and background smooth regularization. With 1 − φ(x, y) and the foreground matrix F(x, y), the first term considers only the regions kept by 1 − φ(x, y) regardless of the other regions, and these regions come to resemble the same regions of the input image. As with the first term, the regions of the input image kept by the second term depend on φ(x, y). The foreground smooth regularization and the background smooth regularization work with the foreground fidelity and the background fidelity, respectively, to control the smoothness of the foreground matrix and the background matrix. These terms work closely with the mask smooth regularization and adjust the smoothness of the mask. The effects of the mask region regularization and the mask smooth regularization are the same as in (7). Furthermore, we set (10) as (11), where E(I, φ, F, B) is the sum of the foreground fidelity term and the background fidelity term, and R(φ, F, B) is the sum of the regularizations of φ, F, and B.
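The six terms just named can be sketched as follows, with smoothness approximated by a total-variation term on forward differences. This is a hedged sketch of the structure of (10), not the paper's implementation; the function names and all weight values are hypothetical.

```python
import numpy as np

def tv(a):
    """Total-variation style smoothness term via forward differences."""
    return np.abs(np.diff(a, axis=0)).sum() + np.abs(np.diff(a, axis=1)).sum()

def mumford_shah_loss(I, phi, F, B, nu=0.01, mu=0.1, aF=0.1, aB=0.1):
    """Sketch of the piecewise-smooth loss (10) for single-channel arrays.

    Terms, in the order named in the text: foreground fidelity, background
    fidelity, mask region regularization, mask smooth regularization,
    foreground smooth regularization, background smooth regularization.
    All weights are hypothetical defaults.
    """
    fg_fit = ((1.0 - phi) * (I - F) ** 2).sum()  # F must match I where phi == 0
    bg_fit = (phi * (I - B) ** 2).sum()          # B must match I where phi == 1
    return (fg_fit + bg_fit
            + nu * phi.sum()   # mask region regularization
            + mu * tv(phi)     # mask smooth regularization
            + aF * tv(F)       # foreground smooth regularization
            + aB * tv(B))      # background smooth regularization
```

A foreground matrix that matches the image in the unmasked region gives a strictly lower loss than a mismatched one, which is the pressure that makes F and B resemble the corresponding image regions.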
Notice that the regions produced by these two models are not yet the foreground and the background of the input images. The Chan-Vese algorithm and the Mumford-Shah functional cannot distinguish the foreground and the background, because these energy functions are based on pixel intensities, like K-means clustering [11]. Because these two models divide the images into two regions regardless of the meaning of each region, we use a classifier to make each region become the foreground or the background of the input images. We use this L_segmentation as a region proposal for the next stage (i.e., the classifier).

L_classification
Our classification loss (12) consists of the foreground classification loss and the background classification loss.
L_foreground is a classification loss for the foreground image (i.e., (1 − φ) ⊙ I), and L_background is a classification loss for the background image (i.e., φ ⊙ I). The operator ⊙ denotes element-wise multiplication. ô, f̂, and b̂ are vectors of size (mini-batch size) × (number of classes), whose values are the probabilities corresponding to each class. For the foreground, we use the cross-entropy loss. However, it is impossible to find common features of the backgrounds with a classifier because the components of the backgrounds are too diverse to assign them all to the same label. Therefore, we formulate a loss (13) that can distinguish the background from the foreground.
To minimize (13), all the values of b̂, which are the probability values of belonging to each class, must be 0.5. This means that the background does not belong to any given class. In (13), the constant inside the log is set to 4 to make the minimum zero. With the foreground loss and our background loss, we set (12) as (14).
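Reading (13) as the per-class penalty −log(4 · b · (1 − b)), the minimum of zero at b = 0.5 can be checked directly. This is a sketch under that reading of the text; the function name and the clipping guard are assumptions.

```python
import numpy as np

def background_loss(b_hat, eps=1e-8):
    """Sketch of the background loss (13): -log(4 * b * (1 - b)), averaged.

    It is minimized (value 0) when every class probability in b_hat is 0.5,
    i.e., the background image is maximally uncertain for the classifier.
    The constant 4 shifts the minimum to exactly zero; the eps clipping
    is a hypothetical numerical guard.
    """
    b = np.clip(b_hat, eps, 1.0 - eps)
    return float(-np.log(4.0 * b * (1.0 - b)).mean())
```

Any confident background prediction, in either direction, is penalized: the loss grows as b̂ moves away from 0.5 toward 0 or 1.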
With (12), the regions proposed by L_segmentation gradually change into regions that carry the meaning of foreground. It is remarkable that, like (7) and (10), when (12) works together with the two regularization terms in (15), (12) alone can generate a mask of the input image I that extracts the foreground regions. We compare the results in Section 4.

Algorithm
We solve the two-stage segmentation problem with the stochastic gradient descent (SGD) [39] method. For convenience, we denote the components as follows: the classifier C, the decoder D, the mask φ, the foreground matrix F, and the background matrix B. The only difference between using (7) and using (10) for segmentation is the number of decoders D.

Chan-Vese Algorithm
First, to extract the feature maps l, which are the input of the decoder D, we use the original input image I. Using the original image for the classifier is more helpful for φ to detect the whole foreground of I.
With the feature maps l from the classifier C and the original image I, the decoder D generates φ.
With φ, we compute c1 and c2 with (8) and obtain the foreground probability f̂ and the background probability b̂. These probabilities refer to the degree to which the foreground and background belong to a particular class. The loss function of the Chan-Vese algorithm for mini-batch i at step t is as follows. The parameters of each network component are updated by the gradient descent algorithm,
where θ denotes the parameters of the networks, η(t) is the learning rate at step t, and CE is the cross-entropy loss. We show this process in Algorithm 1.
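The parameter update itself is plain SGD, θ ← θ − η∇θ, applied to the classifier and the decoder(s) alike. A one-line sketch, with a hypothetical function name:

```python
def sgd_step(params, grads, lr):
    """One plain SGD update, theta <- theta - eta * grad, applied to every
    parameter of the classifier C and the decoder(s) D alike (momentum and
    weight decay, if any, are omitted in this sketch)."""
    return [p - lr * g for p, g in zip(params, grads)]
```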

Piecewise-Smooth Mumford-Shah Algorithm
The differences from the Chan-Vese algorithm are shown in (21).
Therefore, the loss function of the Mumford-Shah algorithm for mini-batch i at step t follows, and the parameters of the networks are updated accordingly. This process is shown in Algorithm 2.

Training and Testing Details
For training, we use ResNet-18 [37] as the classifier. Since we use the classifier's feature maps as the decoder's input, we do not construct an encoder. This is more efficient because we do not need additional parameters [23]. Furthermore, for the decoder that generates the masks, we upsample each feature map of the classifier to 56 × 56 and apply a 1 × 1 convolution layer with the ReLU activation function to bring each input to 64 channels, since the last feature map of the classifier has 64 channels and the first feature map's size is 56 × 56. After concatenating all the upsampled feature maps, we apply a convolution layer (kernel size 3, stride 1, and padding 1), upsample to 224 × 224, and apply the sigmoid function to keep the mask within 0 and 1. For pre-processing, we resize the input images to 224 × 224 and normalize the input image values to the range 0 to 1 by (x_i − min(I)) / (max(I) − min(I)), where I is the input image and x_i is the value of the i-th pixel. Furthermore, for each dataset [9,10], we train and evaluate the network separately.
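The min-max normalization described above is straightforward to sketch. The resize to 224 × 224 is omitted here; the function name and the constant-image fallback are assumptions.

```python
import numpy as np

def preprocess(image):
    """Min-max normalize an image to [0, 1]: (x - min(I)) / (max(I) - min(I)).

    This sketch assumes the image has already been resized to 224 x 224;
    a constant image (max == min) is mapped to all zeros as a guard.
    """
    lo, hi = image.min(), image.max()
    if hi <= lo:
        return np.zeros_like(image, dtype=float)
    return (image - lo) / (hi - lo)
```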
Furthermore, for testing, to generate the masks we use the same pre-processing for the input images. The network for testing is shown in Figure 3. The 1 − φ is our segmentation result for an input image, which segments the input image into the foreground and the background.

Qualitative Comparisons
With the [9] dataset, we conducted three additional experiments to show the effectiveness of each loss term (the classification loss (12) and the segmentation losses (7) and (10)): generating the mask only with the classification loss, the Chan-Vese network without any classification loss, and the Mumford-Shah network without any classification loss. The experiment "generate mask only with classification loss" is conducted with the structure of Figure 1, except for the Chan-Vese loss. In this experiment, the entire loss consists only of the classification loss (12), and the generated masks are learned only by the classification loss, similarly to [23]. For the two experiments "Chan-Vese network without any classification loss" and "Mumford-Shah network without any classification loss", we build an auto-encoder network. For the encoder, we use VGG-16 [40], and the decoder is constructed symmetrically with the encoder using transposed convolutions. There are no skip connections between the encoder and the decoder. In these two experiments, only the segmentation loss (the Chan-Vese loss (7) or the Mumford-Shah loss (10)) affects the masks. Figure 4 shows the results of these three experiments and the results of our two main experiments (Figures 1 and 2). For the baseline, we use the results of CASM [23], which segments the image into the foreground and the background, because we use the same network structure as CASM. In each set of result images, the first row is the input image, the second row is the generated mask, the third row is the foreground image (i.e., (1 − φ) ⊙ I), and the last row is the background image (i.e., φ ⊙ I). When we observe Figure 4b, it is remarkable that when we use our classification loss (12) and the regularization term |1 − φ| for segmentation (i.e., Figure 1 without the Chan-Vese loss, classifying (1 − φ) ⊙ I as the foreground and φ ⊙ I as the background), we can extract the area of the foreground. However, we cannot detect the whole area of the object.
Figure 4c,d show that applying the Chan-Vese loss (7) or the Mumford-Shah loss (10) without the classification loss (12) to our simple auto-encoder can achieve a precise boundary of the object. However, the inside of the mask is not homogeneous, because (7) and (10) are minimized when pixels of similar intensity are grouped together. Furthermore, the mask cannot detect the object as the foreground, since (7) and (10) are functions of the pixel intensities. To make each region detected by the mask carry the meaning of foreground or background, we combine the Chan-Vese loss (7) or the Mumford-Shah loss (10) with the classification loss (12) to form our total loss (6). The results of our total loss, in Figure 4e,f, show that the areas previously detected as black by the mask changed to white. This means that the mask finally detects the object as the foreground. Furthermore, when we compare with the baseline in Figure 4a, we obtain a more precise boundary, since the Chan-Vese loss or the Mumford-Shah loss is used in addition to the classification loss as the segmentation loss. Figures 5 and 6 show more segmentation results obtained with our total loss (6). Figure 5 shows results on the [9] dataset using our total loss (6), where L_segmentation is composed of the Mumford-Shah functional. In Figure 5, the first column is the input images, the second column is the generated masks (i.e., 1 − φ) from our total loss (6), the third column is the foreground images (i.e., (1 − φ) ⊙ I), and the last column is the background images (i.e., φ ⊙ I). We can see that the masks detect the dogs and cats very accurately, and the boundaries follow the shapes of the dogs and cats. Furthermore, there are no parts of the dogs and cats in the background images. In Figure 6, we additionally show the ground-truth masks in the second column.
When we compare the ground-truth masks (second column) and the masks generated by our total loss (6), we obtain masks of very similar shape to the ground-truth masks even when the objects classified as foreground are small. However, when we use the PASCAL VOC dataset, we assume that one image has only one image-level label, even though there may be multiple objects, since our mask segments the image into the foreground and the background.

Quantitative Comparison
Since we assume that one image has only one image-level class label, we calculate the mean IoU (intersection over union) for each image-level class label, not for the entire dataset. Table 1 shows the mean IoU calculated only for images belonging to each class. We show the results for four classes (dog, cat, horse, and cow) with a high IoU and for two classes (car, person) with a low IoU. When we compare the baseline results and our results (fourth row and fifth row of Table 1), our total loss (6) achieves a higher mean IoU.

Figure 6. The segmentation results with the PASCAL VOC dataset. In each group, the first column is the input image, the second column is the ground-truth binary mask, the third column is the binary mask generated by our Mumford-Shah network, the fourth column is the foreground region, and the fifth column is the background region calculated by the generated binary mask.
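The per-class evaluation can be sketched as follows: IoU is computed for each image and averaged only over images carrying the same image-level label, as in Table 1. The function name and the eps guard are hypothetical.

```python
import numpy as np

def mean_iou_per_class(preds, gts, labels, eps=1e-8):
    """Mean IoU computed per image-level class label, as in Table 1.

    preds/gts: lists of binary (H, W) masks; labels: the image-level class
    of each image. IoUs are averaged only over images of the same class,
    not over the entire dataset. eps is a hypothetical empty-union guard.
    """
    out = {}
    for c in set(labels):
        ious = []
        for p, g, l in zip(preds, gts, labels):
            if l != c:
                continue
            inter = np.logical_and(p, g).sum()
            union = np.logical_or(p, g).sum()
            ious.append(inter / (union + eps))
        out[c] = float(np.mean(ious))
    return out
```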

Discussions and Conclusions
Unlike other weakly supervised image segmentation methods, we assume that one image has only one image-level label. Therefore, with our network and loss, we can only obtain the foreground and the background of the image. However, the segmentation energy functions of the Chan-Vese model and the Mumford-Shah model, revised to fit the deep learning framework, are effective for obtaining precise object boundaries. Furthermore, we do not optimize these energy functions with the conventional level-set method. Instead, we optimize them with the stochastic gradient descent method, since the elements of these energy functions are the outputs of the deep learning network. Because we use deep learning, the initial curve of the level-set method does not matter, and the number of regions is fixed at two (the foreground and the background). Furthermore, with the classifier, the meaningless regions are turned into the foreground and the background, since the classifier needs the foreground regions to classify the image. Moreover, through our background loss, which acts similarly to assigning the background image to a new class label, the classification loss (12) with the two regularizers (15) can also generate the mask. Therefore, with our work, we can obtain the foreground and the background even though the image has only an image-level class label and no other information. However, our networks have one significant limitation. Since we use the classifier's feature maps as the input of the decoder, the results change significantly with the classifier's performance. Therefore, we will construct a new network that does not depend on the classifier in a future study.