SS-CPGAN: Self-Supervised Cut-and-Pasting Generative Adversarial Network for Object Segmentation

This paper proposes a novel self-supervised Cut-and-Paste GAN that performs foreground object segmentation and generates realistic composite images without manual annotations. We accomplish this with a simple yet effective self-supervised approach coupled with a U-Net discriminator. The proposed method extends the ability of standard discriminators to learn not only global data representations via classification (real/fake) but also semantic and structural information through pseudo labels created by the self-supervised task. It empowers the generator to create meaningful masks by forcing it to learn from informative per-pixel and global image feedback from the discriminator. Our experiments demonstrate that the proposed method significantly outperforms state-of-the-art methods on standard benchmark datasets.


Introduction
Generative adversarial networks (GANs) [1] have become a popular class of image synthesis methods due to their demonstrated ability to create high-dimensional samples from a desired data distribution. The primary objective of GANs is to generate diverse, high-quality images while also ensuring the stability of GAN training [2,3]. A GAN consists of generator and discriminator networks trained in an adversarial manner. The generator attempts to synthesize samples that follow the real data distribution to fool the discriminator, whereas the discriminator's goal is to distinguish real data from the generator's fake data. In image segmentation, several compositional generative models have been proposed [4][5][6][7][8][9], where the generator creates a synthesized composite image by copying the object from one image and pasting it into another to fool the discriminator into thinking the synthesized composite image is real. However, the generator may not perform any segmentation at all and still produce a composite whose background looks realistic. Therefore, for effective training, the discriminator must provide the generator with informative learning signals by learning relevant semantics and structures of the data, which may result in more effective generators. However, the current state-of-the-art GANs [9][10][11][12][13] employ discriminators based on a classification network, which learn only a single discriminative signal, namely the difference between real and fake images. In such a non-stationary environment, the generator becomes prone to catastrophic forgetting, which may lead to training instability or mode collapse [14].
To address the issues above, additional discriminatory signals are required to guide the training mechanism and assist the generator in producing high-quality images. This can be accomplished by increasing the capacity of the discriminator with auxiliary tasks and signals. On labeled datasets, these auxiliary tasks resist the forgetting issues and improve the training stability of GANs, but they struggle on unlabeled datasets. Recently, self-supervised learning has been explored in numerous GAN methods [14][15][16][17]. The self-supervised tasks provide the learning environment with additional guidance on top of the standard training mechanism. Most of the recent self-supervision methods for GANs use transformation-based auxiliary tasks. For example, SS-GAN, developed by Chen et al. [14], uses rotation prediction as an auxiliary task. In FX-GAN, Huang et al. [16] use prediction on corrupted real images as the pretext task, and in LT-GAN [15], the authors use the task of distinguishing GAN-induced transformations. However, the goals of these self-supervised transformation tasks need to be consistent with the GAN's goal of mimicking the real data distribution. Moreover, this problem is amplified when the generator's task is to construct segmentation masks from the foreground images.
Recent self-supervised learning methods have demonstrated remarkable promise in global tasks such as image classification by training simple classifiers on features learned through instance discrimination. However, the pretext tasks of these global feature-learning approaches do not explicitly retain spatial information, making them unsuitable for object segmentation [18][19][20][21]. To maintain an enriched real data representation and improve the quality of generated segmentation masks, we propose a Self-Supervised Cut-and-Paste GAN (SS-CPGAN) using the U-net architecture [22], which unifies cut-and-paste adversarial training with a self-supervised task. It allows the discriminator to learn local and global differences between real and fake data. In contrast to the existing transformation-based self-supervision methods, our self-supervised learning method creates pseudo labels using unsupervised segmentation methods. It then forces the discriminator to provide the generator simultaneously with global feedback (real or fake) and per-pixel feedback on the synthesized images with the help of the pseudo labels.
To sum up, the contributions of this paper are as follows:
• This paper proposes a novel Self-Supervised Cut-and-Paste GAN (SS-CPGAN), which unifies cut-and-paste adversarial training with a segmentation self-supervised task. SS-CPGAN leverages unlabeled data to maximize segmentation performance and generates highly realistic composite images.
• The proposed self-supervised task in SS-CPGAN improves the discriminator's representation ability by enhancing structure learning with global and local feedback. This provides the generator with additional discriminatory signals to achieve superior results and stabilize the training process.
• This paper comprehensively analyzes the benchmark datasets and compares the proposed method with the baseline methods.

Unsupervised Object Segmentation via GANs
Unsupervised segmentation using GANs is an active research topic. Several works [4][5][6][7] investigate the use of compositional generative models to obtain high-quality segmentation masks. Copy-pasting GAN [7] performs unsupervised object discovery by extracting foreground objects and then copying and pasting them onto different backgrounds. Similarly, PerturbGAN [5] generates a foreground mask along with a background and foreground image in an adversarial manner. Recently, Abdal et al. (2021) [6] proposed an alpha network that includes two pre-trained generators and a discriminator on top of StyleGAN to generate high-quality masks. These methods learn object segmentation without annotations. However, they are prone to degenerate solutions and other trivial cases. For example, the generator may not perform any segmentation while the background still looks realistic, or the generator may produce foreground masks consisting of all-ones. To avoid such problems, special care must be taken while training compositional generative models. Copy-pasting GAN uses anti-shortcut, border-zeroing, blur, and grounded fakes to prevent trivial solutions [7]. PerturbGAN avoids such solutions by randomly shifting object segments relative to the background [5]. Abdal et al. (2021) [6], in turn, make several changes to the original StyleGAN and use a truncation trick along with regularization to avoid degenerate solutions. While these methods achieve foreground object segmentation [23], the generated segmentation masks are often inferior in quality. Furthermore, due to such non-trivial procedures, the training of GANs becomes very challenging and complex [24].

Self-Supervised Learning
Self-supervised learning is a form of unsupervised learning that learns useful feature representations from unlabeled data with the help of pretext tasks. It helps reduce the enormous cost of data collection and annotation [25,26]. Traditionally, the model is given a pretext task to solve, so that the network learns good feature representations from the pseudo labels that the task creates [27]. Recently, many pretext tasks combined with adversarial training have been introduced [14][15][16][28,29]. The motivation for using self-supervised learning in GANs is to (1) prevent discriminator forgetting [30]; (2) improve training stability [31]; and (3) ensure the high quality of the generated images [32]. These self-supervision techniques rely on pretext tasks, mostly based on transformations (e.g., prediction on rotated images [14], corrupted images [16], GAN-induced transformations [15], clustering representations [28], or a deshuffling task that predicts the shuffled orders [29]), to increase the discriminator's representation power. Such self-supervised tasks may not work well for segmentation due to the inherent differences between the classification and segmentation tasks. In addition, image generation for segmentation requires GANs to capture contextual information between the foreground object and the background, which can be complicated in the absence of relevant visual representations. Unlike the aforementioned methods, we incorporate segmentation as the self-supervised task and couple it with the Cut-and-Paste GAN to obtain high-quality segmentation masks. Most importantly, with our self-supervised approach, no extra care is needed to deal with the trivial solutions prevalent in compositional generative models.

Method
In this section, we first present the standard terminology of adversarial training and the encoder-decoder discriminator. We then introduce our SS-CPGAN method, which builds upon cut-and-paste adversarial training. The unified framework with the segmentation self-supervised task encourages the generator to emphasize local and global structures while synthesizing masks.

Adversarial Training
We build a generative model in which the generator takes the foreground image as input and generates a composite image from the predicted mask, the source foreground image, and the background image to fool the discriminator. Formally, we define the input foreground source image as $I_f \in P_{data}$ and the background image as $I_b \in P_{data}$, where $P_{data}$ denotes the set of input images. We then define a generator (G) that is trained in an adversarial manner against the discriminator (D). During training, the generator predicts a segmentation mask $m_g(I_f) \in [0, 1]$. The composite image $I_C$ is then formed from the predicted mask $m_g(I_f)$, the foreground source image $I_f$, and the resized background image $I_b$. The discriminator's objective is to classify the composite image as real or fake, so the discriminator and the generator of the CPGAN are trained with the standard adversarial objective. The discriminator thus works as a classification network restricted to learning only from the discriminative differences between real and fake samples, and it fails to provide the generator with any information beyond this single real/fake signal. Therefore, we use an encoder-decoder discriminator network with self-supervised learning to mitigate this problem.
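For concreteness, the composite image and the adversarial objective can be written out under two assumptions that the text does not confirm: the composite is a per-pixel alpha blend of foreground and background, and the objective is the standard GAN loss of [1], with foreground images treated as the real samples. The loss names $\mathcal{L}_D$ and $\mathcal{L}_G$ are illustrative, and both are written as quantities to be minimized.

```latex
% Assumed alpha-blending composite (\odot is the element-wise product)
I_C = m_g(I_f) \odot I_f + \bigl(1 - m_g(I_f)\bigr) \odot I_b

% Assumed standard adversarial losses, both minimized
\mathcal{L}_D = -\,\mathbb{E}_{I_f}\bigl[\log D(I_f)\bigr]
                -\,\mathbb{E}_{I_f, I_b}\bigl[\log\bigl(1 - D(I_C)\bigr)\bigr]
\mathcal{L}_G = \mathbb{E}_{I_f, I_b}\bigl[\log\bigl(1 - D(I_C)\bigr)\bigr]
```

Under this form, a mask of all-zeros makes $I_C$ equal to the background image, which is one of the trivial solutions discussed in the related work.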

Encoder-Decoder Discriminator
In this work, we replace the standard classification discriminator with a U-net based discriminator. The U-net is an encoder-decoder architecture for semantic segmentation that consists of convolutional layers and skip connections [33][34][35]. It was initially proposed for biomedical image segmentation, where it achieved precise segmentation results with few training images. It has also demonstrated good results in other applications, including geo-sciences [36], remote sensing [37], and others. Its architecture (see Figure 1) is symmetric and consists of two paths: an encoder that extracts spatial features from the input image (downscaling process), and a decoder that constructs the segmentation map from the extracted feature maps (upscaling process). The use of the U-net architecture in the proposed model offers two major advantages: (1) it enables the simultaneous use of global location and context while predicting masks; (2) it retains the full context of the input images, which is a significant advantage over patch segmentation approaches [38].
We use the encoder part of the U-net as the standard classification discriminator that performs the binary decision on real/fake composite images. Additionally, the decoder part of the U-net architecture is utilized by the self-supervised task to give per-pixel feedback on the synthesized images with the help of pseudo labels. This allows the discriminator to learn relevant local and global differences between real and fake images.

Self-Supervised Cut-and-Paste GAN (SS-CPGAN)
To improve the representation learning ability of the CPGAN, the discriminator must learn semantic and structural information from the synthesized images. Therefore, we use self-supervised learning to build comprehensive representations for the CPGAN. In this work, we employ a segmentation self-supervised task that gives the discriminator enhanced learned features, which ultimately empower the generator to create consistent and structurally coherent masks. The pseudo segmentation masks $m_{US}(I_f) \in [0, 1]$ are created using a graph-based unsupervised segmentation algorithm, GrabCut [39]. These masks act as a suitable prior for the U-net discriminator (see Figure 2 (top)). Here, the discriminator performs two tasks: (1) classifying composite images as real or fake; and (2) performing per-pixel classification on $I_f \in P_{data}$ to generate segmentation masks. Given the self-supervised pseudo labels, we train the discriminator for accurate pixel-level prediction. Integrating self-supervisory signals strengthens the discriminator's localization ability and forces it to learn useful semantic representations. This mechanism enables the generator to achieve better results and makes the training process more stable. Formally, $I_f \in P_{data}$ is the source image containing the foreground object, $P_{data}$ denotes the set of input images, and $m_{US}(I_f) \in [0, 1]$ is the pseudo label created by the unsupervised segmentation algorithm. We define $m_w(I_C) \in [0, 1]$ as the pixel-wise segmentation mask produced by the decoder of the discriminator. We then optimize the overall discriminator loss (Equation (5)), which augments the adversarial loss with a new self-supervision loss (Equation (4)), where L is the cross-entropy loss and λ denotes the weight of the self-supervision loss. This hyperparameter is updated according to the comparison between $m_{US}(I_f)$ and $m_w(I_C)$, measured by intersection-over-union (IoU). The choice of hyperparameter values is explained in the Hyper-Parameter Range section. The framework of self-supervised learning is shown in Figure 2 (bottom).
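As a hedged sketch of what Equations (4) and (5) express, the self-supervision loss can be read as a per-pixel cross-entropy between the decoder output and the pseudo label, added to the adversarial discriminator loss with weight λ. The exact form, the symbol $\mathcal{L}_D$ for the adversarial discriminator loss (taken here as a quantity to be minimized), and the name $\mathcal{L}_D^{\text{total}}$ are assumptions for illustration.

```latex
% Assumed self-supervision loss (cf. Equation (4)):
% per-pixel cross-entropy L between decoder output and pseudo label
\mathcal{L}_{\text{self-supervised}} = L\bigl(m_w(I_C),\; m_{US}(I_f)\bigr)

% Assumed overall discriminator loss (cf. Equation (5))
\mathcal{L}_D^{\text{total}} = \mathcal{L}_D + \lambda\, \mathcal{L}_{\text{self-supervised}}
```

With λ = 0 this reduces to the adversarial discriminator loss alone, so the pseudo labels then no longer influence training.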

Experimentation
This section discusses the implementation details of the proposed method and an extensive set of experiments on various datasets.

Datasets
We use five datasets for the foreground and background sets to train our SS-CPGAN. For the foreground datasets, we use Caltech-UCSD Birds (CUB) 200-2011 [40], Oxford 102 Flowers [41], and FGVC Aircraft (Airplanes) [42]. During training, we do not use the masks available with the Caltech-UCSD Birds and Oxford 102 Flowers datasets. For the background datasets, we use MIT Places2 [43] and SWIMCAT [44], described below. We chose background datasets that resemble the backgrounds of the images in the foreground datasets.
• MIT Places2 is a scene-centric dataset with more than 10 million images covering over 400 unique scene classes. In the experiments, we use the classes rainforest, forest, sky, and swamp as the background set for the Caltech-UCSD Birds dataset, and the class herb garden as the background set for Oxford 102 Flowers.
• Singapore Whole-sky IMaging CATegories (SWIMCAT) contains 784 images of five categories: patterned clouds, clear sky, thick dark clouds, veil clouds, and thick white clouds. We use the SWIMCAT dataset as the background set for the FGVC Aircraft dataset.

Experimental Settings
Our implementation uses the PyTorch framework. We train our models with a batch size of 16 and the Adam optimizer with an initial learning rate of 2 × 10^−4. The training images are resized to 64 × 64, 128 × 128, and 256 × 256. For the self-supervision task, we use the GrabCut technique [39] as the unsupervised segmentation algorithm.

Hyper-Parameter Range
SS-CPGAN introduces a new self-supervised loss, $L_{\text{self-supervised}}$, whose weight λ needs to be validated. Table 1 presents structural similarity (SSIM) scores for different values of λ. SSIM scores lie in [0, 1], with lower values indicating lower quality of the generated images. During the experiments, we find that the optimal value of the hyper-parameter varies with the intersection-over-union (IoU) score between the pseudo label (mask) and the mask predicted by the discriminator. Initially, when IoU < 0.2, λ is set to 0.5 to boost the model's ability to learn useful representations from the pseudo label. When 0.2 < IoU < 0.8, we refine the predicted mask using λ = 0.1. To keep the pseudo labels from compromising the predicted masks, we set λ to 0 when IoU > 0.8.
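The IoU-dependent choice of λ described above can be sketched as a small lookup. This is a minimal illustration, not the authors' code: the function name `self_supervision_weight` is hypothetical, and because the text leaves the boundary values IoU = 0.2 and IoU = 0.8 unspecified, the sketch assumes they fall into the middle regime.

```python
def self_supervision_weight(iou: float) -> float:
    """Return the loss weight lambda for a given IoU between the pseudo label
    and the mask predicted by the discriminator.

    Thresholds follow the text: 0.5 below 0.2, 0.1 in between, 0 above 0.8.
    Assumption: the endpoints 0.2 and 0.8 are treated as the middle regime.
    """
    if not 0.0 <= iou <= 1.0:
        raise ValueError("IoU must lie in [0, 1]")
    if iou < 0.2:
        return 0.5  # early training: lean heavily on the pseudo label
    if iou <= 0.8:
        return 0.1  # refinement: pseudo label acts as a weak regularizer
    return 0.0      # predictions agree closely: switch the pseudo label off


if __name__ == "__main__":
    for iou in (0.05, 0.5, 0.95):
        print(iou, self_supervision_weight(iou))
```

Setting the weight to zero at high IoU means the discriminator's decoder is no longer pulled toward the pseudo labels once its predictions already agree with them closely.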

Results
We use the Fréchet inception distance (FID) score and the mean Intersection over Union (mIoU) metric for the quantitative evaluation of our method. We report the FID score on the CUB2011, Oxford 102 Flowers, and FGVC Aircraft datasets (see Table 2) to compare the SS-CPGAN model with the CPGAN model on images scaled to 64 × 64, 128 × 128, and 256 × 256. For the datasets with available ground truth masks, namely CUB2011 and Oxford 102 Flowers, we use the mIoU metric, as shown in Table 3. In Figure 3, we report the FID scores over the training iterations. Our method stabilizes GAN training across all of the datasets: training converges faster and performance improves consistently throughout training. According to Figure 3, our self-supervised method, SS-CPGAN, outperforms the baseline method, CPGAN, on each dataset used. Furthermore, as shown in Figure 4, the generated masks and composite images of our proposed SS-CPGAN are of superior quality. The standard classification discriminator of CPGAN does not provide effective guidance to the generator. During training, the standard discriminator is not encouraged to learn a more robust data representation. The classification task learns only the representation based on the discriminative differences between real and fake images and fails to give information on why the synthesized image looks fake. In contrast, the self-supervision task assigned to our U-net discriminator provides the generator with global feedback (real or fake) and per-pixel feedback on the masks with the help of pseudo labels. The self-supervisory signals prevent two degenerate scenarios for the generator that the standard discriminator fails to prevent, i.e., creating constant masks of all-zeros or all-ones pixel values. The enhanced discriminator of SS-CPGAN influences the generator to create high-quality masks that are free of such anomalies.
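As a reference for how a mask-overlap score of this kind can be computed, the sketch below evaluates foreground IoU per image and averages it over a set of images. It is an illustration under stated assumptions rather than the evaluation code used in the paper: masks are nested lists of values in [0, 1], they are binarized at an assumed threshold of 0.5, two empty masks are assumed to score 1.0, and the names `binary_iou` and `mean_iou` are hypothetical.

```python
from typing import Sequence

Mask = Sequence[Sequence[float]]


def binary_iou(pred: Mask, target: Mask, threshold: float = 0.5) -> float:
    """Foreground IoU of two equally sized masks after thresholding."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of rows")
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        if len(pred_row) != len(target_row):
            raise ValueError("mask rows must have the same length")
        for p, t in zip(pred_row, target_row):
            p_fg = p > threshold
            t_fg = t > threshold
            intersection += int(p_fg and t_fg)
            union += int(p_fg or t_fg)
    # Assumption: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(preds: Sequence[Mask], targets: Sequence[Mask],
             threshold: float = 0.5) -> float:
    """Average the per-image foreground IoU over a set of mask pairs."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("need the same non-zero number of masks on both sides")
    scores = [binary_iou(p, t, threshold) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)
```

A constant all-ones mask scores only the fraction of the image covered by the object, and an all-zeros mask scores 0 against any non-empty ground truth, so the degenerate solutions discussed above are penalized by this metric.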

Comparison with the State-of-the-Art
We compare our self-supervision-based Cut-and-Paste GAN (SS-CPGAN) with the state of the art. In Table 4, we report and compare FID scores on the Caltech UCSD-Bird 200 dataset. Specifically, the FID scores of StackGANv2 [45], OneGAN [46], LR-GAN [47], ELGAN [48], and FineGAN [49] are listed. The results in Table 4 show that our method outperforms the existing methods. LR-GAN [47] performed the worst, followed by the other methods. The low performance of layer-wise GANs [47,48] is attributed to the fact that these methods are prone to degenerate during the training phase, with all the pixels being assigned to one component. In Table 5, we compare the performance of our method to recent methods using the mIoU metric on Caltech UCSD-Bird 200 and Oxford flowers-102. On the Caltech UCSD-Bird 200 dataset, our method outperforms PerturbGAN [5], ContraCAM [50], ReDO [4], UISB [51], and IIC-seg [52] by a large margin. On the Oxford flowers-102 dataset, we perform better than ReDO [4], Kyriazi et al. [53], and Voynov et al. [54]. Here, ReDO and Kyriazi et al. (2021) are unsupervised approaches, whereas Voynov et al. (2021) is a weakly supervised approach to creating segmentation maps. The ability to leverage pseudo labels in the training of the Cut-and-Paste GAN assists in creating foreground masks of superior quality.

Conclusions
In this work, we proposed a novel Self-Supervised Cut-and-Paste GAN method to learn object segmentation. Specifically, we unified cut-and-paste adversarial training with the proposed segmentation-based self-supervised learning. Unlike the existing transformation-based self-supervised methods, our method improves the discriminator's representation ability by enhancing structure learning with global and local feedback on the synthesized masks. Furthermore, SS-CPGAN overcomes the issue of unwanted trivial solutions (generating constant masks of all-zeros or all-ones pixel values) that plagues the generator. The experimental results show that our approach generates superior quality images and achieves promising results on the benchmark datasets.