Generative Adversarial Network for Overcoming Occlusion in Images: A Survey

Abstract: Although current computer vision systems are closer to human intelligence than ever before when it comes to comprehending the visible world, their performance is hindered when objects are partially occluded. Since we live in a dynamic and complex environment, we encounter more occluded objects than fully visible ones. Therefore, instilling the capability of amodal perception into those vision systems is crucial. However, overcoming occlusion is difficult and comes with its own challenges. The generative adversarial network (GAN), on the other hand, is renowned for its generative power in producing data from a random noise distribution that approaches the samples that come from real data distributions. In this survey, we outline the existing works wherein GAN is utilized in addressing the challenges of overcoming occlusion, namely amodal segmentation, amodal content completion, order recovery, and acquiring training data. We provide a summary of the type of GAN, loss function, the dataset, and the results of each work. We present an overview of the implemented GAN architectures in various applications of amodal completion. We also discuss the common objective functions that are applied in training GAN for occlusion-handling tasks. Lastly, we discuss several open issues and potential future directions.


Introduction
Artificial intelligence has revolutionized the world. With the advent of deep learning and machine learning-based models, many applications and processes in our daily life have been automated. Computer vision is essential in these applications, and while humans can effortlessly make sense of their surroundings, machines are far from achieving that level of comprehension. Our environment is dynamic, complex, and cluttered. Objects are usually partially occluded by other objects. However, our brain completes the partially visible objects without us being aware of it. The capability of humans to perceive incomplete objects is called amodal completion [1]. Unfortunately, this task is not as straightforward for computers to achieve, because occlusion can happen in various ratios, angles, and viewpoints [2]. An object may be occluded by one or more objects, and an object may hide several other objects.
GAN is a structured probabilistic model that consists of two networks, a generator that captures the data distributions and a discriminator that decides whether the produced data come from the actual data distribution or from the generator. The two networks train in a two-player minimax game fashion until the generator can generate samples that are similar to the true samples, and the discriminator can no longer distinguish between the real and the fake samples.
Since its first introduction by Goodfellow et al. in 2014, numerous variants of GAN have been proposed, mainly architecture variants and loss variants [3]. The modifications in the first category can either be in the overall network architecture such as progressive GAN (PROGAN) [4], in the representation of the latent space such as conditional GAN (CGAN) [5], or in modifying the architecture toward a particular application as in CycleGAN [6]. The second category of variants encompasses modifications that are introduced to the loss functions and regularization techniques such as the Wasserstein GAN (WGAN) [7] and PatchGAN [8].
Despite the various modifications, GAN is challenging to train and evaluate. However, due to its generative power and outstanding performance, it has a large number of applications in computer vision, biometric systems, the medical field, etc. Therefore, a considerable number of reviews have been carried out on GAN and its application in different domains (shown in Section 3). Only a limited number of existing reviews mention overcoming occlusion in images with GAN, and they do so briefly. Therefore, in this survey we concentrate on the applications of GAN in amodal completion in detail. In summary, the contributions of this survey paper are:

1. We survey the literature for the available frameworks that utilize GAN in one or more aspects of amodal completion.
2. We discuss in detail the architecture of existing works and how they have incorporated GAN in tackling the problems that arise from occlusion.
3. We summarize the loss function, the dataset, and the reported results of the available works.
4. We also provide an overview of prevalent objective functions in training the GAN model for amodal completion tasks.
5. Finally, we discuss several directions for future research in occlusion-handling tasks wherein GAN can be utilized.
The term "occlusion handling" is polysemous in the computer vision literature. In object tracking, it mostly refers to the ability of the model to address occlusions and resume tracking the object once it re-appears in the scene [9]. In classification and detection tasks, the term indicates determining the depth order of the objects and the occlusion relationship between them [10]. Other works such as [11,12] define occlusion handling as the techniques that interpolate the blank patches in an object, i.e., content completion. However, we believe that, in order to enable a model to address occlusions, it needs the same tasks defined in amodal completion. Therefore, in this survey we use "amodal completion" and "occlusion handling" interchangeably.
As a limitation, we only focus on occlusion handling in a single 2D image. Therefore, occlusion in 3D images, stereo images, and video data is out of the scope of this work. Additionally, we emphasize the GAN component of each architecture we reviewed. As GAN is applied for various tasks in different problems, it is difficult to carry out a systematic comparison of the existing models. Each model is evaluated on a different dataset using a different evaluation metric for a different task. In some cases, the papers do not assess the performance of GAN. In those cases, we present the result of the entire model.
The rest of this document is organized as follows: the methodology for conducting this survey is presented in Section 2. Next, Section 3 mentions the related available articles in the literature. Section 4 introduces the fundamental concepts about GAN and its training challenges, and the aspects of amodal completion. Afterward, Section 5 presents the problems in amodal completion and how GAN has been applied to address them. The common loss functions in GAN for amodal completion are discussed in Section 6. In Sections 7 and 8, future directions and key findings of this survey article are presented. Finally, conclusions are enunciated in Section 9.

Methodology
To perform a descriptive systematic literature review, we begin by forming the research questions which this survey attempts to answer. The questions are (1) what are the challenges in amodal completion? (2) how are GAN models applied to address the problems of amodal completion? Based on the formulated questions, the search terms are identified to find and collect relevant publications. The search keywords are "GAN AND occlusion", "GAN AND amodal completion", "GAN AND occlusion handling", "GAN for occlusion handling", and "GAN for amodal completion".
We inspect several research databases, such as IEEE Xplore, Google Scholar, Web of Science, and Scopus. The list of the returned articles from the search process is sorted and refined by excluding the publications that do not satisfy the research questions. The elimination criteria are as follows: the research article addresses aspects of occlusion handling but does not employ GAN; GAN is used in applications other than amodal completion; the authors have worked on occlusion in 3D data or video frames. Subsequently, each of the remaining publications in the list is investigated and summarized. The articles are examined for the GAN architecture, the objective function, the dataset, the results, and the purpose of using GAN.

Related Works
Occlusion: Handling occlusion has been studied in various domains and applications. Table 1 shows the list of published surveys and reviews of occlusion in several applications. A survey of occlusion handling in generic object detection of still images is provided in [2], focusing on challenges that arise when objects are occluded. Similarly, the most recent survey article by the authors of [13] provides the taxonomy of problems in amodal completion from single 2D images. However, none of those review articles concentrate particularly on the applications of GAN for overcoming occlusion. Other works have focused on occlusion in specific scopes, such as object tracking [14,15], pedestrians [16,17], human faces [18-22], automotive environment [23,24], and augmented reality [25]. In contrast, we review the articles that address occlusion in single 2D images.
Generative Adversarial Network: Due to their power, GANs are ubiquitous in computer vision research. Given the growing body of published works on GAN, there are several recent surveys and review papers in the literature investigating its challenges, variants, and applications. Table 2 contains a list of survey articles that have been published in the last five years. The list does not include papers that specifically focus on GAN applications outside the computer vision field.
The authors in [27-32] discuss the instability problem of GAN along with the various techniques and improvements that have been designed to stabilize its training. Adversarial attacks can be carried out against machine learning models by generating an input sample that leads the model to produce unexpected and undesired results. Sajeeda et al. [27] investigate the various defense mechanisms to protect GAN against such attacks. Li et al. [33] summarize the different models into two groups of GAN architectures: the two-network models and the hybrid models, which are GANs combined with an encoder, autoencoder, or variational autoencoder (VAE) to enhance the training stability. The authors of [34,35] explore the available evaluation metrics of GAN models. Other works have discussed the application of different GAN architectures for computer vision [36,37], image-to-image translation [38,39], face generation [40,41], the medical field [29,42-44], person re-identification (ReID) [45], audio and video domains [29], generating and augmenting training data [46,47], image super-resolution [39,48], and other real-world applications [39,45,49,50]. Some of the mentioned review articles discuss occlusion handling as an application of GAN very briefly, without detailing the architecture, loss functions, and the results.
In this paper, we focus on the works that combine the two above-mentioned topics. Specifically, we want to present the works that have been carried out to tackle the problems that arise from occlusion using GAN. However, depending on the nature of the problems, the applicability of GAN varies. For example, in amodal appearance generation, GAN is the predominant choice of architecture. By comparison, in amodal segmentation and order recovery tasks, it is less used.

Generative Adversarial Network
GAN is an unsupervised generative model that contains two networks, namely a generator and a discriminator. The two networks learn in an adversarial manner similar to the min-max game between two players. The generator tries to generate a fake sample that the discriminator cannot distinguish from the real sample. On the other hand, the discriminator learns to determine whether the sample is real data or generated. The generator G takes a random noise z as input. It learns a probability distribution p_g over data x to generate fake samples that imitate the real data distribution (p_data). Then, the generated sample is forwarded to the discriminator D, which outputs a single scalar that labels the data as real or fake (Figure 1). The classification result is used in training G as gradients of the loss. The loss guides G to generate samples that D is less likely to label as fake. Over time, G becomes better at generating more realistic samples that would confuse D, and D becomes better at detecting fake samples. They both try to optimize their objective functions; in other words, G tries to minimize its cost value and D tries to maximize its cost value.
Equation (1) was designed by Goodfellow et al. [53] to compute the cost value of GAN:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$

where x is the real sample from the training dataset, z is the random noise drawn from the prior p_z, G(z) is the generated sample, and D(x) and D(G(z)) are the discriminator's verdict that x is real and that the fake sample G(z) is real, respectively. There are numerous variations of the original GAN. Among the most prominent ones are CGAN, WGAN, and Self-Attention GAN (SAGAN) [54]. CGAN extends the original GAN by taking an additional input, which is usually a class label. The label conditions the generated data to be of a specific class. Therefore, the loss function in (1) becomes as follows:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x \mid c)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid c)))] \tag{2}$$

where c is the conditional class label. In order to prevent the vanishing gradient and mode collapse problems (discussed below), WGAN applies an objective function that implements the Earth-Mover (EM) [55] distance for comparing the generated and real data distributions. EM helps in stabilizing GAN's training and the equilibrium between the generator and the discriminator. To enforce the Lipschitz constraint that the EM formulation requires of the discriminator (critic), WGAN clips the discriminator's weights to a small fixed range. WGAN Gradient Penalty (WGAN-GP) [56] extends WGAN by introducing a penalty term on the gradient norm instead of the weight clipping to enhance the training stability, convergence power, and output quality of the network. Moreover, SAGAN applies an attention mechanism to extract features from a broader feature space and capture global dependencies instead of the local neighborhoods. Thus, SAGAN can produce high-resolution details in data as it borrows cues from all feature locations, in contrast to the original GAN, which depends on only spatially local points.
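As a minimal numerical sketch of the objectives discussed above, the following Python snippet evaluates the original GAN value function, i.e., the mean of log D(x) over real samples plus the mean of log(1 − D(G(z))) over generated samples, together with a WGAN-style critic loss and weight clipping. The function names, the toy discriminator outputs, and the clipping threshold are illustrative assumptions and are not taken from any of the surveyed works.

```python
import math

def gan_value(d_real, d_fake):
    """Estimate the original GAN value V(D, G) from discriminator outputs.

    d_real: D(x) for real samples, each in (0, 1).
    d_fake: D(G(z)) for generated samples, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def wgan_critic_loss(c_real, c_fake):
    """Loss minimized by a WGAN critic: mean score on fakes minus mean score on reals."""
    return sum(c_fake) / len(c_fake) - sum(c_real) / len(c_real)

def clip_weights(weights, c=0.01):
    """Weight clipping: force every critic weight into [-c, c] (c is an assumed value)."""
    return [max(-c, min(c, w)) for w in weights]

# A discriminator that cannot tell real from fake outputs 0.5 everywhere,
# which gives V = log(0.5) + log(0.5) = -log(4).
confused = gan_value([0.5, 0.5], [0.5, 0.5])

# A confident discriminator pushes V toward its upper bound of 0.
confident = gan_value([0.99, 0.98], [0.01, 0.02])

# Toy critic weights before and after clipping.
clipped = clip_weights([0.5, -0.2, 0.003])
```

In this toy setting, the discriminator maximizes the value while the generator tries to push it back toward −log 4, which is the value attained when the discriminator outputs 0.5 for every sample.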
In theory, both G and D are expected to converge at the Nash equilibrium point. However, in practice this is not as simple as it sounds. Training GANs is challenging, because they are unstable and difficult to evaluate. GANs are notorious for several issues, which are already covered extensively in the literature; therefore, we will only discuss them briefly below.

Achieving Nash Equilibrium
In game theory, a Nash equilibrium is reached when no player can improve its outcome by changing its own strategy while the opponents keep theirs unchanged. In GAN, the game objective changes as the networks take turns during the training process. Therefore, it is particularly difficult to obtain the desired equilibrium point due to the adversarial behavior of its networks. Typically, gradient descent is used to find the minimum value of the cost function during training. However, in GAN, decreasing the cost of one network leads to an increase in the cost of the other network. For instance, if one player minimizes xy with regard to x and another player minimizes −xy with regard to y, gradient descent enters a stable orbit rather than converging to the equilibrium point, which is x = y = 0 [57].
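The xy example can be checked numerically. The sketch below is a toy illustration (the step size, starting point, and function name are assumptions): it applies simultaneous gradient descent to both players and records the distance from the equilibrium at (0, 0). With an infinitesimally small step the trajectory circles the origin; with a finite step size lr, each simultaneous update multiplies the distance by sqrt(1 + lr^2), so the iterates never approach the equilibrium.

```python
import math

def simultaneous_gradient_steps(x, y, lr=0.1, steps=100):
    """Player 1 minimizes x*y over x; player 2 minimizes -x*y over y."""
    radii = [math.hypot(x, y)]  # distance from the equilibrium (0, 0)
    for _ in range(steps):
        grad_x = y    # d(x*y)/dx
        grad_y = -x   # d(-x*y)/dy
        # Both players update at the same time from the current point.
        x, y = x - lr * grad_x, y - lr * grad_y
        radii.append(math.hypot(x, y))
    return x, y, radii

x_final, y_final, radii = simultaneous_gradient_steps(1.0, 1.0)
```

Inspecting `radii` shows a slowly growing sequence, which is the discrete counterpart of the orbit described above.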

Mode Collapse
One of the major problems with GANs is that they are unable to generalize well. This poor generalization leads to mode collapse, which takes two forms. In complete collapse, the generator cannot produce a diverse set of samples at all. In partial collapse, it produces only a specific type (or subset) of the target data that the discriminator does not reject as fake [53,57].
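On a toy problem where the modes of the target data are known, partial collapse can be made visible by counting how many modes receive at least one generated sample. The following sketch assumes one-dimensional samples, hypothetical mode centers, and a hypothetical distance threshold; it is an illustration, not an evaluation protocol used by the surveyed works.

```python
def covered_modes(samples, mode_centers, radius=0.5):
    """Return the indices of modes that have at least one sample within `radius`."""
    covered = set()
    for s in samples:
        # Assign the sample to its nearest mode center.
        nearest = min(range(len(mode_centers)),
                      key=lambda i: abs(s - mode_centers[i]))
        if abs(s - mode_centers[nearest]) <= radius:
            covered.add(nearest)
    return covered

modes = [-4.0, 0.0, 4.0]                          # assumed modes of the real data
healthy = [-4.1, -3.9, 0.2, -0.1, 3.8, 4.3]       # samples spread over all modes
collapsed = [0.1, -0.2, 0.05, 0.0, 0.15, -0.1]    # samples stuck on one mode

healthy_coverage = covered_modes(healthy, modes)
collapsed_coverage = covered_modes(collapsed, modes)
```

A generator whose coverage stays at a strict subset of the modes, as in the second sample set, exhibits the partial collapse described above.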

Vanishing Gradient
GAN is challenging to train due to the vanishing gradient issue. The generator stops learning when the gradients that reach the weights of its initial layers become extremely small. This typically occurs when the discriminator confidently rejects the samples produced by the generator [58].
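The effect can be illustrated with the generator term of the original objective, log(1 − D(G(z))). Assuming that the discriminator ends in a sigmoid applied to a scalar logit (an assumption made for this sketch), the derivative of that term with respect to the logit is −sigmoid(logit), which approaches zero when the discriminator confidently labels a sample as fake. The non-saturating alternative suggested by Goodfellow et al., which minimizes −log D(G(z)), keeps a gradient close to −1 in the same situation.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def saturating_grad(logit):
    """d/d(logit) of log(1 - sigmoid(logit)), the generator term of the minimax loss."""
    return -sigmoid(logit)

def non_saturating_grad(logit):
    """d/d(logit) of -log(sigmoid(logit)), the non-saturating generator loss."""
    return -(1.0 - sigmoid(logit))

# A strongly negative logit means D is confident that the sample is fake.
confident_reject = -10.0
g_sat = saturating_grad(confident_reject)
g_nonsat = non_saturating_grad(confident_reject)
```

Here `g_sat` is nearly zero, so almost no learning signal reaches the generator, whereas `g_nonsat` remains close to −1.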

Lack of Evaluation Metrics
Despite the growing progress in the GAN architecture and training, evaluating it remains a challenging task. Although several metrics and methods have been proposed, there is no standard measure for evaluating the models. Most of the available works propose a new technique to assess the strength and the limitation of their model. Therefore, finding a consensus evaluation metric remains an open research question [59].

Amodal Completion
Amodal completion is the natural ability of humans to discern the physical objects in the environment even if they are occluded. Our environment contains more partially visible or temporarily occluded objects than fully visible ones. Hence, the input to our visual system is mostly incomplete and segmented. Yet, we innately and effortlessly imagine the invisible parts of the object in our mind and perceive the object as complete [1]. For instance, if we see only part of a pair of striped legs at the zoo, we can tell that there is a zebra in that enclosure.
As natural and seamless as this task is for humans, for computers it is challenging yet essential. This is because the performance of most real-world computer vision applications drops when objects are occluded. For example, in autonomous driving, the vehicle must be able to recognize and identify the complete contour of the objects in the scene to avoid accidents and drive safely.
Our environment is complex, cluttered, and dynamic. An object may be behind one or more other objects, or an object may hide one or more other objects. Thus, the possible occlusion patterns between objects are endless, and the shape and appearance of occluded objects are unbounded. Whenever a visual system requires de-occlusion, there are three sub-tasks involved in the process (Figure 2). Firstly, inferring the complete segmentation mask of the partially visible objects, including the hidden region. Secondly, predicting and reconstructing the RGB content of the occluded area based on the visible parts of the object and/or the image. Often, these two sub-tasks require the result of the third sub-task, which determines the depth order of the objects and the relationship between them, i.e., which object is the occluder and which one is the occludee. Several of the existing works address these sub-tasks simultaneously.
Designing and training a model that could perform any or all of the above-mentioned sub-tasks presents several challenges. In the following section, we explore the existing works in the literature wherein a GAN architecture is implemented to address those obstacles.

GAN in Amodal Completion
The taxonomy of the challenges in amodal completion is presented by Ao et al. [13]. In the following sections, we present how GAN has been used to address each challenge. In exploring the existing research papers, we emphasized the aspects of amodal completion wherein GAN was utilized, not the original aim of the paper.

Amodal Segmentation
Image segmentation tasks such as semantic segmentation, instance segmentation, or panoptic segmentation solely predict the visible shape of the objects in a scene. Therefore, these tasks mainly operate with modal perception. Amodal segmentation, on the other hand, works with amodal perception. It estimates the shape of an object beyond the visible region, i.e., the visible mask (also called the modal mask) and the mask for the occluded region, from the local and the global visible visual cues (see Figure 3).
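The relationship between the modal mask, the amodal mask, and the occluded region can be written down directly for binary masks. The sketch below assumes masks stored as nested lists of 0/1 values of equal size; the helper names and the toy masks are hypothetical.

```python
def occluded_region(modal_mask, amodal_mask):
    """Pixels that belong to the amodal mask but not to the modal (visible) mask."""
    return [[a & (1 - m) for m, a in zip(m_row, a_row)]
            for m_row, a_row in zip(modal_mask, amodal_mask)]

def occlusion_ratio(modal_mask, amodal_mask):
    """Fraction of the full object extent that is hidden."""
    hidden = sum(map(sum, occluded_region(modal_mask, amodal_mask)))
    total = sum(map(sum, amodal_mask))
    return hidden / total if total else 0.0

# Toy 3x4 masks: the object spans six pixels, three of which are visible.
amodal = [[0, 1, 1, 1],
          [0, 1, 1, 1],
          [0, 0, 0, 0]]
modal = [[0, 1, 1, 0],
         [0, 1, 0, 0],
         [0, 0, 0, 0]]

hidden = occluded_region(modal, amodal)
ratio = occlusion_ratio(modal, amodal)
```

An amodal segmentation model is expected to predict `amodal` given the image and, in most of the methods discussed in this section, `modal` as an additional input.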
Amodal segmentation is rather challenging, especially if the occluder is of a different category (e.g., the occlusion between vehicles and pedestrians). The visible region may not hold sufficient information to help in determining the whole extent of the object. Conversely, if the occluder is an instance of the same category (e.g., occlusion between pedestrians), the features of both objects are similar, so it becomes difficult for the model to estimate where the boundary of one object ends and the second one begins. In either case, the visible region plays a significant role in guiding the amodal mask generation process. Therefore, most existing methods require the modal mask as input. To alleviate the need for a manually annotated modal mask, many works apply a pre-trained instance segmentation network to obtain the visible mask and utilize it as input.
In the following, we describe the architecture of the GAN-based models that are used in generating the amodal mask of the occluded objects.
A two-hourglass generator: Zhou et al. [60] apply a pre-trained instance segmentation network on the input image to obtain an initial mask and feed it to a two-stage pipeline for human de-occlusion. Given the initial mask, the generator implements two hourglass modules to refine and complete the modal mask to produce the amodal mask at the end. A discriminator enhances the quality of the output amodal mask. An additional parsing result accompanies the result of the generator, which is employed by a Parsing Guided Attention (PGA) module to reinforce the semantic features of body parts at multiple scales as a part of a parsing guided content recovery network. The latter uses a combination of UNet [61] and partial convolutions [62] in generating the content of the invisible area. The additional parsing branches add extra semantic guidance, which improves the final invisible mask.
A coarse-to-fine architecture with contextual attention: Xiong et al. [63] firstly employ a contour detection module to extract the visible contour of an object and then complete it through a contour completion network. The contour detection module uses DeepCut [64] to segment salient objects, and performs noise removal and edge detection to extract the incomplete contour of the object from the segmentation map. Then, the contour completion network learns to infer the missing part of the foreground contour. The contour completion network is composed of a generator and a discriminator. The generator has a coarse-to-fine architecture consisting of a coarse network and a refinement network with a similar encoder-decoder structure, except that the refinement network employs a contextual attention layer [65]. Finally, the completed contour along with the ground-truth image are fed to the discriminator, which produces a score map to indicate whether each region in the generated contour mask is real and can decide whether the mask aligns with the contour of the image. The discriminator is a fully convolutional PatchGAN [8] trained with a hinge loss. The results show that the contour completion step assists in the explicit modeling of the background and the foreground layer borders, which leads to less evident artifacts in the completed foreground objects.
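To make the discriminator objective concrete, the following sketch computes a hinge loss over a PatchGAN-style score map, in which every entry scores one image region. It assumes the commonly used hinge formulation for GANs (the surveyed paper is only stated to use "a hinge loss"), and the score maps are toy values rather than network outputs.

```python
def _flatten(score_map):
    return [s for row in score_map for s in row]

def hinge_d_loss(real_scores, fake_scores):
    """Discriminator hinge loss averaged over all patches of the score maps."""
    real = _flatten(real_scores)
    fake = _flatten(fake_scores)
    loss_real = sum(max(0.0, 1.0 - s) for s in real) / len(real)
    loss_fake = sum(max(0.0, 1.0 + s) for s in fake) / len(fake)
    return loss_real + loss_fake

def hinge_g_loss(fake_scores):
    """Generator loss: raise the discriminator's scores on generated patches."""
    fake = _flatten(fake_scores)
    return -sum(fake) / len(fake)

# Toy 2x2 score maps: one score per image patch.
real_map = [[2.0, 0.5], [1.0, 1.5]]
fake_map = [[-2.0, -0.5], [0.0, -1.0]]

d_loss = hinge_d_loss(real_map, fake_map)
g_loss = hinge_g_loss(fake_map)
```

Only patches whose scores fall inside the margin (real scores below 1, fake scores above −1) contribute to the discriminator loss, so confidently classified regions stop producing gradients for the discriminator.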
A generator with a priori knowledge: The authors of [66] also utilize a pre-trained instance segmentation model to obtain the visible human mask, which is fed with the input image into a GAN-based model to produce the amodal mask of occluded humans. The model predicts the mask of the invisible region through an hourglass network structure. The local fine features and the higher-level semantic details are aggregated in the encoding stage, and they are added to each layer's feature maps in the decoding stage. The predicted amodal mask is evaluated by a PatchGAN discriminator. To improve the amodal segmentation outcome, some typical human poses are concatenated with the feature maps as a priori information to be used in the decoding stage. Although the a priori knowledge enhances the predicted amodal masks, it restricts the application of the model to humans with specific poses.
A coarse-to-fine architecture with multiple discriminators: In applications such as visual surveillance, autonomous driving, path prediction, and intelligent traffic control, detecting vehicles and pedestrians is essential. However, these are often obstructed by other objects, which makes the task of learning the visual representation of the intended objects more challenging. The model in [67] aims to recover the amodal mask of a vehicle and the appearance of its hidden regions iteratively. To tackle both tasks, the model is composed of two parts: a segmentation completion module and an appearance recovery module. The first module follows an initial-to-refined framework. Firstly, an initial segmentation mask is generated by passing an input image with occluded vehicles through a pre-trained segmentation network. Then, the input image is concatenated with the output of the initial stage and fed into the next stage. The second stage, in contrast to a standard GAN, has a generator with an encoder-decoder structure, an object discriminator, and an instance discriminator. To assist the model in producing more realistic masks, an additional 3D model pool is employed. This provides silhouette masks as adversarial samples, which motivates the model to learn the defining characteristics of actual vehicle masks. The object discriminator, which uses a StackGAN structure [68], enforces the output mask to be similar to a real vehicle, whereas the instance discriminator with a standard GAN structure aims at producing an output mask similar to the ground-truth mask. The recovered mask is fed to the appearance recovery module to regenerate the whole foreground vehicle. Both modules are trained with reconstruction loss (i.e., L1 loss) and perceptual loss.
Although using the 3D model pool and multiple discriminators produces better amodal masks, the model requires multiple iterations to progressively eliminate the occlusions when it is tested on synthetic images with different types of synthetic occlusions. However, on real images with less severe occlusions, the model is unable to refine the results beyond three iterations, and its performance declines.
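The reconstruction (L1) and perceptual terms mentioned for the model in [67] can be sketched as follows. This is a simplified illustration: images and features are flattened into plain lists, the perceptual term is taken as a mean squared distance between feature vectors that are assumed to be precomputed by a fixed pre-trained network, and the weights `w_rec` and `w_perc` are hypothetical values, not those reported by the authors.

```python
def l1_loss(pred, target):
    """Mean absolute difference between predicted and ground-truth pixel values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def perceptual_loss(pred_features, target_features):
    """Mean squared distance between precomputed feature vectors (assumed form)."""
    return sum((p - t) ** 2
               for p, t in zip(pred_features, target_features)) / len(pred_features)

def total_loss(pred, target, pred_features, target_features,
               w_rec=1.0, w_perc=0.1):
    """Weighted sum of both terms; the weights are illustrative assumptions."""
    return (w_rec * l1_loss(pred, target)
            + w_perc * perceptual_loss(pred_features, target_features))

rec = l1_loss([0.0, 1.0, 0.5], [0.0, 0.0, 1.0])
perc = perceptual_loss([1.0, 2.0], [1.0, 0.0])
loss = total_loss([0.0, 1.0, 0.5], [0.0, 0.0, 1.0], [1.0, 2.0], [1.0, 0.0])
```

In a full model, these terms would be added to the adversarial losses of the discriminators.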

Order Recovery
In order to apply any de-occlusion or completion process, it is essential to determine the occlusion relationship and identify the depth order between the overlapping components of a scene. Other processes such as amodal segmentation and content completion depend on the predicted occlusion order to accomplish their tasks. Therefore, vision systems need to distinguish the occluders from the occludees, and to determine whether an occlusion exists between the objects. Order recovery is vital in many applications, such as semantic scene understanding, autonomous driving, and surveillance systems.
The following works attempt to retrieve the depth order/layer order between the objects in a scene through utilizing a GAN-based architecture.
A generator with multiple discriminators: Dhamo et al. [69] present a method to achieve layered depth prediction and view synthesis. Given a single RGB image as input, the model learns to synthesize an RGB-D view from it and hallucinates the missing regions that were initially occluded. Firstly, the framework uses a fully convolutional network to obtain a depth map and a segmentation mask for foreground and background elements from the input image. Depending on the predicted masks, the foreground objects are erased from the input image and the obtained depth map (RGB-D). Then, a PatchGAN-based network [8] is used to refill the holes in the RGB-D background image that were created by removing the foreground objects. The network has a pair of discriminators to enforce inter-domain consistency. This method has data limitations, as it is difficult to obtain ground-truth layered depth images in real-world data.
Inferring the scene layout beyond the visible view and hallucinating the invisible parts of the scene is called amodal scene layout. MonoLayout, proposed in [70], provides the amodal scene layout in the form of a bird's eye view (BEV) in real time. With a single input image of a road scene, the framework delivers a BEV of static (such as sidewalks and street areas) and dynamic (vehicles) objects in the scene, including the partially visible components. The model contains a context encoder, two decoders, and two discriminators. Given the input image, the encoder captures the multi-scale context representations of both static and dynamic elements. Then, the context features are shared with two decoders, an amodal static scene decoder and a dynamic scene decoder, to predict the static and dynamic objects in BEV. The decoders are regularized by two corresponding discriminators to encourage the predictions to be similar to the ground-truth representations. Sharing the context between the decoders improves the performance of amodal scene layout estimation. MonoLayout has 19.6 M parameters and runs inference at 32 fps. However, its generalization to unseen scenarios needs improvement.
A single generator and discriminator: Zheng et al. [71] tackle amodal scene understanding by creating a layer-by-layer pipeline, the Completed Scene Decomposition Network (CSDNet), to extract and complete the RGB appearance of objects from a scene and make sense of their occlusion relations. In each layer, CSDNet only separates the foreground elements that are without occlusion. This way, the system identifies and fills the invisible portion of each object. Then, the completed image is fed again to the model to segment the fully visible objects. In this iterative manner, the depth order of the scene is obtained, which can be used to recompose a new scene. The model is composed of a decomposition network and a completion network. The decomposition network follows Mask R-CNN [72] with an additional layer classification branch to estimate the instance masks and determine whether an object is fully or partially visible. The predicted masks are forwarded to the completion network, which uses an encoder-decoder to complete the resultant holes in the masked image. By masking the fully visible objects in each step and iteratively completing the objects in the scene, the earlier completion information is propagated to the later steps. Nonetheless, the model is trained on a rendered dataset; therefore, it cannot generalize well to real scenes that are unlike the rendered ones. In addition, the completion errors accumulate over the layers, which leads to a drop in accuracy when the occlusion layers are too numerous.
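The layer-by-layer idea can be illustrated independently of any network. The sketch below assumes that the pairwise occlusion relations are already known as (occluder, occludee) pairs, whereas CSDNet predicts visibility with its decomposition network and completes the image between steps; the object names and the function are hypothetical. At each step, the objects that are not occluded by any remaining object are peeled off as one layer.

```python
def recover_layer_order(objects, occludes):
    """Peel off fully visible objects layer by layer.

    objects:  iterable of object identifiers.
    occludes: set of (occluder, occludee) pairs.
    Returns a list of layers, ordered from front to back.
    """
    remaining = set(objects)
    layers = []
    while remaining:
        # Fully visible objects: not occluded by any object still in the scene.
        visible = {o for o in remaining
                   if not any((other, o) in occludes for other in remaining)}
        if not visible:
            raise ValueError("cyclic occlusion: no fully visible object left")
        layers.append(sorted(visible))
        remaining -= visible
    return layers

objects = ["cup", "book", "table", "lamp"]
occludes = {("cup", "book"), ("book", "table"), ("lamp", "table")}
layers = recover_layer_order(objects, occludes)
```

The resulting list of layers gives the depth order from the front of the scene to the back. Mutual occlusions, which this simple peeling scheme cannot resolve, raise an error.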
On the other hand, Dhamo et al. [73] present an object-oriented model with three parts: object completion, layout prediction, and image re-composition. While the object completion unit attempts to fill the occluded area in the input RGBA image through an auto-encoder, the layout prediction uses a GAN architecture to estimate the RGBA-D (the RGBA and depth images) background, i.e., the object-free representation of the scene. The model infers the layered representation of a scene from a single image and produces a flexible number of output layers based on the complexity of the scene. However, the global and the local contexts, and the spatial relationship between the objects in the scene, are not considered.

Amodal Appearance Reconstruction
Recently, there has been significant progress in image inpainting methods, such as the works in [65,74]. However, these models recover the plausible content of a missing area with no knowledge about which object is involved in that part. On the contrary, amodal appearance reconstruction (also known as amodal content completion) models require identifying individual elements in the scene, and recognizing the partially visible objects along with their occluded areas, to predict the content for the invisible regions.
Therefore, the majority of the existing frameworks follow a multi-stage process that addresses amodal segmentation and amodal content completion as one problem. They depend on a segmentation module to infer the binary segmentation mask for the occluded and non-occluded parts of the object. The mask is then forwarded as input to the amodal completion module, which tries to fill in the RGB content for the missing region indicated by the mask.
Among the three sub-tasks of amodal completion, GAN is most widely used in amodal content completion. In this section, we present the usage of GAN in amodal content completion for a variety of computer vision applications.

Generic Object Completion
GANs are unable to implicitly learn the structure in an image without additional information about the structures or annotations of the foreground and background objects during training. Therefore, Xiong et al. [63] propose a model that is made up of a contour detection module, a contour completion module, and an image completion module. The first two modules learn to detect and complete the foreground contour. Then, the image completion module is guided by the completed contour to determine the position of the foreground and the background pixels. The incomplete input image, the completed contour, and the hole mask are fed to the image completion network to fill the missing part of the object. The network has a coarse-to-fine architecture similar to that of the contour completion module. However, the depth of the network weakens the effect of the completed contour. Therefore, the completed contour is passed to both the coarse network and the refinement network. The discriminator of the image completion network is a PatchGAN that is trained with hinge loss and takes as input either the generated image or the ground-truth image, together with the hole mask. The experiments show that, under the guidance of the contour completion, the model can generate completed images with fewer artifacts and complete objects with more natural boundaries. However, the model fails to produce results free of artifacts and color discrepancy around the holes because it extracts features with vanilla convolutions.
Therefore, Zhan et al. [75] use CGAN and partial convolution [62] to regenerate the content of the missing region. The authors apply the concept of partial completion to de-occlude the objects in an image. In the case of an object hidden by multiple other objects, the partial completion is performed by considering one object at a time. The model partially completes both the mask and the appearance of the object in question through two networks, namely Partial Completion Network-mask (PCNet-M) and Partial Completion Network-content (PCNet-C), respectively. A self-supervised approach is implemented to produce labeled occluded data to train the networks, i.e., a masked region is obtained by positioning a randomly selected occluder from the dataset on top of the concerned object. Then, the masked occludee is passed to the PCNet-M to reproduce the mask of the invisible area, which in turn is given to the PCNet-C. Although the self-supervised and partial completion techniques alleviate the need for annotated training data, the generated content contains remnants of the occluder, and its quality degrades for textured content.
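As an illustration of the self-supervised idea described above, the following minimal sketch builds one training pair by placing an occluder mask over a target mask. The function name `make_partial_completion_pair` and the toy 6 x 6 masks are hypothetical and are not taken from [75]; the sketch only covers how an occluded input and its target could be derived, not the networks themselves.

```python
import numpy as np

def make_partial_completion_pair(target_mask, occluder_mask):
    """Build one self-supervised sample from two boolean masks of equal shape.

    Returns (visible, erased, target): the part of the target that stays
    visible, the part hidden by the occluder, and the mask to be recovered.
    """
    target_mask = np.asarray(target_mask, dtype=bool)
    occluder_mask = np.asarray(occluder_mask, dtype=bool)
    erased = target_mask & occluder_mask       # target pixels now hidden
    visible = target_mask & ~occluder_mask     # target pixels still visible
    return visible, erased, target_mask

# Toy example (assumed data): a 4 x 4 square object and an overlapping occluder.
target = np.zeros((6, 6), dtype=bool)
target[1:5, 1:5] = True
occluder = np.zeros((6, 6), dtype=bool)
occluder[3:6, 3:6] = True

visible, erased, full = make_partial_completion_pair(target, occluder)
```

A mask completion network would then receive `visible` together with the occluder mask as input and be supervised with `full`, so no manual amodal annotation is needed.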
Ehsani et al. [76] trained a GAN-based model dubbed SeGAN. The model consists of a segmentator which is a modified ResNet-18 [77], and a painter which is a CGAN. The segmentator produces the full segmentation mask (amodal mask) of the objects including the occluded parts. On the other hand, the painter, which consists of a generator and a discriminator, takes in the output from the segmentator and reproduces the appearance of the hidden parts of the object based on the amodal mask. The final output from the generator is a de-occluded RGB image which is then fed into the discriminator. As a drawback, the model is trained on a synthetic dataset, which presents an inevitable domain gap between the training images and the real-world testing images.
Furthermore, Kahatapitiya et al. [78] aim to detect and remove the unrelated occluders, and inpaint the missing pixels to produce an occlusion-free image. The unrelated objects are identified based on the context of the image and a language model. Through a background segmentator and the foreground segmentator, the background and foreground objects are extracted, respectively. The foreground extractor produces pixel-wise annotations for the objects (i.e., thing class) and the background segmentator outputs the background objects (i.e., stuff class). Then, the relation predictor uses the annotations to estimate the relation of each foreground object to the image context based on a vector embedding of class labels trained with a language model. The result of the relation prediction can detect any unrelated objects which are considered as unwanted occlusion. Consequently, the relations and pixel annotations of the thing class are fed into the image inpainter to mask and recreate the pixels of the hidden object. The image inpainter is based on the contextual attention model by Yu et al. [65], which employs a coarse-to-fine model. In the first stage, the mask is coarsely filled in. Then, the second stage utilizes a local and a global WGAN-GP [56] to enhance the quality of the generated output from the coarse stage. A contextual attention layer is implemented to attend to similar feature patches from distant pixels. The local and global WGAN-GP enforce global and local consistency of the inpainted pixels [65]. The contextual information helps in generating a de-occluded image; however, the required class labels of the foreground and background objects limit the applicability of the method.

Face Completion
Occlusion is usually present in faces. The occluding objects can be glasses, scarves, food, cups, microphones, etc. The performance of biometric and surveillance systems can degrade when faces are obstructed or covered by other objects, which raises a security concern. However, compared to background completion, facial images are more challenging to complete since they contain more appearance variations, especially around the eyes and the mouth. In the following, we categorize the available works for face completion based on their architecture.
A single generator and discriminator: Cai et al. [79] present an Occlusion-Aware GAN (OA-GAN), with a single generator and discriminator, that alleviates the need for an occlusion mask as an input. Through using paired images with known mask of artificial occlusions and natural images without occlusion masks, the model learns in a semi-supervised way. The generator has an occlusion-aware network and a face completion network. The first network estimates the mask for the area where the occlusion is present, which is fed into the second network. The latter then completes the missing region based on the mask. The discriminator employs an adversarial loss, and an attribute preserving loss to ensure that the generated facial image has similar attributes to the input image.
Likewise, Chen et al. [80] depend on their proposed OA-GAN to automatically identify the occluded region and inpaint it. They train a DCGAN on occlusion-free facial images, and use it to detect the corrupted regions. During the inpainting process, a binary matrix is maintained, which indicates the presence of occlusion in each pixel. The detection of occluded region alleviates the need for any prior knowledge of the location and type of the occlusion masks. However, incorrect occlusion detection leads to partially inpainted images.
Facial Structure Guided GAN (FSG-GAN) [81] is a two-stage model with a single generator and discriminator. In the first part, a variational auto-encoder estimates the facial structure, which is combined with the occluded image and fed into the generator of the second stage. The generator (UNet), guided by the facial structure knowledge, synthesizes the deoccluded image. A multi-receptive fields discriminator encourages a more natural and less ambiguous appearance of the output image. Nevertheless, the model cannot reliably remove occlusion from a face image with a large pose variation, and it cannot correctly predict the facial structure under severe occlusions, which leads to unpleasant results.
Multiple discriminators: Several of the existing works employ multiple discriminators to ensure that the completed facial image is semantically valid and consistent with the context of the image. Li et al. [82] train a model with a generator, a local discriminator, a global discriminator, and a parsing network to generate an occlusion-free facial image. The original image is masked with a randomly positioned noisy square and fed into the generator which is designed as an auto-encoder to fill the missing pixels. The discriminators, which are binary classifiers, enhance the semantic quality of the reconstructed pixels. Meanwhile, the parsing network enforces the harmony of the generated part and the present content. The model can handle various masks of different positions, sizes, and shapes. However, the limitations of the model include the facts that (1) it cannot recognize the position/orientation of the face and its corresponding elements which leads to unpleasant generative content; (2) it fails to correctly recover the color of the lips; (3) it does not capture the full spatial correlations within neighboring pixels.
Similarly, Mathai et al. [83] use an encoder-decoder for the generator, a PatchGAN-based local discriminator, and a WGAN-GP [56]-based global discriminator to address occlusions on distinctive areas of a face and inpaint them. Consequently, the model's ability to recognize faces improves. To minimize the effect of the masked area on the extracted features, two convolutional gating mechanisms are evaluated: a hard gating mechanism known as partial convolution [62] and a soft gating method based on the sigmoid function.
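To make the distinction between the two gating mechanisms concrete, the sketch below implements a single-channel partial convolution in its commonly described form (the response is computed over valid pixels only, re-scaled by the ratio of window size to valid pixels, and the mask is updated) alongside a sigmoid-based soft gate. It is a simplified illustration under these assumptions and not the implementation used in [83]; `partial_conv2d` and `soft_gate` are hypothetical names.

```python
import numpy as np

def partial_conv2d(x, mask, weight, bias=0.0):
    """Hard gating: single-channel partial convolution (valid padding, stride 1).

    x: (H, W) image; mask: (H, W) with 1 = valid pixel and 0 = hole;
    weight: (k, k) kernel. Returns (output, updated_mask).
    """
    k = weight.shape[0]
    out_h, out_w = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.zeros((out_h, out_w))
    new_mask = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            m = mask[i:i + k, j:j + k]
            valid = m.sum()
            if valid > 0:
                patch = x[i:i + k, j:j + k] * m
                # Re-scale to compensate for the pixels that were masked out.
                out[i, j] = (weight * patch).sum() * (k * k / valid) + bias
                new_mask[i, j] = 1.0  # window saw at least one valid pixel
    return out, new_mask

def soft_gate(features, gate_logits):
    """Soft gating: scale features by a sigmoid gate in (0, 1)."""
    return features * (1.0 / (1.0 + np.exp(-gate_logits)))

# Toy example (assumed data): constant image with a one-pixel hole.
x = np.ones((5, 5))
mask = np.ones((5, 5))
mask[2, 2] = 0.0
kernel = np.full((3, 3), 1.0 / 9.0)
out, new_mask = partial_conv2d(x, mask, kernel)
gated = soft_gate(np.array([2.0]), np.array([0.0]))
```

With the averaging kernel, the re-scaling keeps the response of the constant image unchanged despite the hole, which is the effect the hard gate is meant to achieve; the soft gate instead attenuates features continuously.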
Liu et al. [84] also follow the same approach by implementing a generator (autoencoder), a local discriminator, and a global discriminator. A self-attention mechanism is applied in the global discriminator to enforce complex geometric constraints on the global image structure and to model long-range dependencies. The authors report the results for the facial landmark detection only, without providing the experimental data.
Moreover, Cai et al. [85] present FCSR-GAN to create a high-resolution deoccluded image from a low-resolution facial image with partial occlusions. At first, the model is pre-trained for face completion to recover the missing region. Afterward, the entire framework is trained end-to-end. The generator comprises a face completion unit and a face super-resolution unit. The low-resolution occluded input image is fed into the face completion module to fill the missing region. The face completion unit follows an encoder-decoder layout, and the overall architecture is similar to the generative face completion by Li et al. [82]. Then, the occlusion-free image is fed into the face super-resolution module, which adopts an SRGAN [86]. The network is trained with a local loss, a global loss, and a perceptual loss to ensure that the generated content is consistent with the local details and holistic contextual information. An additional face parsing loss and perceptual loss are computed to produce more realistic face images.
Furthermore, face completion can improve the resistance of face identification and recognition models to occlusion. The authors in [87] propose a two-unit de-occlusion distillation pipeline. In the de-occlusion unit, a GAN is implemented to recover the appearance of pixels covered by the mask. Similar to the previously mentioned works, the output of the generator is evaluated by local and global discriminators. In the distillation unit, a pre-trained face recognition model is employed as a teacher, and its knowledge is used to train the student model to identify masked faces by learning representations for recovered faces with similar clustering behaviors as the original ones. This teaches the student model how to fill in the information gap in appearance space and in identity space. The model is trained with a single occlusion mask at a time; however, in real-world instances, multiple masks cover large discriminative regions of the face.
Multiple generators: In contrast to the OA-GAN presented by Cai et al. [79], the authors of [88] propose a two-stage OA-GAN framework with two generators and two discriminators. The generators ($G_1$ and $G_2$) follow a UNet encoder-decoder architecture, while the discriminators adopt PatchGAN. $G_1$ takes an occluded input image and disentangles the mask of the image to produce a synthesized occlusion. $G_2$ then takes the output from $G_1$ in order to remove the occlusions and generate a deoccluded image. Therefore, the occlusion generator (i.e., $G_1$) plays a fundamental role in the deocclusion process, and a failure in the occlusion generator leads to incorrect images.
Multiple generators and discriminators: While using multiple discriminators ensures the consistency and the validity of the produced image, some available works employ multiple generators, especially when tackling multiple problems. For example, Jabbar et al. [89] present a framework known as Automatic Mask Generation Network for Face Deocclusion using Stacked GAN (AFD-StackGAN) that is composed of two stages to automatically extract the mask of the occluded area and recover its content. The first stage employs an encoder-decoder in its generator to generate the binary segmentation mask for the invisible region. The produced mask is further refined with erosion and dilation morphological techniques. The second stage eliminates the mask object and regenerates the corrupted pixels through two pairs of generators and discriminators. The occluded input image and the extracted occlusion mask are fed into the first generator to produce a completed image. The initial output from the first generator is enhanced by rectifying any missing or incorrect content in it. Two PatchGAN discriminators are applied to the results of the generators to ensure that the restored face's appearance and structural consistency are retained. AFD-StackGAN can remove various types of occlusion masks that cover a large area of the face. However, it is trained with synthetic data, so a mismatch between the training images and real-world testing images is likely.
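The morphological refinement of a predicted binary mask mentioned above can be sketched as follows. The source does not state the order or the structuring element used in [89]; this sketch assumes a 3 x 3 square element and applies erosion followed by dilation (an opening), which removes isolated false-positive pixels while keeping larger regions. The function names are hypothetical.

```python
import numpy as np

def dilate(mask, k=3):
    """Binary dilation with a k x k square structuring element."""
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    padded = np.pad(mask, k // 2, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    for dy in range(k):
        for dx in range(k):
            out |= padded[dy:dy + h, dx:dx + w]  # any neighbour set
    return out

def erode(mask, k=3):
    """Binary erosion with a k x k square structuring element."""
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    padded = np.pad(mask, k // 2, mode="constant", constant_values=False)
    out = np.ones_like(mask)
    for dy in range(k):
        for dx in range(k):
            out &= padded[dy:dy + h, dx:dx + w]  # all neighbours set
    return out

def refine_mask(mask, k=3):
    """Assumed refinement: opening = erosion followed by dilation."""
    return dilate(erode(mask, k), k)

# Toy example (assumed data): a 3 x 3 occlusion region plus one stray pixel.
raw = np.zeros((7, 7), dtype=bool)
raw[2:5, 2:5] = True
raw[0, 6] = True
refined = refine_mask(raw)
```

Swapping the two calls in `refine_mask` gives a closing instead, which fills small holes inside the mask rather than removing specks.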
In the same way, Li et al. [90] employ two generators and three domain-specific discriminators in their proposed framework called disentangling and fusing GAN (DF-GAN). They treat face completion as disentangling and fusing of clean faces and occlusions. This way, they remove the need for paired samples of occluded images and their congruent clean images. The framework works with three domains that correspond to the distribution of occluded faces, clean faces, and structured occlusions. In the disentangling module, an occluded facial image is fed into an encoder which encodes it to the disentangled representations. Thereafter, two decoders produce the corresponding deoccluded image and occlusion, respectively. In other words, the disentangling network learns how to separate the structured occlusions and the occlusion-free images. The fusing network, on the other hand, combines the latent representations of clean faces and occlusions, and creates the corresponding occluded facial image, i.e., it learns how to generate images with structured occlusions. However, real-world occlusions are of arbitrary shape and size, not necessarily structured.
Coarse-to-fine architecture: In contrast to the previously mentioned works, where a single output is generated, Jabbar et al. [91] propose a two-stage Face De-occlusion using Stacked Generative Adversarial Network (FD-StackGAN) model that follows the coarse-to-fine approach. The model attempts to remove the occlusion mask and fill in the affected area. In the first stage, the network produces an initial deoccluded facial image. The second stage refines the initial generated image to create a more visually plausible image that is similar to the real image. Similar to AFD-StackGAN, FD-StackGAN can handle various regions in the facial images with different structures and surrounding backgrounds. However, the model is trained on a synthetic dataset and is not tested on images with natural occlusions.
Likewise, Duan and Zhang [92] address the problem of deoccluding and recognizing face profiles with large-pose variations and occlusions through BoostGAN, which has a coarse-to-fine structure. In the coarse part, i.e., multi-occlusion frontal view generator, an encoder-decoder network is used for eliminating occlusion and producing multiple intermediate deoccluded faces. Subsequently, the coarse outputs are refined through a boosting network for photo-realistic and identity-preserved face generation. Consequently, the discriminator has a multi-input structure.
Since BoostGAN is a one-stage framework, it cannot handle de-occlusion and frontalization concurrently, which means that it loses the discriminative identity information. Furthermore, BoostGAN fails to employ the mask-guided noise prior information. To address these issues, Duan et al. [93] perform face frontalization and face completion simultaneously. They propose an end-to-end mask-guided two-stage GAN (TSGAN) framework. Each stage has its own generator and discriminator: the first stage contains the face deocclusion module, and the second contains the face frontalization module. Another module, named the mask-attention module (MAM), is deployed in both stages. The MAM encourages the face deocclusion module to concentrate more on the missing regions and fill them based on the masked image input. The recovered image is fed into the second stage to obtain the final frontal image. TSGAN is trained with defined occlusion types and specified sizes, and multiple natural occlusions are not considered.
Table 3 provides an outline of the above-mentioned works, summarizing the type of GAN, the objective function, the dataset, and the results of each work.

Attribute Classification
With the availability of surveillance cameras, the task of detecting and tracking an object through its visual appearance in surveillance footage has gained prominence. Furthermore, there are other characteristics of people that are essential to fully understand an observed scene. The task of recognizing people's attributes (age, sex, race, etc.) and the items they hold (backpacks, bags, phones, etc.) is called attribute classification.
However, when the person in question is occluded by another person, the attributes of the occluder may be incorrectly classified instead of those of the occludee. Furthermore, the quality of the images from the surveillance cameras is usually low. Therefore, Fabbri et al. [108] focus on the poor resolution and occlusion challenges in recognizing the attributes of people, such as gender, race, clothing, etc., in surveillance systems. The authors propose a model based on DCGAN [109] to improve the quality of images in order to overcome the mentioned problems. The model has three networks: one for attribute classification from the full-body images, and two that attempt to enhance the resolution and recover from occlusion. Eliminating the occlusion produces an image without noise and without the residual of other subjects that could result in misclassification. However, under severe occlusions, the reconstructed image still contains remnants of the occluder, and the model fails to keep the parts of the image that should stay unmodified.
Similarly, Fulgeri et al. [110] tackle the occlusion issue by implementing a combination of UNet and GAN architecture. The model requires as input the occluded person image and its corresponding attributes. The generator takes the input and restores the image. The output is then forwarded to three networks, ResNet-101 [77], VGG-16 [111], and the discriminator, to calculate the loss. The loss is backpropagated to update the weights of the generator. The goal of the model is to obtain a result image of a person that (a) is not occluded, (b) is similar at the pixel level to a person shape, and (c) contains visual features similar to those of the original image. The results show that the model can detect and remove occlusion without any additional information. However, the model fails to fully recover the pixels around the boundary of the body parts. The authors constrain the input images so that the occlusion covers no more than six-sevenths of the image height.

Miscellaneous Applications
In this section, we present the applications of GAN for amodal content completion in various categories of data.
Food: Papadopoulos et al. [112] present a compositional layer-based generative network called PizzaGAN that follows the steps of a recipe to make a pizza. The framework contains a pair of modules to add and remove all instances of each recipe component. A Cycle-GAN [6] is used to design each module. In the case of adding an element to the existing image, the module produces the appearance and the mask of the visible pixels in the new layer. Moreover, the removal module learns how to fill the holes that are left from the erased layer and generate the mask of the removed pixels. However, the authors do not provide any quantitative assessment of PizzaGAN.
Vehicles: Yan et al. [67] propose a two-part model to iteratively recover the amodal mask of a vehicle and the appearance of its hidden regions. The model is composed of a segmentation completion module and an appearance recovery module. The first module completes the segmentation mask of the vehicle's invisible region. To complete the content of the occluded region, the appearance recovery module has a generator with a two-path network structure. The first path accepts the input image, the recovered mask from the segmentation completion module, and the modal mask, and learns how to fill in the colors of the hidden pixels. The other path requires the recovered mask and the ground-truth complete mask and learns how to use the image context to inpaint the whole foreground vehicle. The two paths share parameters, which increases the ability of the generator. To enhance the quality of the recovered image, it is passed through the whole model several times. However, the performance of the model degrades beyond three iterations for real images if occlusions are not severe.

Humans: The process of matching the same person in images taken by multiple cameras is referred to as person re-identification (ReID). In surveillance systems where the purpose is to track and identify the individuals, ReID is essential. However, the stored images usually have low resolution and are blurry because they are from ordinary surveillance cameras [113]. Additionally, occlusion by other individuals and/or objects is most likely to occur since each camera has a different angle of view. Hence, some important features become difficult to recognize.
To tackle the challenge of person re-identification under occlusion, Tagore et al. [114] design a bi-network architecture with an Occlusion Handling GAN (OHGAN) module. An image with synthetically added occlusion is fed into the generator, which is based on the UNet architecture and produces an occlusion-free image by learning a non-linear projection mapping function between the input image and the output image. Afterward, the discriminator computes the metric difference between the generated image and the original one. The ablation studies for the reconstruction task illustrate that the quality of completion is good for 10-20% occlusion and average for 30-40% occlusion. However, the quality of reconstruction degrades for occlusions higher than 50%.
On the other hand, Zhang et al. [66] attempt to complete the mask and the appearance of an occluded human through a two-stage network. First, the amodal completion stage predicts the amodal mask of the occluded person. Afterward, the content recovery network completes the RGB appearance of the invisible area. The latter uses a UNet architecture in the generator, with local and global discriminators to ensure that the output image is consistent with the global semantics while enhancing the clarity and contrast of the local regions. The generator adds a Visible Guided Attention (VGA) module to the skip connections. The VGA module computes a relational feature map that guides the completion of the low-level features by concatenating the high-level features with the next-level features. The relational feature map represents the relation between the pixels inside and outside the occluded area. The process of extracting feature maps is similar to the self-attention mechanism in SAGAN by Zhang et al. [54]. Although incorporating VGA leads to a more accurate recovery of the content and texture, the model does not perform as well on real images as it does on synthetic images.

Training Data
Supervised learning frameworks require annotated ground-truth data to train a model. These data can come from a manually annotated dataset, from synthetic occluded data rendered from 3D computer-generated images, or from superimposing a part of an object/image on another object. For example, Ehsani et al. [76] train their model (SeGAN) on a photorealistic synthetic dataset, and Zhan et al. [75] apply a self-supervised approach to generate annotated training data. However, a model trained with synthetic data may fail when it is tested on real-world data, and human-labeled data are costly, time-consuming, and susceptible to subjective judgments.
In this section, we discuss how GAN is implemented to generate training data for several categories.
Generic objects: It is nearly impossible to cover all the probable occlusions, and the likelihood of appearance of some occlusion cases is rather small. Therefore, Wang et al. [115] aim to utilize the data to improve the performance of the object detection in the case of occlusions. They utilize an adversarial network to generate hard examples with occlusions, and use them to train a Fast-RCNN [116]. Consequently, the detector becomes invariant to occlusions and deformations. Their model contains an Adversarial Spatial Dropout Network (ASDN), which takes as input features from an image patch and predicts a dropout mask that is used to create occlusion such that it would be difficult for Fast-RCNN to classify.
Likewise, Han et al. [117] apply an adversarial network to produce occluded adversary samples to train an object detector. The model, named Feature Fusion and Adversary Networks (FFAN), is based on Faster RCNN [118] and consists of a feature fusion network and an adversary occlusion network. The feature fusion module produces a feature map of high resolution and high semantic information to detect small objects more effectively, whereas the adversary occlusion module produces occlusion on the feature map of the object and thus outputs an adversary training sample that is hard for the detector to discriminate. Meanwhile, the detector becomes better at classifying the generated occluded adversary samples through self-learning. Over time, the detector and the adversary occlusion network learn and compete with each other to enhance the performance of the model.
The occlusions produced by the adversary networks in [115,117] may lead to over-generalization, because the occluded samples are similar to instances of other classes. For example, the occluded wheels of a bicycle result in misclassifying a wheelchair as a bike.
Humans: Zhao et al. [119] augment the input data to produce easy-to-hard occluded samples with different sizes and positions of the occlusion mask to increase the variation of occlusion patterns. They address the issue of ReID under occlusion through an Incremental Generative Occlusion Adversarial Suppression (IGOAS) framework. The network contains two modules, an incremental generative occlusion (IGO) block, and a global adversarial suppression (G&A) module. IGO takes the input data through augmentation and generates easy occluded samples. Then, it progressively enlarges the size of the occlusion mask with the number of training iterations. Thus, the model becomes more robust against occlusion as it learns harder occlusion incrementally rather than hardest ones directly. On the other hand, G&A consists of a global branch which extracts global features of the input data, and an adversarial suppression branch that weakens the response of the occluded region to zero and strengthens the response to non-occluded areas.
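The easy-to-hard strategy described above, where the occlusion grows with the number of training iterations, can be sketched as a simple curriculum. The linear schedule, the 10% to 50% area bounds, the zero fill value, and the names `occlusion_ratio` and `occlude` are all assumptions made for illustration; they are not taken from [119].

```python
import numpy as np

def occlusion_ratio(iteration, total_iterations, start=0.1, end=0.5):
    """Occluded fraction of the image area, grown linearly over training."""
    t = min(max(iteration / total_iterations, 0.0), 1.0)
    return start + t * (end - start)

def occlude(image, ratio, rng):
    """Zero out a random rectangle covering roughly `ratio` of the image area."""
    h, w = image.shape[:2]
    side = np.sqrt(ratio)                      # same scale factor for both axes
    oh = max(1, int(round(h * side)))
    ow = max(1, int(round(w * side)))
    top = rng.integers(0, h - oh + 1)          # upper bound is exclusive
    left = rng.integers(0, w - ow + 1)
    out = image.copy()
    out[top:top + oh, left:left + ow] = 0
    return out

# Toy example (assumed data): the same image occluded early and late in training.
rng = np.random.default_rng(0)
image = np.ones((10, 10))
early = occlude(image, occlusion_ratio(0, 100, start=0.04, end=0.25), rng)
late = occlude(image, occlusion_ratio(100, 100, start=0.04, end=0.25), rng)
```

In a training loop, `occlusion_ratio` would be evaluated at the current iteration so that early batches contain small occlusions and later batches contain larger ones.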
Furthermore, to increase the number of samples per identity for person ReID, Wu et al. [120] use a GAN to synthesize labeled occluded data. Specifically, the authors impose rectangular blocks on the original person images to create random occlusions, which the model then tries to complete. The completed images, which are similar but not identical to the original input, are labeled with the same annotation as the corresponding raw image. Similarly, Zhang et al. [113] follow the same strategy to expand the original training set, except that an additional noise channel is applied to the generated data to adjust the label further. Both approaches in [113,120] work with rectangular masks, but in real-world examples occlusions appear in free-form shapes.
Face images: Cong and Zhou [106] propose an improved GAN to generate occluded face images. The model is based on DCGAN with an added S-coder. The purpose of the S-coder is to force the generator to produce multi-class target images. The network is further optimized through the Wasserstein distance and the cycle consistency loss from CycleGAN. However, only sunglasses and facial masks are considered as occlusive elements.
Figure 4 outlines the discussed approaches for tackling the issues in overcoming occlusion using GAN. Table 4 summarizes the GAN model, the loss function, and the datasets that were used in the works discussed in this section (except for the face completion works); it also shows the reported results for the tasks where GAN was applied. For amodal segmentation, the implemented architectures are a discriminator with a two hourglass generator [60], a coarse-to-fine architecture with contextual attention [63] or multiple discriminators [67], and a generator with a priori knowledge [66]. For order recovery, GAN is designed as a generator with a single discriminator [71,73] or multiple discriminators [69,70]. To perform amodal content completion for facial images, the architectures include a single generator and discriminator [79][80][81], multiple discriminators [82][83][84][85][86][87], multiple generators [88], multiple generators and discriminators [89,90], or a coarse-to-fine architecture [91][92][93]. Generic object completion is carried out through a coarse-to-fine architecture [63], multiple discriminators with contextual attention [78], or partial convolution and CGAN [75,76]. Human completion for attribute classification is utilized in [108,110]. Other works use GAN to complete the images of food [112], vehicles [67], and humans [66,114]. GAN is also used to generate training data of generic objects [115,117], humans [113,119,120], and face images [106].

Loss Functions
In GAN, the generator G and the discriminator D play against each other in a two-player minimax game until they reach Nash equilibrium through a gradient-based optimization method. The gradient of the loss value indicates the learning performance of the network. The loss value is calculated via a loss (objective) function. In fact, defining a loss function is one of the fundamental elements of designing GAN. Consequently, numerous objective functions have been proposed to stabilize and regularize GAN. The following losses are the most common ones used in training GAN for amodal completion.

1.
Adversarial Loss: The loss function used in training GAN is known as an adversarial loss. It measures the distance between the distribution of the generated sample and the real sample. Each of G and D have their dedicated loss function which together form the adversarial loss, as shown in Equation (1). However, G is trained as the term that reflects the distribution of the generated data (E z∼p z (z) [log(1 − D(G(z)))]).
Extensions to the original loss function are the conditional loss and the Wasserstein loss defined in CGAN and WGAN, respectively.

2. Content Loss: In image generation, content loss [138] measures the difference between the content representations of the real and the generated images, to make them more similar in terms of perceptual content. If p and x are the original and the generated images, and $P^l$ and $X^l$ are their respective representations in layer l, the content loss is calculated as
$$L_{content}(p, x, l) = \frac{1}{2} \sum_{i,j} \left(X^l_{ij} - P^l_{ij}\right)^2 \quad (2)$$

3. Reconstruction Loss: The key idea behind the reconstruction loss proposed by Li et al. [139] is to benefit from the visual features learned by D from the training data. The features extracted from the real data by D are fed to G to regenerate the real data. By adding the reconstruction loss to the GAN's loss function, G is encouraged to reconstruct from the features of D, which brings G closer to the configurations of the real data. The reconstruction loss compares the real data x with its regeneration $G_\theta(D_\phi^F(x))$, where $D_\phi^F$ is the part of the discriminator that encodes the data to features, and $G_\theta$ decodes the features back to the training data.

4. Style Loss: The style loss, originally designed for image style transfer by Gatys et al. [138], is defined to ensure that the style representation of the generated image matches that of the input style image. It depends on the correlations between the feature maps, given by the Gram matrix ($G^l$). Let a and x be the original image and the generated image, respectively, and $A^l$ and $G^l$ their corresponding style representations in layer l. The style loss is computed from the element-wise mean squared difference between $A^l$ and $G^l$:
$$E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left(G^l_{ij} - A^l_{ij}\right)^2 \quad (4)$$
$$L_{style}(a, x) = \sum_{l} w_l E_l \quad (5)$$
where $w_l$ is the weighting factor of each layer, and $N_l$ and $M_l$ represent the number and the size of the feature maps in layer l, respectively.

5. L1 and L2 Loss: The L1 loss function is the absolute difference between the ground-truth and the generated image. The L2 loss, on the other hand, is the squared difference between the actual and the generated data. When used alone, these loss functions lead to blurred results [140]. However, when combined with other loss functions, they can improve the quality of the generated images, especially the L1 loss. The generator is encouraged not only to fool the discriminator but also to stay close to the real data in the L1 or L2 sense. Although these losses cannot capture high-frequency details, they accurately capture low frequencies. The L1 loss enforces correctness in low-frequency features; hence, it results in less blurred images compared to L2 [8]. Both losses are defined in Equations (6) and (7), where x, y, and z are the ground-truth image, the generated image, and the random noise, respectively.

6. Perceptual Loss: The perceptual loss measures the high-level perceptual and semantic differences between the real and the fake images. Several works [141,142] introduce perceptual loss as a combination of the content loss (or feature reconstruction loss) and the style loss. However, Liu et al. [62] simply compute the L1 distance between the real and the completed images. Others incorporate more similarity metrics into it [140].

7. BCE Loss: The binary cross-entropy (BCE) loss measures how close the predicted probability is to the true label. Its value increases as the predicted probability deviates from the real label. The BCE is defined as
$$L_{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p(y_i) + (1 - y_i) \log\left(1 - p(y_i)\right) \right] \quad (8)$$
where $y_i$ is the label of sample i, $p(y_i)$ is the predicted probability that sample i is real, and N is the number of samples; $y_i = 0$ and $y_i = 1$ represent fake and real samples, respectively. BCE is used in training the discriminator in the amodal segmentation task [76], and in training the generator [110].

8. Hinge Loss: In GAN, the Hinge loss is used to help the convergence to a Nash equilibrium. Proposed by Lim and Ye [143], the objective function for G is
$$L_G = -\mathbb{E}_{z}[D(z)] \quad (9)$$
and for D is
$$L_D = \mathbb{E}_{x}[\max(0, 1 - D(x))] + \mathbb{E}_{z}[\max(0, 1 + D(z))] \quad (10)$$
where x and z are the ground-truth and the generated images, respectively.
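To make the pixel-wise and adversarial terms listed above more concrete, the following is a minimal, dependency-free sketch in Python. It is not taken from any of the reviewed works: images are assumed to be flattened into plain lists of pixel values, discriminator outputs are assumed to be plain lists of scores or probabilities, and all function names are hypothetical. Averaging over elements (rather than summing) is also an assumption made here for readability.

```python
import math


def l1_loss(real, fake):
    """Mean absolute difference between two equally sized pixel lists."""
    return sum(abs(r - f) for r, f in zip(real, fake)) / len(real)


def l2_loss(real, fake):
    """Mean squared difference between two equally sized pixel lists."""
    return sum((r - f) ** 2 for r, f in zip(real, fake)) / len(real)


def bce_loss(labels, probs, eps=1e-12):
    """Binary cross-entropy; labels are 1 (real) or 0 (fake),
    probs are predicted probabilities of being real."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


def hinge_d_loss(real_scores, fake_scores):
    """Hinge loss for the discriminator: real scores are pushed above +1,
    scores of generated samples are pushed below -1."""
    real_term = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake_term = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def hinge_g_loss(fake_scores):
    """Hinge loss for the generator: raise the scores of generated samples."""
    return -sum(fake_scores) / len(fake_scores)
```

Because the L2 term squares each residual, a single large pixel error dominates it much more than it dominates the L1 term, which is one intuition for why the two losses behave differently when combined with an adversarial term.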
As can be seen from Tables 3 and 4, many of the previously mentioned loss functions are combined with others to train a GAN model. Adversarial loss is the base objective function for training the two networks of the GAN. However, with the original GAN's adversarial loss function, the model may not converge. Therefore, the Hinge loss is often implemented as an alternative objective function. In some works, global and local adversarial losses are used to train global and local discriminators to ensure that the generated data is semantically and locally coherent. In addition to this, L1 or L2 losses are frequently utilized to capture low-frequency features, and hence improve the quality of the generated images. Furthermore, the reconstruction loss is employed to encourage the generator to maintain the contents of the original input image. On the other hand, perceptual loss encourages the model to capture patch-level information when completing a missing patch in an object/image. Finally, to emphasize the style match between the generated image and the input image, style loss is implemented.
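As a rough illustration of how several loss terms can be combined into one generator objective, the sketch below adds a hinge-style adversarial term, an L1 term, and a single-layer Gram-matrix style term as a weighted sum. The choice of terms, the weight values, and all function names are hypothetical and are not taken from any specific reviewed work. Feature maps are assumed to be given as N flat lists of M activations each.

```python
def gram_matrix(features):
    """Inner products between every pair of feature maps (N x N matrix)."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features]
            for fi in features]


def style_loss(gen_features, ref_features):
    """Single-layer style loss: squared difference of Gram matrices,
    normalised by 4 * N^2 * M^2 (N maps, M activations per map)."""
    n = len(gen_features)
    m = len(gen_features[0])
    g = gram_matrix(gen_features)
    a = gram_matrix(ref_features)
    squared = sum((g[i][j] - a[i][j]) ** 2 for i in range(n) for j in range(n))
    return squared / (4.0 * n ** 2 * m ** 2)


def l1_loss(real, fake):
    """Mean absolute difference between two equally sized pixel lists."""
    return sum(abs(r - f) for r, f in zip(real, fake)) / len(real)


def generator_objective(fake_scores, real_pixels, fake_pixels,
                        gen_features, ref_features, weights):
    """Weighted sum of adversarial, L1, and style terms.
    `weights` is a dict with the (hypothetical) keys 'adv', 'l1', 'style'."""
    adversarial = -sum(fake_scores) / len(fake_scores)  # hinge generator term
    pixel = l1_loss(real_pixels, fake_pixels)
    style = style_loss(gen_features, ref_features)
    return (weights["adv"] * adversarial
            + weights["l1"] * pixel
            + weights["style"] * style)
```

In practice the relative weights are hyperparameters. Changing them shifts the balance between fooling the discriminator and staying close to the ground truth, so they are usually tuned per task and dataset.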
The choice of the objective functions is an essential decision in designing a model. In amodal completion and inpainting, designing a loss function is still an active area of research. The ablation studies performed by the reviewed works show that there is no single optimal objective function: for different tasks and data, a different set of loss terms produces the best results. In addition, using a complex loss function may lead to problems of instability, vanishing gradients, and mode collapse.

Open Challenges and Future Directions
Despite the significant progress of the research in GAN and amodal completion in the last decade, there remain a number of problems that can be considered as future directions.

1. Amodal training data: Up until now, there has been no fully annotated generic amodal dataset with sufficient ground-truth labels for the three sub-tasks in amodal completion. Most of the existing datasets are specific to a particular application or task. This makes not only training the models more difficult, but also verifying their learning capability. In many cases, there is insufficient labeled amodal validation data to establish the accuracy of the model. We present the challenges related to each sub-task in amodal completion. For amodal segmentation, the current datasets do not contain sufficient occlusion cases between similar objects. Hence, the model cannot tell where the boundary of one object ends and the other one begins. The existing real (manually annotated) amodal datasets have no ground-truth appearance for the occluded region. This makes training and validating the model for amodal content completion more challenging.
As for the case of order recovery, some occlusion situations are very rare in the existing datasets. Moreover, it is impossible to cover all probable cases of occlusion in real datasets. Nevertheless, in the future, the current datasets can be extended through generated occlusion to include more of those infrequent cases with varying degrees of occlusion.

2. Evaluation metrics: There are several quantitative and qualitative evaluation measures for GAN [59]. However, as can be noticed from the results, there is no standard, widely agreed evaluation metric for assessing the performance of GAN when it generates the occluded content. Many existing works depend on human preference judgement, which can be biased and subjective. Therefore, designing a consensus evaluation metric is of utmost importance.

3. Reference data: Existing GAN models fail to generate occluded content accurately if the hidden area is large, particularly when the occluded object is non-symmetric, such as the face or the human body. The visible region of the object may not hold sufficient relevant features to guide a visually plausible regeneration. As a next step, reference images can be used alongside the input image to guide the completion more effectively.
In addition to the above-mentioned problems, the challenges in the stability and convergence of GAN remain open issues [28].

Discussion
Current computational models approach the human capability of perceiving the visible world when performing visual tasks such as recognition, detection, and segmentation. However, our environment is complex and dynamic. Most of the objects we perceive are incomplete and fragmented. Therefore, the existing models that are designed and trained with fully visible instances do not perform well when tested on real-world scenes. Hence, overcoming occlusion is essential for improving the performance of available models. Amodal completion tasks address the occluded patches of an image to infer the occlusion relation between objects (i.e., order recovery), predict the full shape of the objects (i.e., amodal segmentation), and complete the RGB appearance of the missing pixels (i.e., amodal content completion). These tasks are usually interleaved and depend on each other. For example, amodal segmentation can benefit order recovery [144], and it is crucial for amodal content completion [76]. On the other hand, order recovery can guide amodal segmentation [75].
Although GAN is notorious for its stability issues and is difficult to train, it is a popular approach for tasks that require generative capability. In handling occlusion, the initially incomplete representation needs to be extended to a complete representation with the missing region filled in. Therefore, GAN is a frequently chosen architecture for the processes/sub-processes involved in amodal completion. However, depending on the nature of the problem, the applicability of GAN varies. For example, in amodal appearance reconstruction, GAN is well suited and is reported to produce superior results in comparison to other methods. By contrast, in amodal segmentation and order recovery tasks, GAN is less commonly used. Nevertheless, to take advantage of the potential of GAN, it can be combined with other architectures and learning strategies to tackle those tasks too.
In order to help GAN learn implicit features from the visible regions of the image, various methods are used, which can be summarized as follows:
• Architecture: While the original GAN consists of a single generator and discriminator, several works utilize multiple generators and discriminators. The implementation of local and global discriminators is especially common, because it enhances the quality of the generated data. The generator is encouraged to concentrate on both the global contextual and the local features, and to produce images that are closer to the distribution of the real data. In addition to this, an initial-to-refined (also called coarse-to-fine) architecture is implemented in many models. The initial stage produces a coarse output from the input image, which is then further refined in the refinement step.
• Objective function: To improve the quality of the generated output and stabilize the training of the GAN, a combination of loss terms is used. While adversarial loss and Hinge loss are used in training the two networks in the GAN, other objective functions encourage the model to produce an image that is consistent with the ground-truth image.
• Input: Under severe occlusion, the GAN may fail to produce a visually pleasing output when depending solely on the visible region. Therefore, providing additional input information guides GAN in producing better results. In amodal shape and content completion, synthetic instances similar to the occluded object are useful, because they can be used as a reference by the model. A priori knowledge is also beneficial, as it can either be manually encoded (e.g., utilizing various human poses for human deocclusion) or transferred from a pre-trained model (e.g., using a pre-trained face recognition model in face deocclusion). In addition to these, employing the amodal mask and the category of the occluded object in the content completion task restricts the GAN model to completing the object in question. For producing the amodal mask, a modal mask is needed as an input. If this input is not available, most existing works depend on a pre-trained segmentation model to predict the visible segmentation mask.
• Feature extraction: The pixels in the visible region of an image contain essential information for various tasks; hence, they are considered valid pixels. In contrast, the invisible pixels are invalid; hence, they should not be included in the feature extraction/encoding process. However, the vanilla convolution process cannot differentiate between valid and invalid pixels, which generates images with visual artifacts and color discrepancies. Therefore, partial convolution and a soft gating mechanism are implemented to force the generator to focus only on valid pixels and to eliminate or minimize the effect of the invalid ones. On the other hand, dilated convolution layers can replace the vanilla convolution layers to borrow information from relevant, spatially distant pixels. Additionally, contextual attention layers and attention mechanisms are added to the networks of the GAN to leverage the information from the image context and capture global dependencies.
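The idea of restricting feature extraction to valid pixels can be sketched with a single-channel partial convolution. The version below follows the commonly cited formulation in which the masked response is rescaled by the ratio of window size to the number of valid pixels, and the mask is updated so that a location becomes valid once its window has seen at least one valid pixel. Whether a given reviewed work uses exactly this variant is an assumption; the function name, the stride of 1, and the absence of padding are simplifications chosen for illustration.

```python
def partial_conv2d(image, mask, kernel, bias=0.0):
    """Single-channel partial convolution (stride 1, no padding).

    image, mask: 2D lists of equal shape; mask is 1 for valid (visible)
    pixels and 0 for invalid (occluded) pixels. kernel: k x k 2D list.
    Returns (output, updated_mask).
    """
    k = len(kernel)
    height, width = len(image), len(image[0])
    window_size = k * k
    output, new_mask = [], []
    for i in range(height - k + 1):
        out_row, mask_row = [], []
        for j in range(width - k + 1):
            valid = 0
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    m = mask[i + di][j + dj]
                    valid += m
                    # Invalid pixels are multiplied by 0 and do not contribute.
                    acc += kernel[di][dj] * image[i + di][j + dj] * m
            if valid > 0:
                # Rescale to compensate for the missing (invalid) pixels.
                out_row.append(acc * window_size / valid + bias)
                mask_row.append(1)
            else:
                # No valid pixel in the window: output 0, location stays invalid.
                out_row.append(0.0)
                mask_row.append(0)
        output.append(out_row)
        new_mask.append(mask_row)
    return output, new_mask
```

With a 3 × 3 averaging kernel on a constant image, the rescaling returns the same constant even when part of the window is masked out, which is the behaviour that keeps hole pixels from darkening or tinting the features near the occlusion boundary.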
Among the various architectures of GAN, three types are most commonly used in the works reviewed in this article, namely CGAN, WGAN-GP, and PatchGAN. CGAN is mostly applied in amodal content completion tasks, because the GAN is encouraged to complete an object of a specific class. WGAN-GP stabilizes the training of GAN with an EM (Wasserstein) distance objective function and a gradient penalty, which replaces the weight clipping of the original WGAN. Therefore, it is a preferred architecture to ensure GAN convergence. On the other hand, PatchGAN is used in designing the discriminator, as it attempts to classify patches of the generated image as real or fake. Consequently, the image is penalized for style consistency between pixels that are spatially more than a patch diameter away from each other.
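To illustrate what "classifying patches" means in practice, the short sketch below computes the receptive field of one output unit of a convolutional discriminator and the size of its output score map. The layer configuration used in the example (five 4 × 4 convolutions with strides 2, 2, 2, 1, 1 and padding 1) is an assumption corresponding to the widely used 70 × 70 PatchGAN configuration; the reviewed works may use different depths and strides, and the function names are hypothetical.

```python
def receptive_field(layers):
    """Side length of the input patch seen by one output unit.

    layers: list of (kernel_size, stride) tuples ordered from input to output.
    """
    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


def output_size(input_size, layers, padding=1):
    """Side length of the score map produced for a square input image."""
    size = input_size
    for kernel, stride in layers:
        size = (size + 2 * padding - kernel) // stride + 1
    return size


# Assumed layer configuration: (kernel_size, stride) for each convolution.
PATCHGAN_70 = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]

if __name__ == "__main__":
    print(receptive_field(PATCHGAN_70))   # patch side length in pixels
    print(output_size(256, PATCHGAN_70))  # side length of the score map
```

Under these assumptions, each score in the output map judges one 70 × 70 patch, and a 256 × 256 input yields a 30 × 30 map of real/fake scores that are averaged into the discriminator's decision.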
Finally, handling occlusion is fundamental in several computer vision tasks. For example, completing an occluded facial image helps in better recognizing the face and predicting the identity of the person. Similarly, inferring the full shape of pedestrians and vehicles, as well as the occlusion relationship between them, can lead to safer autonomous driving. Furthermore, in surveillance cameras, amodal completion helps in target tracking and security applications.

Conclusions
GANs have been regarded as one of the most interesting ideas in machine learning since their invention. Due to their generative capability, they are extending the abilities of artificial intelligence systems: GAN-based models are creative rather than mere learners. In the challenging field of amodal completion, GAN has had a significant impact, especially in generating the appearance of a missing region. This brings existing vision systems closer to the human capability of predicting the occluded area.
To help researchers in the field, in this survey we have reviewed the available works in the literature wherein a GAN is applied to accomplish tasks of amodal completion and to resolve the problems that arise when addressing occlusion. We discussed the architecture of each model along with its strengths and limitations in detail. We then summarized the loss function and the dataset used in each work and presented their results. Next, we discussed the most common types of objective functions implemented in training the GAN models for occlusion handling. Finally, we provided a discussion of the key findings of our survey article.
However, after reviewing the current progress in overcoming occlusion using a GAN, we detected several key issues that remain open challenges in the research on addressing occlusion. These issues pave the way for future research directions. By addressing them, the field will progress significantly.

Data Availability Statement: The data will be made available upon request.