With the rapid advancement of deep learning algorithms, the tasks of analyzing and understanding digital images for many computer vision applications have drawn increasing attention in recent years, owing to such algorithms' extraordinary performance and the availability of large amounts of data. Such algorithms directly process raw data (e.g., an RGB image) and obviate the need for domain experts or handcrafted features [1]. The powerful ability of deep feature learning to automatically utilize complex and high-level feature representations has significantly advanced the performance of state-of-the-art methods across computer vision applications, such as object detection [4], medical imaging [5], image segmentation [7], image classification [8], and face detection [9]. The underlying structure and distinctive (complex) features are both discovered via deep learning-based methods, which can be further classified into discriminative feature-learning algorithms and generative feature-learning algorithms. Discriminative models focus on the classification-learning process by learning the conditional probability p(y|x) to map input x to class label y. One of the most popular methods used for image feature learning utilizes convolutional neural networks (CNNs) for feature extraction and image classification; examples include LeNet [8], AlexNet [10], VGGNet [11], and ResNet [12], all of which are supervised learning algorithms. On the other hand, generative models focus on the data distribution to discover the underlying features from large amounts of data in an unsupervised setting. Such models are able to generate new samples by learning an estimate of the joint probability distribution p(x,y) and predicting y [14] in contexts such as image super-resolution [15], text-to-image generation [17], and image-to-image translation [19].
2. Deep Generative Models
Generative models can be categorized into traditional generative models based on machine learning algorithms and deep generative models based on deep learning algorithms [21]. Traditional generative models use various forms of probability density function to approximate the distribution [22] and cannot perform well on complex distributions. Such models include infinite Gaussian mixture models (GMM) [23], the hidden naive Bayes model (NBM) [24], and hidden Markov models (HMM) [25]. Deep generative models utilize techniques such as stochastic backpropagation, deep neural networks, and approximate Bayesian inference in order to generate new samples based on variational distributions from large-scale datasets [26]; examples of such models include the deep Boltzmann machine (DBM) [29], deep belief networks (DBN) [30], the variational autoencoder (VAE) [31], and generative adversarial networks (GAN) [32]. The most dominant and efficient deep generative models of recent years have been the VAE and the GAN. A variational autoencoder learns the underlying probability distribution and generates new samples based on Bayesian inference by maximizing a lower bound on the data's log-likelihood. In contrast, generative adversarial networks learn data distributions through an adversarial training process based on game theory instead of maximizing the likelihood. The GAN approach offers several advantages over VAE-based models: (1) the ability to learn and model complex data distributions and (2) the ability to efficiently generate sharp and realistic samples [21].
2.1. Generative Adversarial Networks
The generative adversarial network, proposed by Goodfellow et al. in 2014, has been one of the most significant recent developments in the domain of unsupervised deep generative models [32]. Figure 1 illustrates the architecture of a typical GAN. A GAN is composed of two competing neural networks inspired by the two-player minimax game: a generative network, called the generator and denoted G, and a discriminative network, called the discriminator and denoted D. The generator tries to generate realistic samples to fool the discriminator, while the discriminator tries to distinguish real samples from fake samples. The generator and the discriminator can each be any model, as long as the generator is able to learn the distribution of the training data and the discriminator is able to extract features with which to classify its input. For instance, the generator can be a de-convolutional neural network, and the discriminator can be a convolutional neural network or a recurrent neural network [21]. Thus, GANs can be used to model multi-dimensional data distributions, e.g., images. GANs have made promising contributions to a variety of difficult generative tasks [35], e.g., text-to-photo translation [18], image generation [36], image composition [37], and image-to-image translation [38]. Although GANs are a powerful type of deep generative model, their training suffers from several issues, such as mode collapse and training instability [39], as discussed in Section 8.1.
2.2. Image-To-Image Translation
The idea of image-to-image translation goes back to Hertzmann et al.'s image analogies [40], a non-parametric model that uses a pair of images to achieve image transformation [41]. Many problems in computer vision and computer graphics can be regarded as instances of the image-to-image translation problem. The task of image-to-image translation is to learn a mapping from a given image (X) to a specific target image (Y), e.g., mapping grayscale images to RGB images. Learning the mapping from one visual representation to another requires an understanding of the underlying features that are shared between the representations; such features are either domain-independent or domain-specific. Domain-independent features represent the underlying spatial structure and should be preserved during translation (e.g., the content should be preserved when a natural image is translated to Van Gogh's style), while domain-specific features are related to the rendering of that structure and may be changed during translation (e.g., the style itself when translating an image to Van Gogh's style) [42]. However, learning the mapping between two or more domains is challenging for two reasons. First, collecting pairs of images may be difficult, or the relevant paired images may not exist. Second, the translation may be multimodal, whereby one input image maps to multiple outputs. In recent years, GANs and their variants have been used to provide state-of-the-art solutions to image-to-image translation problems. This article classifies the proposed solutions according to two image-to-image translation settings, supervised and unsupervised, as explained in Section 6.
2.3. Notation and Concepts
In this section, the notation, abbreviations, and concepts used throughout this survey are explained in order to facilitate understanding of the topic. Lists of commonly used abbreviations and notations are shown in Table 1 and Table A1, respectively. Concepts related to both generative adversarial networks and image-to-image translation are explained in what follows [42]:
Attribute: a meaningful feature, such as hair color, gender, size or age.
Domain: a set of images sharing similar attributes.
Unimodal image-to-image translation: a task in which the goal is to learn a one-to-one mapping. Given an input image in the source domain, the model learns to produce a deterministic output.
Multimodal image-to-image translation: aims to learn a one-to-many mapping between the source domain and the target domain with the goal of enabling the model to generate many diverse outputs.
Domain-independent features: those pertaining to the underlying spatial structure, known as the content code.
Domain-specific features: those pertaining to the rendering of the structure, known as the style code.
Image generation: a process of directly generating an image from a random noise vector.
Image translation: a process of generating an image from an existing image and modifying it to have specific attributes.
Paired image-to-image translation: source images X and the corresponding images Y are provided as a training set of aligned image pairs, as shown in Figure 2.
Unpaired image-to-image translation: source images X and target images Y come from two different domains, with no correspondence between them provided, as shown in Figure 2.
2.4. Motivation and Contribution
With the development of deep learning, numerous approaches have been proposed to improve the quality of image synthesis. Most recently, image synthesis with deep generative adversarial networks has attracted many researchers' attention due to such networks' ability to capture complex probability distributions, compared to traditional generative models. Several research papers have proposed performing image synthesis using adversarial networks. Several articles [21] have recently surveyed generative adversarial networks and GAN variants, including image synthesis as an application. Other surveys [50] covered image synthesis with GANs and partially discussed image-to-image translation. The effort closest to this survey is that of [50], whose authors discussed image synthesis, including a few image-to-image translation methods.
To the best of our knowledge, image-to-image translation with GANs has not been comprehensively reviewed previously. Thus, this article provides a comprehensive overview of image-to-image translation using GAN algorithms and their variants. A general introduction to generative adversarial networks is given, and GAN variants, structures, and objective functions are described. Image-to-image translation approaches are discussed in detail, including state-of-the-art algorithms, theory, applications, and open challenges. The image-to-image translation approaches are classified into supervised and unsupervised types. The contributions of this article can be summarized as follows:
This review article provides a comprehensive review of general generative adversarial network algorithms, objective functions, and structures.
Image-to-image translation approaches are classified into supervised and unsupervised types, with in-depth explanations.
This review article also summarizes the benchmark datasets, evaluation metrics, and image-to-image translation applications.
Limitations, open challenges, and directions for future research are discussed, illustrated, and investigated in depth.
This paper's structure can be summarized as follows. In Section 3, GAN variants and their architectures are described. GAN objective functions and GAN structures are discussed in Section 4 and Section 5, respectively. Section 6 introduces and discusses both supervised and unsupervised image-to-image translation techniques. In Section 7, image-to-image translation applications, including datasets, practical applications, and evaluation metrics, are illustrated and summarized. Discussion and directions for future research utilizing reinforcement learning and three-dimensional (3D) models are summarized in Section 8. The last section concludes this review paper.
4. GAN Objective Functions
The goal of the objective function used in GANs and their variants is to measure and minimize the distance between the real sample distribution and the generated sample distribution. Although GANs have been successfully used in many tasks, many problems are caused by the objective function, such as vanishing gradients and mode collapse. These problems cannot be solved by modifying the GAN's structure alone; reformulating the objective function has been proven to alleviate them. Therefore, many objective functions have been proposed and categorized in order to improve the quality and diversity of generated samples and avoid the limitations of the original GAN and its variants, as shown in Table 2. Various objective functions are explained and discussed in what follows.
Minimax GAN. The original GAN consists of a generator G and a discriminator D competing against each other. GANs utilize the sigmoid cross-entropy as the loss function for discriminator D and use the minimax loss and a non-saturated loss for generator G. D is a differentiable function whose input and parameters are $x$ and $\theta_d$, respectively, and which outputs a single scalar $D(x)$; $D(x)$ represents the probability that input $x$ comes from the real data distribution $p_{data}(x)$ rather than the generated distribution $p_g(x)$. Generator G is a differentiable function whose input and parameters are $z$ and $\theta_g$, respectively. The overall minimax objective is
$$\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))].$$
The discriminator and the generator each have separate loss functions, as shown in Table 3. Both update their parameters to achieve the Nash equilibrium, whereby discriminator D aims to maximize the probability of assigning the correct label to both training samples and samples generated by G [32], while generator G aims to minimize $\log(1 - D(G(z)))$ to deceive D. They are both trained simultaneously, inspired by the min–max game.
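To make the alternating optimization concrete, the following is a minimal PyTorch sketch of one training step (the framework choice is ours, and it assumes D ends in a sigmoid so that its output is a probability); it pairs the discriminator's sigmoid cross-entropy loss with the non-saturated generator loss described above.

```python
import torch
import torch.nn.functional as F

def gan_training_step(G, D, opt_g, opt_d, real, z_dim=100):
    """One alternating GAN update. D maximizes log D(x) + log(1 - D(G(z)));
    G uses the non-saturated loss, maximizing log D(G(z)) rather than
    minimizing log(1 - D(G(z))), which gives stronger early gradients."""
    z = torch.randn(real.size(0), z_dim, device=real.device)
    fake = G(z)

    # Discriminator update: real samples labeled 1, generated samples 0.
    d_real, d_fake = D(real), D(fake.detach())  # detach: no gradient into G here
    loss_d = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator update (non-saturated loss): try to make D output 1 on fakes.
    d_fake = D(fake)
    loss_g = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```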
Wasserstein GAN. Arjovsky et al. [62] propose WGAN, which uses what is sometimes called the Earth Mover's (EM) distance, to overcome GAN model instability and mode collapse. WGAN uses the Wasserstein distance instead of the Jensen–Shannon divergence to measure the similarity between the real data distribution and the generated data distribution. The Wasserstein distance can measure the distance between the probability distributions $p_{data}(x)$ and $p_g(x)$ even if they do not overlap:
$$W(p_{data}, p_g) = \inf_{\gamma \in \Pi(p_{data},\, p_g)} \mathbb{E}_{(x,y) \sim \gamma}\big[\|x - y\|\big],$$
where $\Pi(p_{data}, p_g)$ denotes the set of all joint distributions between the real distribution and the generated distribution [34]. WGAN applies weight clipping to enforce the Lipschitz constraint on the discriminator. However, WGAN may suffer from vanishing or exploding gradients due to the use of weight clipping. The discriminator in WGAN performs a regression task, approximating the Wasserstein distance, instead of acting as a binary classifier [49]. It should be noted that WGAN does not change the GAN structure, but instead enhances parameter learning and model optimization [22]. WGAN-GP [69] is a later proposal for stabilizing the training of a GAN by utilizing gradient penalty regularization [70].
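As an illustration of the gradient penalty idea in WGAN-GP, here is a hedged PyTorch sketch (the function name and the default λ = 10 are illustrative choices); it penalizes deviations of the critic's gradient norm from 1 at points interpolated between real and generated samples, a soft version of the Lipschitz constraint.

```python
import torch

def gradient_penalty(D, real, fake, lam=10.0):
    """WGAN-GP regularizer: push the critic's gradient norm toward 1 at
    random interpolates between real and generated image batches."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake.detach()).requires_grad_(True)
    grads, = torch.autograd.grad(outputs=D(interp).sum(), inputs=interp,
                                 create_graph=True)
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return lam * ((grad_norm - 1) ** 2).mean()
```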
Least Squares GAN. LSGAN [59] has been proposed to overcome the vanishing gradient problem caused by the minimax loss and the non-saturated loss in the original GAN model. LSGAN adopts the least squares (L2) loss function instead of the sigmoid cross-entropy loss function used in the original GAN. As shown in Table 3, a and b are the label codes for generated and real samples, respectively, while c represents the value that G wishes D to believe for generated samples. There are two benefits of LSGAN over the original GAN: first, LSGAN can generate higher-quality samples; second, LSGAN makes the learning process more stable.
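Written out, the LSGAN objectives take the following form, with label codes a (generated), b (real), and c (the value G wants D to assign to generated samples):
$$\min_D V(D) = \tfrac{1}{2}\,\mathbb{E}_{x \sim p_{data}(x)}\big[(D(x)-b)^2\big] + \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z(z)}\big[(D(G(z))-a)^2\big],$$
$$\min_G V(G) = \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z(z)}\big[(D(G(z))-c)^2\big].$$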
Energy-Based GAN. EBGAN [66] models the discriminator as an energy function. This model uses an autoencoder architecture for the discriminator, estimating a reconstruction error that assigns low energy to real samples and higher energy to generated samples. The output of the discriminator in EBGAN goes through a loss function to shape the energy function. Table 3 shows the loss functions, $L_D(x,z) = D(x) + [m - D(G(z))]^+$ and $L_G(z) = D(G(z))$, where m is the positive margin and $[\cdot]^+ = \max(0, \cdot)$. The EBGAN framework exhibits better convergence and scalability, which result in generating high-resolution images [66].
Margin Adaptation GAN. MAGAN [68] is an extension of EBGAN that uses the hinge loss function, in which the margin m is adapted automatically using the expected energy of the real data distribution. Hence, margin m is monotonically reduced over time. Unlike EBGAN, MAGAN converges to its global optimum, where the real and generated sample distributions match exactly.
Boundary Equilibrium GAN. BEGAN [67] is an extension of EBGAN that uses an autoencoder as the discriminator. BEGAN's objective function computes the loss based on the Wasserstein distance. Using proportional control theory, the authors of BEGAN propose an equilibrium method for balancing the generator and the discriminator during training without directly matching the data distributions [47].
5. GAN Structure
The typical GAN is based on a multilayer perceptron (MLP), as mentioned above. Subsequently, structures of various types have been proposed to either solve GANs’ issues or address a specific application, as explained in what follows.
Deep Convolutional GAN. DCGAN [71] is one of the recent major improvements in the field of computer vision and generative modeling. It combines a GAN with a convolutional neural network (CNN). DCGAN has been proposed to stabilize GANs so that deep architectures can be trained to generate high-quality images. DCGAN sets several architectural constraints on the generator and discriminator networks for unsupervised representation learning in order to resolve the issues of training instability and mode collapse. First, DCGAN replaces spatial pooling layers with strided convolutions and fractionally-strided convolutions, allowing the discriminator and the generator to learn their own spatial downsampling and upsampling, respectively. Second, batch normalization (BN) is used to stabilize learning by normalizing the inputs and to mitigate the vanishing gradient problem; BN is mainly applied to prevent the deep generator from collapsing all samples to the same point. Third, eliminating the fully connected hidden layers that would otherwise sit on top of the CNN increases model stability. Finally, DCGAN uses both ReLU and LeakyReLU to allow the model to learn quickly and perform well: the ReLU activation function is used in all generator layers except the last, which uses tanh, while LeakyReLU activation functions are used in all discriminator layers.
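The constraints above translate directly into an architecture. Below is a minimal PyTorch sketch of a DCGAN-style generator for 64 × 64 RGB images (the specific depth, kernel sizes, and channel counts are common illustrative choices, not values prescribed by the paper):

```python
import torch.nn as nn

def dcgan_generator(z_dim=100, feat=64, out_ch=3):
    """DCGAN-style generator: fractionally-strided convolutions for
    upsampling, batch normalization, ReLU everywhere except a tanh output."""
    return nn.Sequential(
        nn.ConvTranspose2d(z_dim, feat * 8, 4, 1, 0, bias=False),     # 1x1 -> 4x4
        nn.BatchNorm2d(feat * 8), nn.ReLU(True),
        nn.ConvTranspose2d(feat * 8, feat * 4, 4, 2, 1, bias=False),  # -> 8x8
        nn.BatchNorm2d(feat * 4), nn.ReLU(True),
        nn.ConvTranspose2d(feat * 4, feat * 2, 4, 2, 1, bias=False),  # -> 16x16
        nn.BatchNorm2d(feat * 2), nn.ReLU(True),
        nn.ConvTranspose2d(feat * 2, feat, 4, 2, 1, bias=False),      # -> 32x32
        nn.BatchNorm2d(feat), nn.ReLU(True),
        nn.ConvTranspose2d(feat, out_ch, 4, 2, 1, bias=False),        # -> 64x64
        nn.Tanh(),
    )
```

Note that the input noise z is expected with shape (N, z_dim, 1, 1), i.e., a noise vector reshaped to a 1 × 1 spatial map.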
Self-Attention GAN. SAGAN [72] incorporates a self-attention mechanism into a convolutional GAN framework to improve the quality of generated images. A traditional convolution-based GAN has difficulty modeling some image classes when trained on large multi-class image datasets due to its local receptive field. SAGAN adds a self-attention mechanism at different stages of both the generator and the discriminator to model long-range, multi-level dependencies across image regions. SAGAN uses three techniques to stabilize the training of a GAN. First, it applies spectral normalization to both the generator and discriminator to improve performance and reduce the amount of computation performed during training. Second, it uses the Two Time-scale Update Rule (TTUR) for both the generator and discriminator to speed up the training of the regularized discriminator. Third, it integrates conditional batch normalization layers into the generator and a projection layer into the discriminator. SAGAN utilizes the hinge loss as the adversarial loss and uses the Adam optimizer; the model achieves state-of-the-art performance on class-conditional image synthesis.
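The self-attention block itself is compact. The following is a simplified PyTorch sketch of the mechanism (the channel-reduction factor of 8 follows the SAGAN paper; other details of the original implementation are pared down for clarity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """1x1 convolutions produce query/key/value maps; a softmax over all
    spatial positions lets each location attend to every other one."""
    def __init__(self, ch):
        super().__init__()
        self.query = nn.Conv2d(ch, ch // 8, 1)
        self.key = nn.Conv2d(ch, ch // 8, 1)
        self.value = nn.Conv2d(ch, ch, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x):
        n, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # (N, HW, C//8)
        k = self.key(x).flatten(2)                    # (N, C//8, HW)
        attn = F.softmax(torch.bmm(q, k), dim=-1)     # (N, HW, HW)
        v = self.value(x).flatten(2)                  # (N, C, HW)
        out = torch.bmm(v, attn.transpose(1, 2)).view(n, c, h, w)
        return self.gamma * out + x                   # residual connection
```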
Progressive Growing GAN. Karras et al. [73] proposed PGGAN to generate large, high-quality images and to stabilize the training process. The PGGAN architecture grows both the generator and the discriminator progressively, starting from small images and then adding new blocks of layers to both networks. These new blocks incrementally enlarge the model so that it can discover large-scale structure and achieve high resolution. Progressive training has three benefits: stabilizing the learning process, increasing the resolution, and reducing the training time.
Laplacian Pyramid GAN. Denton et al. [74] proposed LAPGAN to generate high-quality images by combining conditional GAN models with a Laplacian pyramid representation. A Laplacian pyramid is a linear, invertible image representation consisting of band-pass residuals and a low-frequency residual, built from a Gaussian pyramid. LAPGAN is a cascade of convolutional GANs over the k levels of the Laplacian pyramid. The approach proceeds as follows: first, the image is downsampled by a factor of two at each of the k levels of the pyramid; then, in a backward pass, the image is upsampled by a factor of two at each level to reconstruct it at its original size, with a conditional GAN at each level generating the high-frequency detail that is added back in. LAPGAN is trained through unsupervised learning; each level of the Laplacian pyramid is trained independently and evaluated using both log-likelihood and human evaluation.
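To clarify the representation LAPGAN builds on, here is a small PyTorch sketch of Laplacian pyramid analysis and synthesis (average pooling stands in for the Gaussian blur of the classical formulation, and even image sizes are assumed; in LAPGAN, the residuals on the backward pass would be produced by conditional GANs rather than stored):

```python
import torch.nn.functional as F

def build_laplacian_pyramid(img, levels=3):
    """Decompose an image batch (N,C,H,W) into per-level high-frequency
    residuals plus a low-frequency base, downsampling by two each level."""
    pyramid, current = [], img
    for _ in range(levels):
        down = F.avg_pool2d(current, kernel_size=2)
        up = F.interpolate(down, scale_factor=2, mode='bilinear',
                           align_corners=False)
        pyramid.append(current - up)  # band-pass residual at this level
        current = down
    pyramid.append(current)           # low-frequency base
    return pyramid

def reconstruct(pyramid):
    """Invert the decomposition: upsample and add residuals coarse-to-fine."""
    img = pyramid[-1]
    for residual in reversed(pyramid[:-1]):
        img = F.interpolate(img, scale_factor=2, mode='bilinear',
                            align_corners=False)
        img = img + residual
    return img
```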
VAE-GAN. Larsen et al. [75] combine a variational autoencoder with a generative adversarial network (VAE-GAN) into an unsupervised generative model and train the two jointly to produce high-quality generated images. The combination is implemented by using the VAE decoder as the GAN generator and combining the VAE's objective function with an adversarial loss [47], where the element-wise reconstruction metric is replaced with a feature-wise reconstruction metric in order to produce sharp images.
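The joint objective in [75] combines three terms, with the reconstruction likelihood measured in the feature space of an intermediate discriminator layer $Dis_\ell$ rather than pixel-wise:
$$\mathcal{L} = \mathcal{L}_{prior} + \mathcal{L}^{Dis_\ell}_{llike} + \mathcal{L}_{GAN}, \qquad \mathcal{L}_{prior} = D_{KL}\big(q(z|x)\,\|\,p(z)\big).$$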
7. Image-to-Image Translation Applications
Image-to-image translation techniques have been successfully applied to a wide variety of real-world applications, as described below. This section summarizes three significant aspects in the respective subsections: benchmark datasets, evaluation metrics, and practical applications.
7.1. Benchmark Datasets
There are several benchmark datasets available for image synthesis that can be utilized for image-to-image translation tasks. Such datasets differ in image count, quality, resolution, complexity, and diversity, and they allow researchers to investigate a variety of practical applications, such as facial attributes, cartoon faces, semantic applications, and urban scene analysis. Table 4 summarizes the selected benchmark datasets.
7.2. Evaluation Metrics
To quantitatively and qualitatively evaluate the performance of image-to-image translation, several evaluation metrics that are related to such translation are reviewed and discussed in what follows.
7.3. Practical Applications
Many computer vision and graphics problems can be regarded as image-to-image translation problems. Transferring an image from one scene to another can be viewed as cross-domain image-to-image translation, whereas a one-to-many translation is called multimodal image-to-image translation. There are many image-to-image translation applications, such as transferring a summer scene to a winter scene. In this section, only four widely known applications are covered: super-resolution, style transfer, object transfiguration, and medical imaging.
7.3.1. Super-Resolution
Super-resolution (SR) refers to the process of translating a low-resolution source image to a high-resolution target image. GANs have recently been used to solve super-resolution problems in an end-to-end manner [111] by treating the generator as an SR model that outputs a high-resolution image and using the discriminator as a binary classifier. SRGAN [115] adds a new distributional loss term in order to generate an upsampled image with the resolution increased fourfold; it is based on the DCGAN architecture with residual blocks. ESRGAN [15] improves the overall perceptual quality of SR results by introducing the residual-in-residual dense block (RRDB) without batch normalization in order to further enhance the quality of generated images.
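As a sketch of the RRDB idea (dense connections inside each block, nested residual connections, and no batch normalization), the following PyTorch fragment follows the ESRGAN design; the growth rate and the 0.2 residual scaling are typical settings, used here illustratively:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Five convolutions, each seeing the concatenation of all previous
    features; no batch normalization, LeakyReLU activations."""
    def __init__(self, ch=64, growth=32):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(ch + i * growth, growth if i < 4 else ch, 3, padding=1)
            for i in range(5))
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        feats = [x]
        for i, conv in enumerate(self.convs):
            out = conv(torch.cat(feats, dim=1))
            if i < 4:
                feats.append(self.act(out))
        return x + 0.2 * out  # residual scaling

class RRDB(nn.Module):
    """Residual-in-residual: three dense blocks inside an outer skip."""
    def __init__(self, ch=64):
        super().__init__()
        self.blocks = nn.Sequential(DenseBlock(ch), DenseBlock(ch), DenseBlock(ch))

    def forward(self, x):
        return x + 0.2 * self.blocks(x)
```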
7.3.2. Style Transfer
Style transfer is the process of rendering the content of an image in a specific style while preserving the content, as shown in Figure 6. Earlier style transfer models could only generate one image and transfer it according to one style. However, recent studies attempt to transfer image content according to multiple styles based on a perceptual loss. In addition, with the advancement of deep generative models, adversarial losses can also be used to train the style model to make the output image indistinguishable from images in the target style domain. Style transfer is a practical application of image-to-image translation. Chen et al. [116] propose an adversarial gated network, called Gated-GAN, to transfer multiple styles with a single model composed of three modules: an encoder, a gated transformer, and a decoder. GANILLA [117] is a novel framework with the ability to better balance content and style.
7.3.3. Object Transfiguration
Object transfiguration aims to detect the object of interest in an image and then transform it into another object in the target domain while preserving the background regions, e.g., transforming an apple into an orange or a horse into a zebra. GANs have been explored in object transfiguration to perform two tasks: (1) detecting the object of interest in an image and (2) transforming the object into the target domain. Attention GANs [118] are mostly used for object transfiguration; such a model consists of an attention network, a transformation network, and a discriminative network.
7.3.4. Medical Imaging
GANs have recently been used in medical imaging for two purposes: first, discovering the underlying structure of the training data in order to generate new samples, thereby overcoming privacy constraints and the scarcity of available positive cases; and, second, detecting abnormal images through the discriminator [120]. MedGAN [121] is a medical image-to-image translation framework used to enhance further technical post-processing tasks that require globally consistent image properties. This framework utilizes style transfer losses to match the translated output with the desired target image in terms of style, texture, and content, as shown in Figure 7.
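Style transfer losses of this kind typically compare deep features directly for content and their Gram matrices for style. A generic PyTorch sketch is shown below (the feature maps are assumed to come from a fixed pretrained network; the exact layers and weights used by MedGAN are not reproduced here):

```python
import torch

def gram_matrix(feat):
    """Gram matrix of a feature map (N,C,H,W): channel-wise correlations
    that capture texture/style while discarding spatial layout."""
    n, c, h, w = feat.shape
    f = feat.view(n, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def style_content_losses(feats_out, feats_target):
    """Compare translated and target images layer by layer: content via
    direct feature differences, style via Gram-matrix differences."""
    content = sum(torch.mean((fo - ft) ** 2)
                  for fo, ft in zip(feats_out, feats_target))
    style = sum(torch.mean((gram_matrix(fo) - gram_matrix(ft)) ** 2)
                for fo, ft in zip(feats_out, feats_target))
    return content, style
```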
8. Discussion and Directions for Future Research
This section provides an overview of the structure and mechanisms of recent studies in order to further investigate and compare the state-of-the-art image-to-image translation methods. There are two ways of performing image-to-image translation, namely, by supervised or unsupervised methods using paired or unpaired datasets, respectively, as mentioned in the above classification. Although paired datasets have been used by conditional generative network models to obtain higher-quality outputs, such data are expensive and difficult to collect for certain domains, or may not exist at all. Therefore, many recent studies [41] propose methods for unsupervised image-to-image translation using unpaired images. However, most of these approaches are only capable of learning a deterministic one-to-one mapping: each translation model associates a single input image with a single output image. Modeling richer relationships across domains is more complex, and it is difficult to learn the underlying distribution between two different domains, known as the many-to-many mapping. The MUNIT and DRIT approaches enable multimodal translation by decomposing the latent code into a domain-invariant content space and a domain-specific style space; this leads to mapping ambiguity, since there is more than one proper output and many degrees of freedom in changing the outputs. Many proposed methods have attempted to overcome the limitations of image-to-image translation; however, three main open challenges have not been fully addressed, as investigated and explained in the following section.
8.1. Open Challenges
Image-to-image translation with GANs and GAN variants usually suffers from the mode collapse issue, which occurs when the generator produces the same output regardless of the input, whether it operates on a single input or on multiple modes, e.g., when two images $I_1 = G(c, z_1)$ and $I_2 = G(c, z_2)$ are likely to be mapped to the same mode [107]. There are two types of mode collapse: inter-mode and intra-mode collapse. Inter-mode collapse occurs when the expected output is known, e.g., if digits (0–9) are used and the generator keeps generating the same digit to fool the discriminator. In contrast, intra-mode collapse happens when the generator learns only one style of the expected output to fool the discriminator. Many proposals have recently been made to alleviate and avoid mode collapse; sample approaches include LSGAN [59], a mode-seeking regularization term [107], and cycle consistency [84]. However, the mode collapse problem still has not been completely solved, and it is considered one of the open issues of image-to-image translation tasks.
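The mode-seeking regularization term of [107] makes the intuition explicit: it rewards the generator when distinct latent codes produce distinct images. A minimal PyTorch sketch follows (the L1 distances and the inverted-ratio form mirror the published formulation; the function name is ours):

```python
import torch

def mode_seeking_loss(img1, img2, z1, z2, eps=1e-5):
    """Penalize collapse of two latent codes onto the same output: the
    term is small when the image distance grows with the latent distance."""
    d_img = torch.mean(torch.abs(img1 - img2))
    d_z = torch.mean(torch.abs(z1 - z2))
    return 1.0 / (d_img / (d_z + eps) + eps)  # minimized when ratio is large
```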
As mentioned above, several evaluation methods have been proposed [104] to measure and assess the quality of translated images and to investigate the strengths and limitations of the models used. These evaluation measures can be categorized into quantitative and qualitative; they have been further explored in [122], where the difference between the two techniques is investigated in depth. Measures of the success of image-to-image translation usually evaluate the quality of generated images using a limited number of test images or user studies. Evaluation on a limited number of test images must consider both style and content simultaneously, which is difficult to do [117]. In addition, user studies are based on human judgment, which is a subjective metric [1]. However, there is no well-defined evaluation metric, and it is still difficult to accurately assess the quality of generated images, since there is no strict one-to-one correspondence between the translated image and the input image [1].
Image-to-image translation diversity concerns the quality of diverse generated outputs utilizing multimodal and multi-domain mapping, as mentioned above. Several approaches [42] inject a random noise vector into the generator in order to model a diverse distribution in the target domain. One of the existing limitations of image-to-image translation is the lack of diversity of generated images due to the lack of regularization between the random noise and the target domain [79]. The DRIT [79] and DRIT++ [95] methods have been proposed to improve the diversity of generated images; however, generating high-quality and diverse outputs has not yet been fully explored.
8.2. Directions of Future Research
Deep reinforcement learning (RL) has recently drawn significant attention in the field of deep learning. Deep RL has been applied to a variety of image processing and computer vision tasks: image super-resolution, image denoising, and image restoration [123]. A deep reinforcement learning framework could play a role in a GAN to generate high-quality output, whereby the generator is utilized as an agent and the discriminator's output as the reward signal: the generator would be trained as an agent and rewarded every time it fools the discriminator, whereas the discriminator's training process would remain as in general GANs, distinguishing the generated distribution from the real distribution (a toy sketch of this idea is given at the end of this subsection). Moreover, reconstructing a 3D shape from either a two-dimensional (2D) object or sketches has been a challenging task, since 3D GANs for image-to-image translation remain largely unexplored and available 3D datasets are scarce. Exploring 3D GANs and volumetric convolutional networks can play a very important role in the future of generating 3D images. Furthermore, cybersecurity applications could utilize image-to-image translation with GANs to design reliable and efficient systems; GAN-based image steganography should be further investigated and developed in order to overcome critical cybersecurity issues and challenges.
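As a purely illustrative sketch of the reinforcement learning direction (not a published method), a generator update can be framed as REINFORCE with the discriminator's score as the reward; here, the "policy" is an assumed Gaussian around the generator's output, and every name and hyperparameter is hypothetical:

```python
import torch

def reinforce_generator_step(G, D, opt_g, z_dim=100, batch=32, sigma=0.1,
                             device='cpu'):
    """Treat G as an agent: sample an 'action' image around G(z), score it
    with D (no gradient through D), and apply a policy-gradient update."""
    z = torch.randn(batch, z_dim, device=device)
    mean = G(z)
    action = mean + sigma * torch.randn_like(mean)  # sampled action image
    with torch.no_grad():
        reward = D(action).flatten()                # fooling D = high reward
    # REINFORCE: raise the log-probability of actions with high reward.
    log_prob = -((action.detach() - mean) ** 2).flatten(1).sum(1) / (2 * sigma ** 2)
    baseline = reward.mean()                        # variance-reduction baseline
    loss = -((reward - baseline) * log_prob).mean()
    opt_g.zero_grad(); loss.backward(); opt_g.step()
    return loss.item()
```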
9. Conclusions
Image-to-image translation with GANs has achieved great success in computer vision applications. This article presents a comprehensive overview of GAN variants related to image-to-image translation in terms of algorithms, objective functions, and structures. Recent state-of-the-art image-to-image translation techniques, both supervised and unsupervised, are surveyed and classified. In addition, benchmark datasets, evaluation metrics, and practical applications are summarized. This review also covers open issues related to mode collapse, evaluation metrics, and lack of diversity. Finally, reinforcement learning and 3D models have not been fully explored and are suggested as future directions towards better performance on image-to-image translation tasks. In the future, quantum generative adversarial networks for image-to-image translation should be further explored and implemented in order to overcome complex problems related to image generation.