StyleGANs and Transfer Learning for Generating Synthetic Images in Industrial Applications

Deep learning applications in computer vision require large volumes of representative data to obtain state-of-the-art results, due to the massive number of parameters to optimise in deep models. However, in industrial applications, data are limited and have asymmetric distributions due to rare cases, legal restrictions, and high image-acquisition costs. Data augmentation based on deep learning generative adversarial networks, such as StyleGAN, has arisen as a way to create training data with symmetric distributions that may improve the generalisation capability of built models. StyleGAN generates highly realistic images in a variety of domains as a data augmentation strategy but requires a large amount of data to build image generators. Thus, transfer learning in conjunction with generative models is used to build models from small datasets. However, the effect of the pre-trained generative model on the quality of the generated images has not been reported. In this paper, we evaluate a StyleGAN generative model with transfer learning on different application domains—training with paintings, portraits, Pokémon, bedrooms, and cats—to generate target images with different levels of content variability: bean seeds (low variability), faces of subjects between 5 and 19 years old (medium variability), and charcoal (high variability). We used the first version of StyleGAN due to the large number of publicly available pre-trained models. The Fréchet Inception Distance was used to evaluate the quality of the synthetic images. We found that StyleGAN with transfer learning produced good-quality images, making it an alternative for generating realistic synthetic images in the evaluated domains.


Introduction
Deep learning methods, a subset of machine learning techniques, have achieved outstanding results on challenging computer vision problems, such as image classification, object detection, face recognition, and motion recognition [1]. However, the use of deep learning requires a large volume of representative annotated data to learn general models that achieve accurate results [2]; data are still scarce, with asymmetric distributions (i.e., a disproportionate number of examples between classes), in most applications related to healthcare, security, and industry, due to legal/ethical restrictions, unusual patterns/cases, and image annotation costs [3][4][5].
As an alternative, image data augmentation has emerged as a way to create training data with symmetric distributions, increasing the amount of data and reducing overfitting in deep learning models [6]. Data augmentation has traditionally relied on simple transformations, such as rotations, mirroring, and noise addition [4]. However, simple transformations produce a reduced amount of valid data, which is usually highly correlated and yields overfitted models with poor generalisation capacity.
Generative Adversarial Networks (GANs) [7] have emerged as an alternative for creating synthetic images by learning the probability distribution of the data and generating images with high diversity and low correlation that can be used to build deep learning models [4,5,[8][9][10][11][12][13]. GANs are used in medical applications, such as CT image segmentation [4,14] and disease/injury detection [12,13,[15][16][17]. Nevertheless, since GANs are deep-learning-based models, they also require a significant amount of data and computational time to be trained from scratch. This drawback limits the use of GANs for generating images in applications where data are scarce, such as security and industry. A way to cope with this disadvantage is the use of transfer learning techniques, which allow building new models from models pre-trained in other applications or source domains with abundant training data, transferring the main features and reducing the training time [2].
Transfer learning has been widely used to address image classification [18][19][20] and segmentation [21,22] problems. However, the effect of transfer learning with StyleGAN on the quality of the generated images is poorly reported. Wang et al. [23] evaluated the transferability of features between different source and target domains to build generative models with transfer learning, but with some limitations, such as the generation of low-resolution images and the lack of evaluation of the impact of content variability in the target domains.
In this paper, we evaluate a data augmentation strategy based on transfer learning and StyleGAN [24]. We use the first version of StyleGAN due to the large number of publicly available pre-trained models. In particular, we evaluate the capability of StyleGAN and transfer learning to generate synthetic images (data augmentation) considering different levels of content variability. Thus, we assess quantitatively and visually the quality of the generated images, using three target domains with StyleGANs fine tuned from five pre-trained models (source domains). The evaluated target domains correspond to three image sets derived from industrial processes with different levels of content variability, shown in Figure 1: bean seeds (low variability), faces of people aged between 5 and 19 years (medium variability), and chars obtained during coal combustion (high variability). The assessed source domains, used to transfer features and build generative models, correspond to five pre-trained StyleGANs with images of paintings, portraits, Pokémon, bedrooms, and cats. In contrast to the common transfer learning strategy of using related source and target domains, our evaluation focuses on source and target domains that are completely different. The results show that StyleGAN with transfer learning is suitable for generating high-resolution images in industrial applications, owing to its good generalisation capability regarding the content variability of the target images.

The rest of the paper is structured as follows: Section 2 presents the theoretical background on StyleGAN, GAN assessment, and transfer learning. Section 3 summarises the relevant related works. Section 4 details the data augmentation strategy used as an evaluation pipeline. Section 5 describes the performed experiments and results. Section 6 presents the discussion of the obtained results, focusing on the effect of pre-trained models on synthetic image generation. Section 7 depicts the conclusions and future research lines.

StyleGAN
StyleGAN [24] combines the architecture of Progressive Growing GAN [25] with style transfer principles [26], as shown in Figure 2. StyleGAN's architecture addresses some limitations of GAN models, such as instability during training and lack of control over the generated images. As depicted in Figure 2 [27], StyleGAN is composed of three neural networks: (a) the mapping network, which converts a random latent vector into a style signal; (b,c) the progressive generator network, which receives the style signal (A) and random noise (B) and produces images progressively; and (d) the progressive discriminator network, which compares real and generated images to update the weights of all three networks, improving their performance.
In a traditional GAN model, a generative network receives as input a random vector Z, or latent vector, to generate a new image. In contrast, in the StyleGAN architecture, a latent vector Z (512-dimensional) feeds an 8-layer neural network, called a mapping network, that transforms the latent vector into an intermediate space W (512-dimensional), which defines the style of the image to be generated; see Figure 2a.
An image style defined in the intermediate space W is transferred to the progressive generative network (Figure 2b), where Adaptive Instance Normalization (AdaIN) [26] transforms the latent vector W into scale and bias parameters that control the style of the image generated at each resolution level. In addition to the style guide provided by AdaIN, the progressive generator network receives a learned constant as input. This constant is a tensor of 4 × 4 × 512 dimensions, i.e., an image of 4 × 4 pixels with 512 channels, which is learned during network training and contains a kind of sketch with the general characteristics of the training set images. Furthermore, StyleGAN injects noise at each resolution level to introduce slight variations in the generated images (Figure 2c). These improvements in the StyleGAN generative network optimise the quality of the generated synthetic images. Finally, the backpropagation algorithm is applied to adjust the weights of the three networks, improving the quality of the images generated during the following iterations (Figure 2d).
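As a concrete illustration of the Z → W → AdaIN flow, the following is a minimal PyTorch sketch of the mapping network and the AdaIN operation. It is a simplified approximation for illustration only (the official implementation is TensorFlow-based) and omits progressive growing and the per-layer noise inputs.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """8-layer MLP mapping a latent vector z (512-d) to the intermediate
    style space W (512-d), as in Figure 2a."""
    def __init__(self, dim=512, n_layers=8):
        super().__init__()
        layers = []
        for _ in range(n_layers):
            layers += [nn.Linear(dim, dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)

class AdaIN(nn.Module):
    """Adaptive Instance Normalization: a learned affine transform turns the
    style vector w into per-channel scale and bias values that modulate the
    normalised feature maps at one resolution level."""
    def __init__(self, channels, w_dim=512):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels)
        self.affine = nn.Linear(w_dim, channels * 2)  # produces scale and bias

    def forward(self, x, w):
        scale, bias = self.affine(w).chunk(2, dim=1)
        scale = scale[:, :, None, None]  # broadcast over spatial dimensions
        bias = bias[:, :, None, None]
        # (1 + scale) keeps the initial modulation close to identity
        return (1 + scale) * self.norm(x) + bias

# Example: style-modulate a batch of feature maps
z = torch.randn(4, 512)
w = MappingNetwork()(z)
x = torch.randn(4, 256, 8, 8)   # feature maps at one generator level
styled = AdaIN(256)(x, w)
```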

GANs Model Evaluation
Evaluating GAN architectures is particularly hard because there is no consensus on a single metric that assesses both the quality and the diversity of generated images [4,23].
However, a widely used metric in the literature is the Fréchet Inception Distance (FID) [28], defined as

FID = ||µ_r − µ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}),    (1)

where µ_r, µ_g are mean vectors and Σ_r, Σ_g are variance-covariance matrices. Equation (1) is a distance measurement between the distributions of real and generated images. X_r ∼ N(µ_r, Σ_r) and X_g ∼ N(µ_g, Σ_g) are 2048-dimensional multivariate normal distributions fitted to features of real and generated images, respectively, which are extracted from the third pooling layer of the Inception-v3 network [29]. The closer the distributions, the lower the metric value: FID values close to zero correspond to a larger similarity between real and generated images.
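For illustration, Equation (1) can be computed directly from the extracted features with NumPy and SciPy. This is a minimal sketch assuming the 2048-dimensional Inception-v3 features are already available, not the exact evaluation code used in the experiments:

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(feats_real, feats_gen):
    """FID between two feature sets of shape (n_samples, 2048), extracted
    from the final pooling layer of Inception-v3 for real and generated
    images, following Equation (1)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)

    diff = mu_r - mu_g
    # Matrix square root of the product of the covariance matrices
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard numerical-noise imaginary parts
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```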

Transfer Learning
Transfer learning involves applying the knowledge learned in a source domain, where a large amount of training data is available, to a target domain that has a reduced amount of data, as illustrated in Figure 3. The objective is to transfer most of the characteristics learned from a source domain into a target domain in order to reduce training time and enable deep learning models with limited data [30]. Transfer learning is defined as follows: given a source domain D_S, a source learning task T_S, a target domain D_T, and a target learning task T_T, transfer learning aims to improve the learning of the target predictive function f_T(.) in D_T using the knowledge learned in D_S and T_S, where D_S ≠ D_T or T_S ≠ T_T. A domain D is defined by two components, a feature space X and a marginal probability distribution P(X), D = {X, P(X)}. Similarly, a learning task T consists of two components, a label space Y and a target predictive function f(.), T = {Y, f(.)}.
In unsupervised models, such as GANs, the label space Y does not exist, and the learning task objective is to estimate the generative distribution of the data. For this purpose, a particular transfer learning technique, known as fine tuning, is used. Fine tuning adjusts the parameters of a model that was trained for a specific task so that it can perform a new task with similar characteristics. Thus, the model is not built from scratch but instead takes advantage of the characteristics learned in the original task.

Related Works
GAN-based image generation studies fall into two categories: low-resolution [4,8,12,13,15,16,23] and high-resolution [14,17,[31][32][33][34][35][36][37][38][39][40][41]. A summary of these approaches is presented in Table 1. The first group mostly comprises studies between 2017 and 2018, with a predominance of medical image augmentation focused on cancer detection [4,15], cerebral diseases [16], and COVID-19 [12,13]. Image generation in this area is motivated by the high cost of medical image acquisition, which translates into a limited number of samples to train deep learning models. Publications also include studies that evaluated the effectiveness of GAN data augmentation using computer vision benchmark datasets [8,23]. In particular, the work of Wang et al. [23] is the only previous attempt to use transfer learning with generative models. This study concluded that it is possible to apply transfer learning in a Wasserstein GAN model [42], using source and target domains with low-resolution images.
The second group includes studies from 2019 to date, driven by new GAN architectures, such as StyleGAN [24], that generate high-resolution images. There are three application areas: agriculture [31,32], medicine [14,17,36,37], and the electrical domain [35]. In particular, Fetty et al. [17] presented a complete analysis of StyleGAN models trained from scratch for data augmentation of images of pelvic malignancies.
Transfer learning on generative models for limited data has been the subject of study for the last three years [33,34,[38][39][40][41], focusing on evaluating the impact of freezing the lower generator layers [33,34], the lower discriminator layers [39], or both [40], using mainly general-purpose datasets of indoor scenes (e.g., LSUN Bedrooms) and faces (e.g., CelebA-HQ, FFHQ, CelebA). The results show that knowledge transfer reduces both overfitting and training time. However, transfer learning in conjunction with generative models has not been evaluated with a focus on the capability to generate high-resolution synthetic images while considering the content variability levels of the target domain.
We aim to fill this literature gap by using the proposed pipeline to evaluate the capability to transfer knowledge from given source domains to target domains with different levels of content variability: bean seed images (simple shape, texture, and colour) and young face and char images (more complex visual features). These target domains correspond to real industrial applications.

Evaluation Pipeline for Synthetic Image Generation
We propose a five-step pipeline based on fine tuning StyleGAN models pre-trained on five source domains (paintings, portraits, Pokémon, bedrooms, and cats), as shown in Figure 4, in order to generate synthetic images in three target domains: bean seeds, young faces, and chars. Although there is a newer version of StyleGAN, called StyleGAN2 [43], we selected the first version due to the large number of publicly available pre-trained StyleGAN models. Moreover, in practice, training a StyleGAN model from scratch requires a huge number of images, considerable computational resources (preferably multiple GPUs), and long processing times.
In the proposed image generation pipeline, first, we select the target domain images (bean seeds, young faces, or chars) as input. Second, the target domain images are pre-processed to improve the transferability of features by adjusting the image resolution. Third, the pre-trained StyleGAN models are fine tuned with the pre-processed images. Fourth, the FID metric is used to select the best source domain. Fifth, the synthetic images for the input target domain are generated with the model fine tuned from the best source domain; a sketch of this flow is shown below.
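Purely as an illustration of the control flow, the five steps could be tied together as follows. The helper callables (preprocess, fine_tune, fid) are placeholders for the operations detailed in the next subsections, not functions from the released source code:

```python
def evaluate_pipeline(target_images, pretrained_models,
                      preprocess, fine_tune, fid, n_samples=1000):
    """Hypothetical driver for the five-step pipeline.

    pretrained_models: dict mapping source-domain name -> pre-trained model.
    preprocess / fine_tune / fid: placeholder callables for steps 2-4.
    """
    images = preprocess(target_images)                        # step 2
    scores = {}
    for name, model in pretrained_models.items():
        tuned = fine_tune(model, images)                      # step 3
        scores[name] = fid(images, tuned.sample(n_samples))   # step 4
    best = min(scores, key=scores.get)                        # lowest FID wins
    return best, scores                                       # step 5 generates with `best`
```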

Input Target Domain Images
We used images from three application domains (bean seeds, young faces, and chars) with different content variability levels and potential industrial applications; see Figure 1.
Bean seeds: The dataset [44] has 1500 seed images from 16 bean varieties. Bean seed images are classified as low content variability since the shape, colour, and texture characteristics are homogeneous across the analysed seed varieties, corresponding to oval shapes with a limited range of red, cream, black, and white colours. In addition, the acquired images share the same background colour. Synthetic images of bean seeds are valuable in developing evaluation tools for genetic breeding trials. These tools are used to preserve the genetic pedigree of seeds over time, meet market quality requirements, and increase production levels [45].
Young faces: The image set consists of 3000 images randomly selected from publicly available reference datasets for age estimation problems: IMDB-WIKI [46], APPA-Real [47], and AgeDB [48]. Images correspond to individuals aged between 5 and 19 years. This age range was selected because it presents the lowest frequencies in the considered facial datasets. In addition, young faces are crucial in cybersecurity applications, such as access control, the detection of Child Sexual Exploitation Material, or the identification of victims of child abuse [49][50][51]. Young facial images are considered to be of medium content variability since faces share a similar shape structure; moreover, the StyleGAN model was originally designed for face generation.
Chars: The dataset contains 2928 segmented char particle images from coals of high, medium, and low reactivity. Char images are considered to be of high variability content due to the complex particle shapes and lack of colours. Synthetic images of char particles are useful to train models to estimate the combustion parameters in power generation plants [52]. Table 2 contains a summary of the target domains, including the source, number of images, number of classes, and content variability type.

Pre-Processing Target Domain Images
Input images are pre-processed to improve the transfer of features from the source domain images to the target domain. The pre-processing consists of adjusting the image resolution to match the resolution of the source domain. In short, resizing operations (up-scaling or down-scaling) are applied depending on the difference between the target and source image dimensions.
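A minimal sketch of this step, assuming PNG images on disk and using the Pillow library (the actual pre-processing code may differ):

```python
from pathlib import Path
from PIL import Image

SOURCE_RESOLUTION = 512  # 512 x 512 for paintings/portraits/Pokemon,
                         # 256 x 256 for bedrooms/cats (see Table 3)

def preprocess(in_dir, out_dir, size=SOURCE_RESOLUTION):
    """Rescale every target-domain image to the source-domain resolution."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(in_dir).glob("*.png")):
        img = Image.open(path).convert("RGB")
        # LANCZOS resampling handles both up-scaling and down-scaling
        img = img.resize((size, size), Image.LANCZOS)
        img.save(out / path.name)
```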

Transfer Learning from Source Domains
We used five pre-trained StyleGAN models (paintings, portraits, Pokémon, bedrooms, and cats), shown in Figure 5. The selection of pre-trained models was based on (i) the public availability of the models and (ii) the diversity of the images used to build them. Table 3 presents relevant information about the pre-trained models: the image source, the required image resolution, and the number of iterations used for training.

Figure 5. Illustration of images generated by the pre-trained source domain models: paintings, portraits, Pokémon, bedrooms, and cats.

Table 3. Pre-trained StyleGAN models used as source domains.

Source Domain     Image Resolution    Number of Iterations
Paintings [53]    512 × 512           8040
Portraits [54]    512 × 512           11,125
Pokémon [55]      512 × 512           7961
Bedrooms [56]     256 × 256           7000
Cats [56]         256 × 256           7000

Transfer learning is performed by fine tuning the pre-trained StyleGAN models (source domains) with images of the target domain to build new image generators. During the fine tuning of the StyleGAN models, the learning rate is set to 0.001 and the number of minibatch repetitions is set to 1, based on the values reported in [57]. The selected values for the learning rate and minibatch repetition increase the stability and speed during training.
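For illustration, the following self-contained PyTorch sketch shows the shape of such a fine-tuning loop with the hyperparameters above. It uses tiny stand-in networks and random tensors instead of the real StyleGAN architecture and image data, so it demonstrates the training pattern rather than the official code:

```python
import torch
import torch.nn as nn

class TinyG(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(512, 256), nn.ReLU(),
                                 nn.Linear(256, 64))
    def forward(self, z):
        return self.net(z)

class TinyD(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(64, 128), nn.ReLU(),
                                 nn.Linear(128, 1))
    def forward(self, x):
        return self.net(x)

G, D = TinyG(), TinyD()
# In the real pipeline, the pre-trained source-domain weights are loaded here,
# e.g. G.load_state_dict(torch.load("source_G.pt"))  (hypothetical file name).

opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)  # learning rate 0.001
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

target_images = torch.randn(256, 64)  # stand-in for target-domain images
for step in range(100):
    real = target_images[torch.randint(0, 256, (32,))]
    z = torch.randn(32, 512)
    fake = G(z)

    # Discriminator update (each minibatch used once: 1 minibatch repetition)
    loss_D = (bce(D(real), torch.ones(32, 1)) +
              bce(D(fake.detach()), torch.zeros(32, 1)))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator update; all layers stay trainable, so the source-domain
    # features are adapted to the target domain rather than frozen.
    loss_G = bce(D(fake), torch.ones(32, 1))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```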

Selection of the Best Source Domain
Once the StyleGAN models are fine tuned, the best source domain for generating images of a target domain is selected based on the FID metric. In particular, the StyleGAN model with the lowest FID value performs best, since it generates images whose distribution is most similar to that of the target domain.
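The selection itself is a simple argmin over the FID values. In the following illustrative snippet, only the two bean seed values quoted in Section 5 are filled in; the full set of values for all five source domains is reported in Table 4:

```python
# Illustrative selection step: pick the source domain with the lowest FID.
fid_bean_seeds = {"paintings": 23.26, "cats": 57.92}
best_source = min(fid_bean_seeds, key=fid_bean_seeds.get)
print(best_source)  # -> paintings
```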

Synthetic Images Generation (Output)
The StyleGAN model with the best performance is used to generate as many images as needed for the target domain (data augmentation); see the sketch below.
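The generation step can then be as simple as the following sketch, where G stands for the fine-tuned generator (for example, the one produced in the fine-tuning sketch above):

```python
import torch

def generate(G, n_images, latent_dim=512, batch=64):
    """Sample latent vectors and run the fine-tuned generator G to produce
    as many synthetic images as needed (data augmentation)."""
    outputs = []
    with torch.no_grad():
        for i in range(0, n_images, batch):
            z = torch.randn(min(batch, n_images - i), latent_dim)
            outputs.append(G(z))
    return torch.cat(outputs)
```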

Experimental Evaluation
We assess the transfer learning capability of pre-trained models from five source domains (paintings, portraits, Pokémon, bedrooms, and cats) to build StyleGAN models for generating images of unrelated target domains with different levels of content variability: bean seeds, young faces, and chars. Pre-processed images from the target domains are used to fine tune the pre-trained StyleGAN models (source domains) over 1000 iterations, using the hyperparameters described in Section 4.3. We ran the experiments on a GNU/Linux machine with an Nvidia TITAN Xp GPU (11 GB), CUDA 10.1, and cuDNN 7. The source code is available at https://github.com/haachicanoy/stylegan_augmentation_tl (accessed on 28 July 2021). Table 4 presents the FID values obtained for the fine tuned StyleGAN models, and Table 5 illustrates the images generated for each target domain.

Table 5. Generated images for each target domain (bean seeds, young faces, and chars) using fine tuned StyleGAN models; rows show the original image and the images generated from each source domain.
The results show that StyleGAN models are able to generate bean seed images (low content variability) through transfer learning with excellent performance. FID values range between 23.26 and 57.92, corresponding to the paintings and cats source domains, respectively. Hence, the best source domain for generating bean seed images is paintings, indicating that the colour patterns in paintings are more similar to those of beans than in the other source domains.
Regarding the generation of young face images (medium content variability), the bedrooms source domain achieves the best results (FID of 16.98). Bedrooms are one of the most complete source domains, standing out for colour and shape features that are efficiently transferred to generate facial images. It is essential to highlight that the five source domains yield FID values lower than 30.11. This performance may be related to the fact that the StyleGAN architecture was specifically developed for generating synthetic face images.
During the fine tuning of models to generate char images, we observed that the source domains of Pokémon and portraits do not converge, leading the training to fail. This is presumably because of the high content variability of chars with complex shapes, varying sizes, and changes in colour intensities that make it difficult to adapt the features from these two source domains. The remaining source domains (paintings, bedrooms, and cats) have features that can be successfully transferred to the generation of char images. In particular, bedrooms achieve the best performance (FID of 34.81).
In most cases, the fine tuned models generate images of bean seeds, young faces, and chars with visual characteristics similar to the original ones (see Table 5). However, in some cases, the generated images have visual defects; see Figure 6. Defects in bean seed images comprise shape deformation, stains, and changes in colour intensities. Defects in young face images correspond to colour spots and alterations in the hair, skin, and smile. Defects in char images include blurring and undefined shapes. Therefore, the generated images have to be filtered to remove defective images before using them in any application, e.g., the training of an image classifier. A sample of 1000 synthetic images of the target domains generated from the best source domains is available at https://doi.org/10.7910/DVN/HHSJY8 (accessed on 23 July 2021).

Figure 6. Defects in generated images by target domain (bean seeds, young faces, and chars).

Figure 7 shows the evolution of the FID metric across the fine tuning iterations of the StyleGAN models for the evaluated target domains (bean seeds, chars, and young faces). The results show that, in most cases, the models trained for a target domain, regardless of the source domain, tend to stabilise and converge to a constant FID value after a certain number of iterations. This occurs in all cases except for the target domain with high content variability (chars), where some source domains (portraits and Pokémon) do not converge. Hence, we conclude that transfer learning and generative models, such as StyleGAN, can be successfully used to build generators of images with low and medium content variability, such as seeds and faces. However, the generation of synthetic images with high content variability is limited by the characteristics of the source domains. In particular, the best FID values are obtained for the source domains of paintings (bean seeds) and bedrooms (chars and young faces). We also analysed the loss values of the generator and discriminator networks during training, shown in Figure 8. Similar to the FID values, the loss values of the source domains exhibit steady behaviour, except for the cats domain, which shows instability around 500 iterations.

Effect of Pre-Trained Models on Synthetic Image Generation
The use of transfer learning significantly reduces the number of images (as few as 1500 for beans) and iterations (1000 iterations, about 2 days) required to build StyleGAN models, in comparison to models trained from scratch (70,000 images and 14 days) [24].
Regarding the quality of the synthetic images, Figure 9 presents a bar graph of the FID values by source domain (paintings, portraits, Pokémon, bedrooms, and cats), grouped by target domain: bean seeds, chars, and young faces. The bar heights correspond to the median FID values obtained during the fine tuning of the StyleGAN models for each source domain, while the black line on each bar represents the dispersion of the FID values; its length denotes the range of the dispersion. Longer lines indicate that the images generated from a source domain differ significantly from the target domain.
The source domains with the best performance (lowest median FID value and dispersion) are bedrooms, for young faces and chars, and paintings, for bean seeds. Although bedrooms yields the lowest FID value for the chars target domain, it is also the source domain with the largest dispersion, indicating the possible generation of char images with defects.

Conclusions
StyleGAN with transfer learning is a strategy for generating synthetic images with a limited number of images from the target domain. We evaluated the application of StyleGAN with transfer learning to generating high-resolution images through a pipeline based on the fine tuning of StyleGAN models. The evaluation was conducted using three target domains from industrial applications with different content variability (bean seeds, chars, and young faces) and five source domains from general applications (paintings, portraits, Pokémon, bedrooms, and cats) to perform transfer learning.
The experimental evaluation confirmed the potential of StyleGAN with transfer learning for generating synthetic images for industrial applications. The proposed pipeline performed better with target domains with low and medium content variability in terms of colour and shape, such as bean seeds and young faces. Moreover, the time and number of images required to build the models were reduced in all cases, which validates the use of StyleGAN with transfer learning for generating synthetic images.
As future work, strategies to optimise the fine tuning hyperparameters will be evaluated to improve the performance of image generators for high content variability and to reduce the defects in synthetic images. General-purpose datasets, such as FFHQ and LSUN, will also be assessed.