A Comparative Study of Engraved-Digit Data Augmentation by Generative Adversarial Networks

Abstract: In cases where an efficient information retrieval (IR) system retrieves information from images with engraved digits, as found on medicines, creams, ointments, and gels in squeeze tubes, the system needs to be trained on a large dataset. One application of such a system is to automatically retrieve the expiry date to ascertain the efficacy of a medicine. For expiry dates expressed in engraved digits, it is difficult to collect digit images. In our study, we evaluated the augmentation performance of various generative adversarial networks (GANs) on a limited, engraved-digit dataset. Our study informs the choice of an effective GAN for engraved-digit image data augmentation. We conclude that the Wasserstein GAN with a gradient norm penalty (WGAN-GP) is a suitable data augmentation technique to address the challenge of producing a large, realistic, but synthetic dataset. Our results show that the stability of WGAN-GP aids in the production of high-quality data, with an average Fréchet inception distance (FID) of 1.5298 across images of the 10 digits (0–9), which are nearly indistinguishable from our original dataset.


Introduction
As machine learning and big data engineering have grown in popularity, information retrieval (IR) from various sources has become a subject of discussion. In general, an efficient IR system based on machine learning techniques requires a large collection of data sources because it learns by training with a large amount of data. To date, several issues have been raised in the early stages of applying recommender systems in the healthcare industry to assist health practitioners and users in making efficient and accurate health-related decisions [1]. We are particularly concerned about the implications of insufficient datasets used to train some of these models. One of the system applications involves the retrieval of expiry dates found on daily-life products administered to the human body. It is important to help people notice the expiry date, as it can convey additional information about the product and help them regulate their health. We introduce generative adversarial networks (GANs) as a way to use IT technology to support the continuous healthy life of people. The disabled, the elderly, and patients with vision loss, in particular, have difficulty checking dates because the digits are small and sometimes blurry. Expiry dates are often expressed as engraved digits, as shown in Figure 1. To automatically retrieve the engraved digits in expiry dates, we need a large dataset of engraved digits to train the IR system. We collected image (photo) data from medicines, consumables, cosmetic products, and tube-type ointments to recognize the expiration dates expressed in engraved digits. However, the classification performance was poor because of the small amount of data. Our dataset contained images that showed both the properties of digits and unusual shadows created by the engraved shapes, and it was differentiated from the MNIST dataset [2] by these properties.
Generative adversarial networks (GANs) have grown in popularity and are used to generate synthetic datasets that are close to, and almost indistinguishable from, the real data [3]. A GAN is a machine learning technique in which two models are trained simultaneously: the generator, G, and the discriminator, D. G is trained to generate fake data samples, and D is trained to distinguish between fake and real data [3,4]. These two models are trained together in an adversarial zero-sum game until the discriminator model is tricked about half of the time, implying that the generator model generates satisfactory outputs [3,4]. When D effectively distinguishes between real and fake samples, it is rewarded or its parameters are left unchanged, whereas G is penalized with significant parameter updates. D determines whether a batch of samples produced by G is authentic or fake by comparing it with actual samples from the original datasets. To improve its ability to distinguish between genuine and fake samples in the following round, D is updated. More crucially, G is also updated based on whether D is successfully tricked by the generated samples [3,4]. Therefore, the fundamental building blocks of GANs are the generator G, the discriminator D (or critic, C), and the associated loss functions. The G model generates new plausible examples from the problem domain, and the D model classifies examples from the domain as real or fake. The loss function helps to evaluate the model performance [3,5]. It measures the accuracy of the model in terms of predicting the expected outcome and, fundamentally, improves the stability of the trained GAN model.
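The alternating update scheme described above can be sketched in a toy one-dimensional setting. Everything here (a linear generator, a logistic-regression discriminator, and the hand-derived gradients) is our own illustrative construction, not the architecture used in this study:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Toy 1-D setting: real data ~ N(4, 1); generator G(z) = w_g*z + b_g;
# discriminator D(x) = sigmoid(w_d*x + b_d). All names are our own.
w_g, b_g = 1.0, 0.0
w_d, b_d = 0.0, 0.0
lr, batch = 0.05, 64

for step in range(3000):
    # --- Discriminator update: push D(real) -> 1, D(fake) -> 0 ---
    real = rng.normal(4.0, 1.0, batch)
    z = rng.normal(0.0, 1.0, batch)
    fake = w_g * z + b_g
    s_r, s_f = sigmoid(w_d * real + b_d), sigmoid(w_d * fake + b_d)
    # gradients of -log D(real) - log(1 - D(fake)) w.r.t. (w_d, b_d)
    gw = np.mean((s_r - 1) * real) + np.mean(s_f * fake)
    gb = np.mean(s_r - 1) + np.mean(s_f)
    w_d -= lr * gw
    b_d -= lr * gb

    # --- Generator update: non-saturating loss -log D(G(z)) ---
    z = rng.normal(0.0, 1.0, batch)
    fake = w_g * z + b_g
    s_f = sigmoid(w_d * fake + b_d)
    # chain rule through D: d(-log D)/d(fake) = (s_f - 1) * w_d
    gw = np.mean((s_f - 1) * w_d * z)
    gb = np.mean((s_f - 1) * w_d)
    w_g -= lr * gw
    b_g -= lr * gb

fake_mean = float(np.mean(w_g * rng.normal(0, 1, 10000) + b_g))
print(fake_mean)  # drifts from 0 toward the real mean of 4
```

Because G is rewarded only when D is fooled, the generated samples migrate from their initial mean of 0 toward the real data, which is the dynamic the paragraph above describes.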
The GAN architecture is shown in Figure 2. First, we sample the noise, z, using a normal or uniform distribution. With z as an input, we use a generator, G, to create an image x (x = G(z)) after performing multiple transposed convolutions to upsample z. The discriminator processes the real images (training samples) and the generated images separately and distinguishes whether an input image is real or generated. The output D(x) is the probability that the input x is real: if the input is real, D(x) should equal 1; if it is generated, it should be 0. Through this process, the discriminator identifies the features that distinguish real images from generated ones. We train the generator to create images that the discriminator may interpret as real. We train both networks in alternating steps and lock them into a fierce competition to improve themselves. Eventually, the discriminator detects only a small difference between the real and generated images, and the generator produces images that the discriminator cannot distinguish. The GAN model repeats this process until training converges and the two distributions are approximately equal. There are several application domains of GANs, and significant progress has been made in areas such as image super-resolution, art creation, speech synthesis, image-to-image translation, video prediction, and 3D object generation [3,4,7–10]. GANs often operate on input images and leverage convolutional neural networks (CNNs) as generator and discriminator models, and the performance of these neural networks often improves with the amount of available data. Data augmentation is a technique to artificially generate data. Data augmentation using GANs helps to generate more plausible training datasets from existing data [7–10]. This helps to develop better models by improving the model characteristics and providing a regularizing effect, which reduces the generalization error.
In addition to GANs, restricted Boltzmann machines (RBMs), generative stochastic networks (GSNs), deep belief networks (DBNs), deep Boltzmann machines (DBMs), and variational autoencoders are used in deep generative modeling [4,11]. Unlike classic augmentation algorithms, GANs can handle invariances beyond those represented by simple transformations and strengthen weak points in the learned decision boundaries [7,12]. In addition, GANs have exhibited tremendous computation speed and quality of results compared with other methods [7]. GANs generate new training data, which leads to better classification performance [4,7]. To select the best-performing model for data augmentation, the output from each model is evaluated by human visual inspection [13–16] and by calculating the Fréchet inception distance (FID). The FID is a metric used with GANs that computes the Fréchet (2-Wasserstein) distance between the distributions of feature vectors extracted from real and generated images [14,17–19]. With the FID, it is possible to identify output images whose diversity is close to that of the original images. A very low score indicates that the generated and original images have comparable statistics [18]. The FID is expressed as given by Equation (1):

FID(x, y) = ||μx − μy||² + Tr(∑x + ∑y − 2(∑x∑y)^(1/2)),  (1)

where x and y are the real and generated samples, respectively, i.e., the activations from the pretrained Inception-v3 model [20], and μx and μy are the feature-wise means of the real and generated images, respectively. Tr is the trace of the matrix, and ∑x and ∑y are the covariance matrices of the feature vectors [18–21]. In [19], the FID was empirically demonstrated to be a viable metric because of its resistance to mode dropping and to the choice of encoding network. The inception score (IS) is another metric used to evaluate image quality and diversity. The IS can be used to evaluate clear images generated by GANs, and the model used to train a GAN should output a high diversity of images from all the different classes in ImageNet [13,22].
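The FID of Equation (1) can be computed directly from two sets of feature vectors. The sketch below uses random numpy features as stand-ins for the Inception-v3 activations, and the function names are our own; the matrix square root is computed via an eigendecomposition of the symmetrized product:

```python
import numpy as np

def _sqrt_psd(mat):
    # symmetric PSD matrix square root via eigendecomposition
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * np.sqrt(np.clip(vals, 0, None))) @ vecs.T

def fid(feat_real, feat_fake):
    """FID between two sets of feature vectors (rows = samples)."""
    mu_x, mu_y = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    cov_x = np.cov(feat_real, rowvar=False)
    cov_y = np.cov(feat_fake, rowvar=False)
    # Tr((cov_x cov_y)^{1/2}) computed via the symmetric form
    s = _sqrt_psd(cov_x)
    covmean_trace = np.trace(_sqrt_psd(s @ cov_y @ s))
    return float(np.sum((mu_x - mu_y) ** 2)
                 + np.trace(cov_x) + np.trace(cov_y) - 2 * covmean_trace)

rng = np.random.default_rng(1)
real = rng.normal(size=(500, 5))
shifted = real + np.array([3.0, 0, 0, 0, 0])  # same covariance, mean moved by 3

print(round(fid(real, real), 6))     # 0.0 for identical feature sets
print(round(fid(real, shifted), 6))  # 9.0 = squared norm of the mean shift
```

Identical feature sets score 0, and a pure mean shift contributes exactly its squared norm, which is why lower FID values indicate statistically closer image sets.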
Outside ImageNet, the most common use of the IS is for generative models trained on CIFAR-10 [22], because CIFAR-10 is significantly smaller and more manageable for training than ImageNet. However, applying the IS to generative models trained on datasets other than ImageNet can yield misleading results [13,23]. The IS is high for clear images that are confidently classified into the specified object types [23,24]. Because our generated images are not represented in the classifier's training data, the IS is not ideal for our dataset.
In this study, we leveraged GAN models to produce larger datasets for accurate automatic recognition of expiry dates in photos, which requires a large amount of training data. We implemented and evaluated state-of-the-art GAN models, namely WGAN [5], WGAN-GP [21], WGAN-DIV [25], MMGAN [26], NSGAN [26], LSGAN [27], DRAGAN [28], ACGAN [29], DCGAN [30], EBGAN [31], VAE [32], and BEGAN [33], to augment our dataset. We concluded that the Wasserstein GAN with a gradient norm penalty (WGAN-GP) is a suitable data augmentation technique for our dataset. Our results show that the stability of WGAN-GP aids in the production of not only high-quality data but also images of a variety of styles. The images have an average Fréchet inception distance (FID) of 1.5298 across the 10 digits (0–9) and are nearly indistinguishable from our original dataset.
The remainder of this paper is structured as follows. The Background section introduces GANs for data augmentation and describes two metrics used to evaluate GAN performance: the FID and the inception score (IS). In the section on relevant literature, we introduce recent studies on data augmentation using GANs and state-of-the-art GAN models. In the section Data Augmentation for Engraved-Digit Images, we establish the method of analysis and discuss the evaluation metrics used. In the Evaluation and Analysis section, we provide brief details of the WGAN-GP model, describe its hyperparameters, and present loss plots to demonstrate the optimal results. Finally, we summarize our study in the Conclusions section.

Data Augmentation by GAN
Researchers have proposed and applied several GAN models to supplement datasets that are insufficient for classification tasks. These datasets vary across different application domains, and the GAN models most suitable for the augmentation process are often adopted. The augmented datasets are applied to a classifier for validation, and a set of benchmark results for the datasets is presented, which may then be used to further characterize and validate the datasets. Our work targets a small dataset of digit images engraved on medicines, ointments, and other forms of squeeze tubes, in which the images are usually distorted and blurry. We evaluated state-of-the-art GAN models to determine the most suitable model for the data augmentation task.
Wang et al. [34] integrated a GAN into an all-encompassing information-retrieval framework. Their approach demonstrated that GAN-based information retrieval systems are promising, but more work is required. The idea of data augmentation across several domains and simple models aided by large datasets can be extremely helpful in improving the performance of object detection applications [5,10,35]. In [36], the authors evaluated the effectiveness of two augmentation methods for CNN-based MODI script handwritten character recognition using the same CNN architecture. They confirmed that the on-the-fly data augmentation technique was more accurate than an offline approach. This strategy enabled the network to view a fresh collection of data each time, boosting the effectiveness of the system. Using the standard MNIST handwritten digit dataset in [37], the authors conducted an experimental evaluation of the advantages of data augmentation for convolutional backpropagation-trained neural networks, convolutional support vector machines, and convolutional extreme learning machine classifiers. Their work showed that in cases where reasonable data transforms are available, augmentation in the data space offers an advantage for enhancing performance and reducing overfitting. The authors investigated the advantages of training a machine learning classifier with examples that were artificially manufactured to supplement the data. In addition, in [38], the authors demonstrated the method of training a deep neural network (DNN) for optical character recognition (OCR) on historical manuscripts with a small amount of data. They examined various methods for data augmentation of palimpsests and evaluated the impact of various techniques on the performance of the DNN. In [8], the authors demonstrated the effective enhancement of conventional vanilla classifiers by a data augmentation generative adversarial network (DAGAN). 
Additionally, they demonstrated the usage of a DAGAN to improve few-shot learning systems, such as matching networks. Their approach was evaluated on the Omniglot [39], EMNIST [40], and VGG-face datasets [41].
In recent times, GANs have been extensively used in medical applications. In [10], the authors presented a literature review on the application of GANs in ophthalmology image domains to discuss key contributions and suggested probable future research trajectories. Because medical records can contain sensitive and personal patient data, training GANs with such original datasets may be regulated, and a need for synthetic datasets for research in this domain may arise [42]. Other applications of GANs in medicine are discussed in [43,44].

GAN Models for IR
The Wasserstein GAN with gradient penalty, WGAN-GP [21], was proposed as an alternative to clipping weights: it penalizes the norm of the gradient of the critic with respect to its input. This model has made progress towards the stable training of GANs with almost no hyperparameter tuning, unlike WGAN [5], which suffers from training instability and sometimes generates poor samples or fails to converge. WGAN-GP adds a gradient penalty to the WGAN discriminator loss as an alternative method for enforcing the Lipschitz constraint, which was previously performed by weight clipping. This penalty does not suffer from the bias toward simple functions that weight clipping induces in the discriminator. Additionally, the reformulation of the discriminator with a gradient penalty term makes batch normalization unnecessary. The Wasserstein divergence GAN, WGAN-DIV [25], introduces a symmetric divergence that has been shown, through optimization, to faithfully approximate the corresponding Wasserstein distance. It has been demonstrated to be stable under various settings, including progressive-growing training, and has exhibited superior results compared with state-of-the-art methods, both quantitatively and qualitatively.
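The gradient penalty term itself is simple to state. The sketch below uses a deliberately linear critic, whose input gradient is its weight vector everywhere, so the penalty can be checked by hand; real implementations obtain the gradient of a deep critic via automatic differentiation, and all names here are our own:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear critic f(x) = w.x + b, whose input gradient is w at every point.
# A real WGAN-GP critic is a deep network; the linear case only serves to
# make the penalty value verifiable by hand.
w = np.array([3.0, 4.0])
b = 0.5
lam = 10.0  # penalty coefficient lambda, as in WGAN-GP

def gradient_penalty(x_real, x_fake):
    # sample random interpolates on the line between real and fake points
    eps = rng.uniform(size=(x_real.shape[0], 1))
    x_hat = eps * x_real + (1 - eps) * x_fake
    # grad_x f(x_hat) = w for every interpolate (linear critic)
    grads = np.tile(w, (x_hat.shape[0], 1))
    norms = np.linalg.norm(grads, axis=1)
    # penalize deviation of the gradient norm from 1 (Lipschitz target)
    return float(lam * np.mean((norms - 1.0) ** 2))

x_real = rng.normal(size=(8, 2))
x_fake = rng.normal(size=(8, 2))
print(gradient_penalty(x_real, x_fake))  # 10 * (5 - 1)^2 = 160.0
```

With ||w|| = 5, every interpolate incurs the same penalty 10·(5 − 1)² = 160, illustrating how the term drives the critic's gradient norm toward 1 instead of clipping its weights.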
In the NSGAN [26], two models are trained simultaneously: G, which captures the data distribution, and D, which estimates the probability that a sample came from the training data rather than from G. The training objective for G is to maximize the probability that D makes a mistake. The generator loss is the only difference compared with the MMGAN [26]. In both the NSGAN and MMGAN, the output of D passes through a sigmoid activation and can be interpreted as a probability. In contrast, the output of D in LSGAN [27] is unbounded unless it passes through an activation function. LSGAN tackles the vanishing gradient problem associated with GANs by replacing the cross-entropy loss function with the least-squares (L2) loss function. In [27], the authors claimed that the L2 loss penalizes samples that the discriminator classifies correctly but that lie far from the decision boundary. Hence, the generated visuals are meant to closely resemble real data, and the training process is also stabilized. The D in DRAGAN [28] can likewise be interpreted as a probability, similar to the D models in the MMGAN and NSGAN. DRAGAN is very similar to WGAN-GP, as it applies a gradient penalty to obtain an improved training objective based on the optimal performance of D and G, but it appears to be less stable. The gradient penalty in DRAGAN is applied only close to the real data manifold, whereas WGAN-GP selects the gradient location on a random line between a real sample and a randomly generated fake sample.
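The least-squares losses that LSGAN substitutes for the cross-entropy terms can be written in a few lines. The sketch below assumes the common 0–1 target coding (fake → 0, real → 1); the function names and example critic outputs are our own:

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    # push D(real) toward 1 and D(fake) toward 0 with an L2 penalty
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    # push D(fake) toward 1; the quadratic keeps gradients alive even for
    # samples the discriminator already classifies confidently
    return 0.5 * np.mean((d_fake - 1.0) ** 2)

d_real = np.array([0.9, 1.1, 1.0])   # unbounded critic outputs on real data
d_fake = np.array([0.2, -0.1, 0.0])  # outputs on generated data

print(lsgan_d_loss(d_real, d_fake))
print(lsgan_g_loss(d_fake))
```

Unlike the cross-entropy loss, these quadratics grow with the distance of a sample from its target value, which is the mechanism by which LSGAN pulls generated samples toward the real-data side of the decision boundary.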
The auxiliary classifier GAN (AC-GAN) [29] is an extension of the conditional GAN that changes the discriminator to predict the class label of a given image rather than receive it as an input. It stabilizes the training process and allows the generation of large, high-quality images while learning a representation in the latent space that is independent of the class label. A deep convolutional GAN (DCGAN) is a generative adversarial network with convolutional neural networks as the generator and discriminator [30]. The DCGAN was proposed after evaluating a set of constraints on the architectural topology of convolutional GANs that allow them to be trained in a stable manner in most settings. The trained discriminators were used for image classification tasks, showing competitive performance with other unsupervised algorithms. To model the discriminator, D(x), as an energy function that assigns low energies to regions close to the data manifold and higher energies to other regions, a family of GANs called EBGANs has been proposed [31]. The EBGAN discriminator uses an autoencoder: the encoder extracts latent features from the input image, and the decoder subsequently reconstructs the image. Instead of a probability value, as in the original GAN, the discriminator outputs the reconstruction mean square error (MSE) between the input image and the reconstructed image. The variational autoencoder (VAE) [32] encodes an input into a latent representation z of a given dimension, reparametrizes z using its mean and standard deviation, and then reconstructs the image from the reparametrized z. BEGAN [33] optimizes a lower bound of the Wasserstein distance between the autoencoder loss distributions of the original and generated data, using an autoencoder as the discriminator. To maintain the discriminator and generator in equilibrium, the authors added an additional hyperparameter γ ∈ [0, 1].
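The VAE reparametrization step mentioned above can be sketched as follows; the encoder outputs (mu, logvar) here are hypothetical placeholders rather than values from a trained network:

```python
import numpy as np

rng = np.random.default_rng(42)

def reparameterize(mu, logvar, n_samples):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) and sigma = exp(logvar / 2).

    Sampling noise externally keeps the path from (mu, logvar) to z
    differentiable, which is what makes the VAE trainable by backprop.
    """
    sigma = np.exp(0.5 * logvar)
    eps = rng.normal(size=(n_samples,) + np.shape(mu))
    return mu + sigma * eps

# Hypothetical encoder outputs for a 3-dimensional latent space
mu = np.array([2.0, -1.0, 0.0])
logvar = np.array([0.0, 0.0, 0.0])  # sigma = 1 in every dimension

z = reparameterize(mu, logvar, 100000)
print(z.mean(axis=0))  # approaches mu as the sample count grows
```

The randomness lives entirely in eps, so the sampled z follows N(mu, sigma²) while mu and logvar remain ordinary differentiable inputs.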

Data Augmentation for Engraved-Digit Images
The objective of this study is to compare state-of-the-art GAN models and identify the models suitable for data augmentation for our specific type of dataset, after evaluating the generated images, the variety of images produced, and the performance of each GAN model according to the FID score after training. The GAN models were trained with optimal tuning for the same number of epochs.

Engraved-Digit Image Dataset
As shown in Figure 3, our dataset has the properties of both digits and unusual shadows cast by the engraved shapes in the images. The images were collected from medicines, gels, and tube-type ointments. Classification models perform poorly when insufficient or blurry data are used for model training. We trained state-of-the-art GAN models on this limited dataset to produce high-quality grayscale fake images. The images were separated into classes of 0–9. We collected approximately one hundred images per digit, which were selectively trained one class at a time to evaluate the diversity of the images produced by each GAN. The data were pre-processed for standardization: we converted all input images to grayscale, reshaped them to 128 × 128 × 1, and scaled the pixel values to the range [−1, 1]. This was done to keep the color representation consistent across images.
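A minimal sketch of this pre-processing, assuming RGB photos as input, might look as follows; the luminance weights and the nearest-neighbour resize are our own simplifications (production code would typically use a library resampler):

```python
import numpy as np

def preprocess(img):
    """RGB uint8 image (H, W, 3) -> grayscale float array (128, 128, 1) in [-1, 1]."""
    # luminance-weighted grayscale conversion
    gray = img[..., 0] * 0.299 + img[..., 1] * 0.587 + img[..., 2] * 0.114
    # nearest-neighbour resize to 128 x 128, dependency-free for this sketch
    h, w = gray.shape
    rows = np.arange(128) * h // 128
    cols = np.arange(128) * w // 128
    resized = gray[rows[:, None], cols[None, :]]
    # scale pixel values from [0, 255] to [-1, 1] and add a channel axis
    return resized / 127.5 - 1.0[..., None] if False else (resized / 127.5 - 1.0)[..., None]

rng = np.random.default_rng(0)
photo = rng.integers(0, 256, size=(200, 300, 3), dtype=np.uint8)
x = preprocess(photo)
print(x.shape, float(x.min()), float(x.max()))  # (128, 128, 1), values within [-1, 1]
```

Mapping pixels to [−1, 1] matches the tanh output range commonly used by GAN generators, so real and generated images share the same value scale during training.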

Data Augmentation for Our Dataset by GANs
We trained our dataset using the following state-of-the-art GAN models: MMGAN, NSGAN, LSGAN, ACGAN, DCGAN, WGAN, WGAN-GP, WGAN-DIV, DRAGAN, BEGAN, EBGAN, and VAE. We implemented these models and examined the images produced by each GAN. In addition, we closely analyzed the generator and discriminator loss plots of each GAN.

A Classification of GANs
As shown in Table 1, we classified the GAN models into four groups. Group 1 is based on the way D, also known as the critic C, maximizes its objective to differentiate between real and fake images. Group 2 comprises models based on the way D estimates the probability that a sample originates from the training data rather than from G; the training procedure for G maximizes the probability of D making a mistake. Group 3 is based on D's main task, i.e., image classification. The basic architecture of the GANs in Groups 1, 2, and 3 is shown in Figure 2. In contrast, the GANs in Group 4 have an autoencoder in their architecture. The VAE architecture comprises an encoder and a decoder trained to minimize the reconstruction loss, with the input being encoded as a distribution over the latent space. Figure 4 shows the embedding of autoencoders in the EBGAN and BEGAN architectures. Both GANs can be represented using the same architecture; however, each calculates the reconstruction loss differently: EBGAN uses the mean square error (MSE), whereas BEGAN uses the Wasserstein distance.

Digit Image Generation by GANs
The 0-9-digit images generated with our dataset by the GANs are shown in Table 2, along with some selected outputs. The number under each image is the epoch in which the image was generated. We selected the best and most stable images produced during 50,000 epochs of execution. We trained each class of our datasets (digits 0-9) for 50,000 epochs with sample images saved every 1000 epochs.

FIDs and ISs
We calculated the FID [10,17], which measures the statistical difference between the features of our original data and the generated images, as shown in Table 3. FID addresses the flaw of IS, in which the statistics of the original and generated samples are not compared [20]. A perfect score of 0.0 indicates that the two sets of images are identical, and lower scores indicate that the two groups of images are statistically more similar. Table 4 lists the IS values calculated for each GAN output image. The average IS calculated across all GANs in Table 4 is approximately 1.0. Unlike FID, which compares the statistics of the generated samples with those of the original samples, IS measures the realism of a GAN output by using a pre-trained deep learning model to predict the class probabilities of each generated image. A high IS is desirable and usually ranges from 1 to the number of classes in the pre-trained network [23]. In our case, the IS is very low because the generated images were not present in the training data of the classifier. Hence, the IS is not ideal for our dataset.
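The FID comparison described above can be sketched numerically. Under the standard assumption that the extracted feature activations are modeled as Gaussians, the distance reduces to the closed form d² = ||μ₁−μ₂||² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^½). The following minimal NumPy sketch (the helper name `fid` is illustrative, not from our implementation) computes this directly from feature means and covariances, using the identity Tr((Σ₁Σ₂)^½) = Σᵢ √λᵢ(Σ₁Σ₂) to avoid an explicit matrix square root:

```python
import numpy as np

def fid(mu1, sigma1, mu2, sigma2):
    """Frechet distance between two Gaussians fitted to feature activations.

    Uses tr(sqrtm(S1 @ S2)) == sum of sqrt eigenvalues of S1 @ S2, which
    holds because S1 @ S2 is similar to the PSD matrix S1^0.5 S2 S1^0.5.
    """
    diff = mu1 - mu2
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    # numerical noise can leave tiny negative/imaginary parts; clip them away
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1 + sigma2) - 2.0 * tr_sqrt)

# toy usage: fit Gaussian statistics to stand-in feature vectors
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 8))          # stand-in for Inception features
mu, sigma = feats.mean(axis=0), np.cov(feats, rowvar=False)
print(fid(mu, sigma, mu, sigma))           # identical statistics give FID ~ 0
```

In practice the means and covariances are computed from Inception-network activations of the real and generated image sets; shifting the generated distribution's mean away from the real one increases the score quadratically.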

Evaluation and Analysis
We evaluated the GANs by visually inspecting the quality of the images in Table 2 and by the average FID score of each GAN model, as shown in the last column of Table 3.
In Table 2, the BEGAN and VAE outputs are not considered suitable for data augmentation, even though their FID values are the lowest at 1.0945 and 0.484, respectively, because the images are of poor quality and show little or no diversity. The DRAGAN model is similar to WGAN-GP, but its generated images are of poorer visual quality than the output images from WGAN-GP and WGANDIV. Considering our evaluation metrics, the WGAN-GP and WGANDIV outputs are the most preferred, with WGANDIV slightly outperforming WGAN-GP on the FID score. The corresponding average FID scores of WGAN-GP and WGANDIV are 1.5298 and 1.4933, respectively, which corroborates their image quality; the low FID values translate to smaller distances between the fake and original data distributions. Nevertheless, we prefer WGAN-GP because of the noticeably better quality of its images across the digits 0–9. While each GAN displays similarities in its plots across the digits 0–9, Figures 5–7 show the loss plots of WGAN-GP, WGANDIV, and BEGAN for the digit-3 images, and Figures 8–10 show the corresponding plots for the digit-0 images. We selected only digits 3 and 0, but the WGAN-GP plots show consistency and stability across the entire range of digit images 0–9.
Unlike the BEGAN plots in Figures 7 and 10, the WGAN-GP plots in Figures 5 and 8 show an increase in the discriminator losses (D_Loss) and a decrease in the generator losses (G_Loss). In all the WGAN-GP plots in our experiment, it is evident that, whereas the discriminator attempts to maximize its ability to distinguish real data from fake data, the generator attempts to minimize the distance between the generated and real data.

The performance of a GAN model strongly depends on the dataset, and models are evaluated based on the quality of the images generated in the context of a specific target domain [19]. On our dataset, WGAN-GP achieves an average FID score of 1.5298. This score and the generated images reflect the high quality of the data required for augmentation. Hence, we conclude that WGAN-GP is an appropriate model for replenishing our dataset. In Table 5 [19], WGAN-GP shows good performance in terms of the FID score across several other datasets; we compared this result with the FID score of only one class of our dataset, digit 3. The FID scores obtained in a large-scale hyperparameter search for the FASHION (60,000 training images), CIFAR-10 (6000 images per class), and CELEBA (202,599 face images) datasets are displayed [19]. The best scores are the WGAN-GP scores of 24.5, 55.8, 30, and 0.761 for the FASHION, CIFAR-10, CELEBA, and ENGRAVED DIGITS datasets, respectively.
The WGAN-GP we implemented is shown in detail in Figure 11 and Algorithm 1. During the implementation, the discriminator network (C) was first trained on a real batch of data (x) and then on a batch of data generated from the noise (Z) via the generator (G). This was required to provide a random weighted average between real and generated image samples, which is needed for the gradient norm penalty. The discriminator's loss function was arranged such that it estimates the Wasserstein distance with the gradient penalty. The gradient penalty was computed over a batch of these averaged samples, using the critic's predictions on the weighted real/fake samples, to constrain the critic toward 1-Lipschitz continuity. The performance of Algorithm 1 depends on the number of iterations n_θ required for the generator parameters θ to converge, the number of critic iterations per generator iteration, n_critic, and the batch size, m: O(n_θ · n_critic · m).
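The gradient-penalty step described above can be sketched in isolation. The following minimal NumPy sketch (not our full training loop; `gradient_penalty`, `critic_grad`, and the toy linear critic are illustrative assumptions) shows the random interpolation x̂ = εx + (1−ε)x̃ between real and fake batches and the penalty term λ(‖∇C(x̂)‖₂ − 1)², averaged over the batch:

```python
import numpy as np

LAMBDA = 10.0  # penalty coefficient used in the WGAN-GP paper

def gradient_penalty(critic_grad, real, fake, rng):
    """Average penalty LAMBDA * (||grad C(x_hat)|| - 1)^2 over a batch of
    random interpolates x_hat = eps*real + (1-eps)*fake."""
    eps = rng.uniform(size=(real.shape[0], 1))   # one epsilon per sample
    x_hat = eps * real + (1.0 - eps) * fake      # weighted real/fake mix
    grads = critic_grad(x_hat)                   # dC/dx at each interpolate
    norms = np.linalg.norm(grads, axis=1)        # per-sample gradient norm
    return float(LAMBDA * np.mean((norms - 1.0) ** 2))

# toy linear critic C(x) = w @ x, whose gradient is w everywhere;
# ||w|| = 1, so the penalty is ~0 and the critic is exactly 1-Lipschitz
w = np.array([0.6, 0.8])
critic_grad = lambda x: np.tile(w, (x.shape[0], 1))

rng = np.random.default_rng(1)
real = rng.normal(size=(4, 2))
fake = rng.normal(size=(4, 2))
print(gradient_penalty(critic_grad, real, fake, rng))
```

In a real implementation the critic's gradient with respect to x̂ is obtained by automatic differentiation rather than analytically, but the penalty formula is the same: it pushes the critic's gradient norm toward 1 on points sampled between the two distributions.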

Conclusions
In this study, we investigated data augmentation for engraved-digit image datasets of expiry dates. Limited, fuzzy, and distorted digit images needed to be augmented to improve image identification on consumables and cosmetic products. We evaluated the state-of-the-art GAN models MMGAN, NSGAN, LSGAN, ACGAN, DCGAN, WGAN, WGAN-GP, WGANDIV, DRAGAN, BEGAN, EBGAN, and VAE by visually inspecting the results and calculating the FID values for each GAN. WGAN-GP and WGANDIV show stability and are suitable for the data augmentation task; however, we consider WGAN-GP the preferred GAN owing to the quality of its output images after visual inspection. The consistency and stability of its G_Loss and D_Loss plots over the digits 0–9 are also satisfactory.
Our future research will focus on the automatic recognition of engraved expiry-date digit images. After augmenting the limited dataset to abundance with WGAN-GP, we intend to build a recognition model and train it with these synthetic datasets to a very high degree of confidence. We conjecture that the image quality and diversity from WGAN-GP will contribute to our model's stability. The new recognition model will not only be tailored to engraved-digit image recognition but will also serve as a benchmark for models that require training on similar datasets to high performance.