License Plate Image Generation using Generative Adversarial Networks for End-To-End License Plate Character Recognition from a Small Set of Real Images

Abstract: License Plate Character Recognition (LPCR) is a technology for reading vehicle registration plates from images and videos using optical character recognition, and it has a long history owing to its usefulness. While LPCR has been significantly improved by advances in deep learning, training deep networks for an LPCR module requires a large number of license plate (LP) images and their annotations. Unlike other types of vehicle information covered by public datasets, each LP carries a unique combination of characters and numbers that depends on the country or region. Therefore, collecting a sufficient number of LP images is extremely difficult for most research efforts. In this paper, we propose LP-GAN, an LP image generation method that applies an ensemble of generative adversarial networks (GANs), and we also propose a modified lightweight YOLOv2 model for an efficient end-to-end LPCR module. With only 159 real LP images available online, thousands of synthetic LP images were generated using LP-GAN. The generated images not only look similar to real ones, but they were also shown to be effective for training the LPCR module. In performance tests with 22,117 real LP images, the LPCR module trained with only the generated synthetic dataset achieved 98.72% overall accuracy, which is comparable to that of training with a real LP image dataset. In addition, the proposed lightweight model made LPCR processing about 1.7 times faster than the original YOLOv2 model.


Introduction
License plate (LP) character information is uniquely assigned so that each vehicle on the road can be identified. Therefore, it is widely used for vehicle recognition in situations such as toll collection on highways, speed and signal violation enforcement, and illegal parking detection [1][2][3]. Because the rapid increase in vehicles has caused numerous traffic-related problems, research on improving the traffic environment is currently underway, and LP information serves as an important data source [4]. For this reason, the study of Automatic LP Recognition (ALPR) has been pursued for a long time and continues to this day. License plate character recognition (LPCR) is a technology for reading vehicle registration plate character information using optical character recognition. In the conventional LPCR process, each character is segmented from the LP image and character recognition is performed on the individual characters. An end-to-end LPCR method using an object detector based on a convolutional neural network (CNN) can recognize LP character information by performing character segmentation and character recognition simultaneously from the LP image. However, this method requires a significantly large number of LP images and their character annotations for network training. Unlike other types of vehicle information covered by public datasets, each LP carries a unique combination of characters and numbers that depends on the country or region.
Consider, as an example, the need to develop an LPCR module of an ALPR system for Korean LPs. After determining a method and an algorithm for the LPCR, a sufficient sample of real LP images is needed for development and testing. If machine learning or deep learning is chosen for the LPCR algorithm, even more LP image data are needed, but unfortunately, it is extremely difficult to obtain enough real LP image data. Since it takes a great deal of time, effort, and money to obtain enough LP images to develop a practical LPCR module, one can instead search for public LP datasets. Table 1 shows the available LP-related image datasets from the Caltech Cars dataset [5], released in 1999, to the present. Most of the datasets contain hundreds of images, while the UFPR-ALPR dataset [6] offers 4500 images. However, these LP images are not suitable for training an LPCR module for Korean LPs, which consist of numbers and Korean characters, because these datasets are composed of numbers and English or Chinese characters. Thus, Korean LP images had to be collected through web-scraping, but only a few hundred were available.
The main contribution is three-fold. First, a GAN-based LP generator was built from a small set of real LP images to generate realistic LP images with the desired character information for use as training data for the LPCR module of the ALPR system. LP-GAN can be trained with an extremely limited amount of LP image data and can then generate realistic LP images. The generated LP images were used as training data for the LPCR, and experiments confirmed that character information could be effectively recognized from real LP images. In addition, it is shown that LPCR performance can be improved by an ensemble of LP images generated by multiple LP-GAN generators rather than a single one. Figure 1 shows sample LP images generated by the various GAN methods alongside real LP images. LP-GAN can easily generate LP images with any character combination, and the generated images look very realistic.
Second, a CNN-based object detector was developed that performs character segmentation and character recognition simultaneously; these tasks are carried out separately in traditional ALPR systems. For LPCR methods using conventional optical character recognition, segmentation of each character in the LP image must precede character recognition, so successful character segmentation greatly affects LPCR performance. In contrast, the CNN-based object detector performs LPCR in a segmentation-free end-to-end manner that is unaffected by character segmentation problems.
Third, an extensive test of the algorithms was performed with 22,117 real LP images under various conditions. These real LP images were not used in any training phase of the LP-GAN-based LPCR module. Since the LPCR module trained with the dataset generated by LP-GAN achieved accuracy comparable to one trained with a real LP image dataset, the high feasibility of GAN-based data generation for LP images was successfully demonstrated.
The rest of this paper is organized as follows. Section 2 reviews the GAN models studied so far, especially image-to-image translation methods, and recent studies related to LP recognition technology, including traditional LP recognition methods. The three GAN-based image-to-image translation methods used in this study and the generation of Korean LP images using LP-GAN are described in Section 3. The CNN-based object detector that performs character segmentation and character recognition simultaneously in a segmentation-free end-to-end manner is discussed in Section 4. Section 5 describes the datasets configured for the experiments, gives details on the procedure of generating Korean LP images with the three LP-GANs and using them as training data for the LPCR module, and finally reports the performance of the LPCR on real LP images. Conclusions are given in Section 6.

Image-to-Image Translation
Goodfellow et al. [13] proposed the GAN framework, in which a generator and a discriminator trained adversarially gradually improve each other's performance, so that the generator learns to produce data as close as possible to the target distribution. Mirza et al. [14] suggested an improvement on GAN called Conditional GAN (cGAN). Since GANs generate output data from random noise input, control over the output is difficult; cGAN, however, can partly control the output by adding conditions to the GAN. Since then, cGAN has been applied in many fields such as image generation [15][16][17][18], image domain transfer [19][20][21][22], image super-resolution [23,24], and image editing [25].
For image-to-image translation problems, Isola et al. [19] introduced cGAN to solve the problem of blurring of the resultant image as a result of pixel-to-pixel translation based on CNN. Zhu et al. [20] proposed an unpaired image-to-image translation model to solve the problem that existing image-to-image translation models need the paired images of input and output as training data. Existing image-to-image translation models can successfully translate images between two domains but scalability and robustness are limited for more than two domains; Choi et al. [21] suggested StarGAN to solve the problem of translating images between multiple domains in a single model.
In this study, LP images were generated using the three state-of-the-art GAN-based image-to-image translation models mentioned earlier. Moreover, it was verified through experiments that the generated LP images are similar to real ones and that they were sufficient for use as training data for the LPCR.
Traditional ALPR systems are segmentation-based algorithms whose performance at each stage is heavily influenced by environmental factors such as distortion, contamination, illumination, and noise. In recent studies, segmentation-free algorithms based on deep neural networks (DNNs) have been proposed to overcome these factors [41][42][43].

License Plate Image Generation Via LP-GAN
This section discusses the generation of LP images using the three state-of-the-art GAN-based image-to-image translation methods.

GAN Approaches
Existing CNN-based image-to-image translation methods suffer from non-photo-realistic results because the loss function uses the average of the per-pixel losses as the total loss. To solve this problem, Isola et al. [19] proposed an image-to-image translation algorithm, pix2pix_cGAN, that uses U-Net [44] as the generator network to reduce the loss of information in an encoder-decoder structure. PatchGAN [45] was used in the discriminator network to improve the detail of the resultant image through a per-patch loss. Pix2pix_cGAN is trained on paired datasets, although in practice unpaired data is far more common than paired data. Zhu et al. [20] suggested CycleGAN, which uses cycle consistency to enable image-to-image translation learning from unpaired datasets. This approach uses ResNet [46] as the generator network, LSGAN [47] for the loss, and PatchGAN, the same discriminator network used in pix2pix_cGAN.
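The cycle-consistency idea can be sketched in a few lines: translating an image to the other domain and back should reconstruct the original. The snippet below is a minimal illustration with placeholder "generators" (real CycleGAN generators are ResNet-based CNNs); only the wiring of the loss terms is shown.

```python
import numpy as np

def l1(a, b):
    # Mean absolute error, as used for the cycle-consistency term.
    return np.abs(a - b).mean()

# Stand-in "generators": G maps domain X -> Y, F maps Y -> X.
# These invertible placeholders only illustrate how the loss is wired.
def G(x):
    return x * 0.9 + 0.05

def F(y):
    return (y - 0.05) / 0.9

x = np.random.rand(4, 3, 8, 8)   # a batch of images from domain X
y = np.random.rand(4, 3, 8, 8)   # a batch of images from domain Y

# Cycle-consistency loss (Zhu et al. [20]): X -> Y -> X and Y -> X -> Y
# should both reconstruct the input.
loss_cyc = l1(F(G(x)), x) + l1(G(F(y)), y)
print(loss_cyc)
```

Here the loss is near zero because F is the exact inverse of G; in training, this term is minimized jointly with the adversarial losses of both generators.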
The existing image-to-image translation methods require k(k-1) generator networks to perform image translation among k domains. To address this, Choi et al. [21] implemented StarGAN, which enables image-to-image translation among multiple domains with a single generator. A domain classification loss and a reconstruction loss are used in this method for multi-domain image-to-image translation. A target domain label, consisting of a binary or one-hot vector, specifies the target domain to which the input image is translated.
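Concretely, StarGAN conditions its single generator by tiling the target-domain label over the spatial dimensions and concatenating it to the input image channels. The sketch below illustrates this conditioning step only; the function name and shapes are illustrative, not StarGAN's actual API.

```python
import numpy as np

def with_domain_label(image, target_domain, num_domains):
    """Concatenate a spatially tiled one-hot domain label to an image
    (C, H, W) -> (C + num_domains, H, W), as in StarGAN-style input
    conditioning. Minimal sketch, not the actual StarGAN code."""
    c, h, w = image.shape
    onehot = np.zeros(num_domains, dtype=image.dtype)
    onehot[target_domain] = 1.0
    # Tile each label entry into a constant plane of size (H, W).
    label_planes = onehot[:, None, None] * np.ones((num_domains, h, w),
                                                   dtype=image.dtype)
    return np.concatenate([image, label_planes], axis=0)

img = np.random.rand(3, 256, 256).astype(np.float32)
# Two domains in this study: 0 = label image, 1 = LP image.
x = with_domain_label(img, target_domain=1, num_domains=2)
print(x.shape)  # (5, 256, 256): 3 image channels + 2 label planes
```

With this conditioning, one generator can be asked to translate toward any domain simply by changing the one-hot vector.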

License Plate Image Generation
Currently, LP characters in Korea are composed of seven black characters on a white background, as shown in Figure 2a. The first two digits are the car-code for the type of vehicle, and the third, Korean, character is the use-code specifying the use of the vehicle. The last four digits are the serial number. Figure 2b shows the character classes present in Korean LPs. Numbers are defined as 10 classes from 0 to 9, and Korean characters are defined as 35 classes. In this study, an ID of C1 to C35 was assigned to each of the 35 Korean characters for convenience of expression.
As mentioned in the introduction, consider a situation in which only a small set of real LP images is available. By searching for 'license plate' through websites such as Google and SNS, we were able to scrape over 5,000 related images, of which only 159 LP images were actually usable. In the collected 159 LP image dataset (Web_159), there are 954 numbers and 159 Korean characters. Figure 3a shows the character class distribution of the Web_159 dataset. As shown in Figure 3a, the C15 Korean character class is not present in the Web_159 dataset, and there is only one instance each of classes C12, C18, C33, and C34. To train LP-GAN for generating LP images, paired images were prepared: the target images (i.e., Web_159) and the label images. As shown in Figure 4, the widths of the paired images were resized to 256 pixels and then zero-padded at the top and bottom to normalize them to 256 × 256 pixels. Each character in the label image matches the character at the same position in the target image. Finally, the normalized paired images were input into the proposed LP-GAN as the training data.
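The resize-and-pad normalization described above can be sketched as follows. This is a minimal numpy version with nearest-neighbor resizing; real code would use an image library, and it assumes the LP crop is wider than it is tall (so the resized height fits in 256 pixels).

```python
import numpy as np

def normalize_lp(image):
    """Resize the width to 256 pixels while keeping the aspect ratio
    (nearest-neighbor), then zero-pad the top and bottom to 256 x 256.
    Sketch of the preprocessing described in the text."""
    h, w, c = image.shape
    new_w = 256
    new_h = max(1, round(h * new_w / w))
    # Nearest-neighbor resize via integer index mapping.
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    resized = image[rows][:, cols]
    # Zero-pad top and bottom to reach a square 256 x 256 canvas.
    pad_top = (256 - new_h) // 2
    pad_bot = 256 - new_h - pad_top
    return np.pad(resized, ((pad_top, pad_bot), (0, 0), (0, 0)))

lp = np.random.randint(0, 255, (110, 520, 3), dtype=np.uint8)  # a wide LP crop
out = normalize_lp(lp)
print(out.shape)  # (256, 256, 3)
```

De-normalization after generation (Section 3.2) is simply the inverse: crop away the padded rows and resize back to the original aspect ratio.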
To generate LP images with the LP-GAN generator after training had been completed, 9000 input label images, in which each LP character was drawn from a uniformly random distribution, were created to form the Label_9k dataset. Figure 3b shows the character class distribution of the Label_9k dataset, in which all classes are distributed uniformly, including the C15 character class that is absent from the Web_159 dataset. Figure 5 shows the process of generating 9000 LP images from the 9000 input label images using LP-GAN. After the input label images of the Label_9k dataset had been converted to zero-padded normalized images, the latter were input into the LP-GAN generator, after which the output images were de-normalized to produce the final LP images.
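Building a Label_9k-style set of uniformly random plate strings is straightforward. The sketch below uses the C1..C35 placeholder IDs from the text in place of the actual Hangul use-code glyphs; the function names are illustrative.

```python
import random

DIGITS = "0123456789"
# The 35 Korean use-code characters are referred to as C1..C35 in the
# text; placeholder IDs stand in for the actual Hangul glyphs here.
KOR_CLASSES = [f"C{i}" for i in range(1, 36)]

def random_plate(rng):
    """One Korean LP string: 2-digit car-code, one use-code character,
    and a 4-digit serial number, each drawn uniformly."""
    car = "".join(rng.choice(DIGITS) for _ in range(2))
    use = rng.choice(KOR_CLASSES)
    serial = "".join(rng.choice(DIGITS) for _ in range(4))
    return car + use + serial

rng = random.Random(0)
label_9k = [random_plate(rng) for _ in range(9000)]
print(len(label_9k), label_9k[0])
```

Each string is then rendered as a label image (black characters on a white plate layout) and fed through the trained generator.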

Segmentation-Free End-to-End LPCR By Object Detector
This section discusses how an end-to-end object detector can be used as a segmentation-free end-to-end LPCR method. Redmon et al. [48] proposed a novel state-of-the-art real-time object detector (YOLOv2) that can detect 9000 different categories. While existing CNN-based object detection models such as Faster R-CNN [49] perform region proposals first and then classify each bounding box, the YOLOv2 detector treats region proposals and class probabilities as one regression problem and performs the localization and classification of objects simultaneously in a single CNN. With this approach, the detector only needs to detect 45 character classes in LP images, and the whole LPCR process is carried out at once.
In practical ALPR systems, it is necessary to minimize the processing time and the required GPU memory in the LPCR stage, because the system must carry out other important processes such as acquiring images, storing recognition results, and communicating with a remote server. In addition, when considering cost, a powerful but expensive GPU may not always be available to the ALPR system, so the architecture needs to be as cost-effective and lightweight as possible. To address these issues, we propose a modified YOLOv2 model with half of the CNN-layer filters of the YOLOv2 detector architecture proposed by Redmon et al. [48] as the LPCR method. The architecture and structure of the proposed LPCR module (i.e., the modified YOLOv2) are given in Table 2 and Figure 6, respectively. The modified YOLOv2 model produces a 13 × 13 × 512 feature map from a 416 × 416 pixel 3-channel image through five stages of convolutional and maxpool layers (Figure 6B). Next, the 26 × 26 × 256 feature map layers pass through and are reshaped into 13 × 13 × 128 feature map layers, as shown in Figure 6A, and become 13 × 13 × 640 feature map layers by concatenation with the output of Figure 6B. In the last stage, 13 × 13 × 250 layers are output for the localization and classification of the 45 character classes that need to be detected (i.e., the prediction of 5 boxes, each with 5 coordinates and 45 class scores: 5 × (5 + 45) = 250). If a sufficient amount of training data is provided, the proposed YOLOv2 detector, as a segmentation-free end-to-end LPCR method, can achieve high performance in detecting the 10 number classes and 35 character classes, which was experimentally confirmed.
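The 250-channel output can be decoded per grid cell as in YOLOv2: sigmoid offsets for the box center and objectness, anchor-scaled exponentials for the box size, and an argmax over the 45 class scores. The sketch below assumes one common per-box channel layout and illustrative anchor sizes; both are assumptions for demonstration, not the exact values of the proposed model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_cell(raw, anchors, cell_x, cell_y, grid=13):
    """Decode one 13x13 grid cell of the 250-channel output:
    5 boxes x (5 coords + 45 character classes)."""
    boxes = []
    preds = raw.reshape(5, 50)  # assumed layout: 5 boxes, 50 values each
    for (aw, ah), p in zip(anchors, preds):
        tx, ty, tw, th, to = p[:5]
        bx = (cell_x + sigmoid(tx)) / grid  # box center, image-relative
        by = (cell_y + sigmoid(ty)) / grid
        bw = aw * np.exp(tw) / grid         # box size, anchor-scaled
        bh = ah * np.exp(th) / grid
        conf = sigmoid(to)                  # objectness
        cls = int(np.argmax(p[5:]))         # one of the 45 classes
        boxes.append((bx, by, bw, bh, conf, cls))
    return boxes

anchors = [(1.0, 2.0), (1.5, 3.0), (2.0, 4.0), (3.0, 5.0), (4.0, 7.0)]  # assumed
raw = np.random.randn(250).astype(np.float32)
boxes = decode_cell(raw, anchors, cell_x=6, cell_y=6)
print(len(boxes))  # 5 candidate boxes for this cell
```

After decoding all 169 cells, low-confidence boxes are discarded and non-maximum suppression leaves one detection per character.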

Experimental Section
In this section, the dataset configuration for the experiments is reported and the experimental results are discussed after describing the implementation details. In the experiments, quantitative evaluation is performed to demonstrate the usefulness of the LP images generated by the proposed LP-GAN generators as training data for the LPCR. The experimental results show that the LPCR trained with LP images from all three LP-GAN generators outperformed the LPCR trained with LP images from a single LP-GAN generator.

Web-Scraped Real Images
As mentioned in Section 3.2, 159 real LP images were collected through web-scraping. The Web_159 dataset was used to train the three LP-GAN generators. To compare the LPCR performance when using a small set of data, the Web_159 dataset was also used to train the LPCR module.

Generated Datasets by LP-GAN
To test whether the LP images generated by the LP-GAN generators can be used as training data for the LPCR module of the ALPR system, we prepared several training datasets under various conditions. First, the Label_9k dataset was input into the LP-GAN generator trained by pix2pix_cGAN, CycleGAN, or StarGAN, yielding one training LP dataset for each (pix2pix_cGAN_9k, CycleGAN_9k, and StarGAN_9k, respectively). Next, to test the performance of the LPCR according to the amount of training data, another three training datasets were prepared by randomly selecting 3000 images from each of the pix2pix_cGAN_9k, CycleGAN_9k, and StarGAN_9k datasets (pix2pix_cGAN_3k, CycleGAN_3k, and StarGAN_3k, respectively). Last, to confirm whether LP images from the three LP-GAN generators together enhance LPCR performance more than each on its own, two ensemble datasets were prepared: the Ensemble_9k dataset combines pix2pix_cGAN_3k, CycleGAN_3k, and StarGAN_3k, while the Ensemble_3k dataset combines 1000 images randomly selected from each of those three datasets.
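The dataset assembly described above amounts to a few random samples and concatenations. The sketch below uses placeholder file names for the generated images; in practice these would be lists of paths to the generator outputs.

```python
import random

rng = random.Random(42)
# Placeholder datasets: in practice, lists of generated image paths.
pix2pix_9k  = [f"pix2pix_{i}.png" for i in range(9000)]
cyclegan_9k = [f"cyclegan_{i}.png" for i in range(9000)]
stargan_9k  = [f"stargan_{i}.png" for i in range(9000)]

# 3k subsets: 3000 images randomly selected from each 9k dataset.
pix2pix_3k  = rng.sample(pix2pix_9k, 3000)
cyclegan_3k = rng.sample(cyclegan_9k, 3000)
stargan_3k  = rng.sample(stargan_9k, 3000)

# Ensemble_9k: the three 3k subsets combined.
ensemble_9k = pix2pix_3k + cyclegan_3k + stargan_3k
# Ensemble_3k: 1000 images randomly selected from each 3k subset.
ensemble_3k = (rng.sample(pix2pix_3k, 1000)
               + rng.sample(cyclegan_3k, 1000)
               + rng.sample(stargan_3k, 1000))
print(len(ensemble_9k), len(ensemble_3k))  # 9000 3000
```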

Real Datasets for Comparison and Testing
A total of 22,117 real LP images captured by CCTV at more than 10 different locations were obtained, all labeled with their character information. Of these, 9000 LP images were randomly selected as the training dataset (Real_9k), and the remaining 13,117 LP images formed the test dataset (Test_13k). In addition, for a like-for-like comparison, 3000 LP images were randomly selected from the Real_9k dataset to form the Real_3k dataset, and likewise 159 LP images were randomly selected from the Real_3k dataset as the Real_159 dataset for comparison with the Web_159 dataset. The 13 datasets prepared for the experiments are summarized in Table 3.

Implementation Details
This subsection presents the detailed setup of the system, algorithms, and frameworks used in the experiments. All experiments were performed on PC systems with Intel-i7 CPUs and NVIDIA Titan Xp GPUs. For the three state-of-the-art GAN-based image-to-image translation models (pix2pix_cGAN, CycleGAN, and StarGAN) for generating LP images, the code published by each author was used [50,51]. The LPCR performance experiments were performed using the Darknet framework [48] and using the modified YOLOv2 model in which the number of CNN-layer filters is reduced by half.

LP Generation
The code for pix2pix_cGAN and CycleGAN was written in PyTorch [52] by the same group of researchers; although the network architectures differ, the training options are nearly identical. The input image size was set to 256 pixels, and since the training data had already been zero-padded and normalized to 256 × 256 pixels, the preprocessing option in the code was set to 'none'. The other options (number of iterations, batch size, learning rate, etc.) were set to the default values provided by the authors. StarGAN, also written in PyTorch, is capable of multi-domain translation, but in this experiment only translation between two domains (label images and LP images) was used. As with the other two models, the input image size was set to 256 pixels, and the domain dimension was set to 2. The other StarGAN parameters, such as the number of iterations, batch size, and learning rate, were set to the default values provided by the author.
As mentioned previously, the pix2pix_cGAN_9k, CycleGAN_9k, and StarGAN_9k datasets were generated from the Label_9k dataset using the three LP-GAN generators. Figure 1 shows some of the resultant images from each proposed LP-GAN generator for the same input label image. Figure 1a shows the input label images; Figure 1b-d show the resultant images from each LP-GAN generator; and Figure 1e,f present the real LP images acquired by CCTV and web-scraping, respectively. From the results, it can be seen that the proposed LP-GAN generators could translate in the same style as the real LPs while maintaining the character information of the input label images. In particular, even when the input label images contained character information that did not exist in the training data (such as the C15 character class), the character class was translated just as correctly as the existing character classes.

LP Recognition
The LPCR modules were trained by using the LP images generated through the proposed LP-GAN generators. Subsequently, the results of the experiments on recognizing the characters of the real LP images show that the LP images generated by the LP-GAN generators were similar to the real LP images and could be used as training data for the LPCR module in the ALPR system. The LPCR modules based on the modified YOLOv2 were trained with an input image size of 416 × 416 pixels, a starting learning rate of 0.00025, a batch size of 64, weight decay of 0.0005, and momentum of 0.9. The training was conducted using four NVIDIA Titan Xp GPUs.
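The training hyperparameters above correspond directly to the [net] section of a Darknet configuration file. The fragment below is a sketch of such a section using only the values stated in the text; the layer definitions themselves (Table 2) are omitted, and any keys not listed here would be left at the Darknet defaults.

```ini
[net]
# Training hyperparameters stated in the text.
batch=64
width=416
height=416
channels=3
momentum=0.9
decay=0.0005
learning_rate=0.00025
```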

Experimental Results
The 12 LPCR modules were trained using 12 datasets among the 13 datasets configured previously (except for Test_13k). The accuracy of the trained LPCR modules was compared using the Test_13k dataset.
The LPCR performance on each of the seven characters of a Korean LP was compared first, and then the overall performance on complete LPs was compared. Since the character information of an LP is unique, every character in the LP must be correctly recognized. Therefore, the performance of the LPCR module was evaluated on the accurate recognition of all seven characters rather than on the recognition of individual characters. The modified YOLOv2 detector proposed in this paper simultaneously performs the localization and classification of the objects of interest (i.e., the characters in the LP), but the LPCR result is the same as long as the character classification is correct, even if the character's location is slightly off. Hence, the experiments in this study did not consider the locational accuracy of the characters. Table 4 reports the performance comparison of the LPCR modules trained with the 12 different training datasets. All characters except the third, Korean, character are number classes, and the recognition performance on numbers was over 99% for almost all of the comparison datasets. Since there are only 10 number classes and the distinction between them is obvious, performance was high regardless of the dataset. For the third, Korean, character, there are 35 classes, which makes identification more complex than for numbers, so performance differed across datasets. Therefore, the overall performance of the LPCR strongly depends on the recognition performance for the Korean characters.
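The evaluation protocol above can be sketched concretely: detections are ordered left to right by their x-coordinates to form the plate string, and a plate counts as correct only when the whole string matches the ground truth. The detection tuples and class labels below are illustrative.

```python
def plate_string(detections):
    """Read out a plate from character detections by sorting left to
    right. Each detection: (x_center, confidence, class_label).
    Location accuracy is not scored; only the sequence matters."""
    ordered = sorted(detections, key=lambda d: d[0])
    return "".join(label for _, _, label in ordered)

def plate_accuracy(predictions, ground_truths):
    """Exact-match accuracy: a plate is correct only if all seven
    characters are recognized correctly."""
    correct = sum(p == g for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)

# Illustrative detections for one plate (C7 stands for a use-code class).
dets = [(0.55, 0.98, "3"), (0.12, 0.99, "1"), (0.30, 0.97, "2"),
        (0.45, 0.95, "C7"), (0.70, 0.99, "4"), (0.82, 0.96, "5"),
        (0.93, 0.98, "6")]
pred = plate_string(dets)
print(pred)  # 12C73456
print(plate_accuracy([pred], ["12C73456"]))  # 1.0
```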
The accuracies with the Real_9k, Real_3k, and Real_159 datasets and the proposed modified YOLOv2 detector were 99.78%, 99.72%, and 97.85%, respectively. LPCR performance was high because these LPCR modules had been trained with real LP images and could therefore recognize them more easily. This means that the proposed modified YOLOv2 detector gives the LPCR sufficient recognition performance for use in an ALPR system: provided there are enough LP images to train the LPCR module, a high-performance LPCR module could be developed with more than 99.7% overall accuracy.
The LPCR performance of the six LPCR modules trained with the six LP datasets generated by the three proposed LP-GAN generators is compared in Figure 7. As can be seen, the LPCR module trained with LP images generated by the pix2pix_cGAN-based LP-GAN outperformed the two modules trained with CycleGAN- or StarGAN-generated images. This indicates that, among the three GAN-based image-to-image translation methods, pix2pix_cGAN generated the most realistic LP images. For LP images generated by a single LP-GAN generator, the more training items, the higher the recognition performance. Finally, the overall recognition performance with the Ensemble_3k dataset was 95.56%, which is lower than the 96.33% of pix2pix_cGAN_9k but higher than the 93.59% of CycleGAN_9k and the 94.23% of StarGAN_9k. In other words, even though the single-generator datasets contained three times as many training items, training with 3000 items combining LP images from multiple LP-GANs achieved higher recognition performance than training with 9000 items from a single LP-GAN in two of the three cases. The overall recognition performance with the Ensemble_9k dataset, comprising the 9000 combined items, was 98.72%, almost the same as when training with real images. This shows that higher-performance LPCR modules can be trained with LP images generated by multiple LP-GAN generators, and that the proposed LP-GAN can readily serve as a training data generator for the LPCR module of an ALPR system. Figure 7 shows the performance comparison between the ensemble datasets and the single-generator datasets. Table 5 compares the original YOLOv2 model and the proposed YOLOv2 model in terms of overall accuracy, average processing time, required GPU memory, and number of floating-point operations (FLOPs) for the LPCR process.
Each model was trained with the same training dataset (Real_9k) and tested with the same test dataset (Test_13k). The original YOLOv2 model and the proposed YOLOv2 model achieved overall accuracies of 99.95% and 99.78%, respectively; the proposed model is thus only slightly less accurate than the original. However, halving the number of filters in the CNN layers of the proposed YOLOv2 model reduced the number of FLOPs from 29.41 billion to 7.45 billion, and the processing time and required GPU memory were also roughly halved (the average processing time from 22 ms to 13 ms, and the required GPU memory from 1006 MB to 474 MB). Figure 8 shows LPCR results obtained with the modified YOLOv2 model. Traditional segmentation-based ALPR methods have difficulty recognizing such LP images because the character segmentation stage fails under distortion, contamination, illumination changes, and noise. However, since the modified YOLOv2 model detects the LP characters end-to-end without character segmentation (i.e., segmentation-free), its LPCR performance on these LP images was sufficient. Nevertheless, the LPCR could not recognize the character information of some other LP images. Figure 9 shows some recognition failures of the proposed YOLOv2 model. Although the proposed YOLOv2 model is robust to various environmental conditions, LPCR performance is degraded when an LP image is severely distorted by excessive blurring or artificial manipulation.
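The roughly fourfold FLOP reduction (29.41 billion to 7.45 billion) follows from halving the filter counts: each convolution's cost is proportional to the product of its input and output channel counts, and both are halved. The layer dimensions below are illustrative, not the exact Table 2 values.

```python
def conv_flops(h, w, c_in, c_out, k):
    """Multiply-accumulate count of a k x k convolution applied over an
    h x w feature map (stride 1, same padding assumed)."""
    return h * w * c_in * c_out * k * k

# Illustrative intermediate layer (not the exact Table 2 dimensions).
full = conv_flops(104, 104, 128, 256, 3)
half = conv_flops(104, 104, 64, 128, 3)  # both channel counts halved
print(full / half)  # 4.0
```

The measured ratio (29.41 / 7.45 ≈ 3.95) is slightly below 4 because the input layer has a fixed 3 channels and the output layer a fixed 250 channels, so those convolutions shrink by only about 2×.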

Conclusions
In this paper, we presented an LP image generator based on state-of-the-art GAN-based image-to-image translation methods to generate synthetic LP images from a small set of real LP images for end-to-end LPCR module training. The proposed LP-GAN generates LP images that are similar to real ones using only the 159 real LP images available online. The generated synthetic images were used as training data for the LPCR module, achieving 98.72% overall accuracy. Furthermore, the proposed method can be applied to generating LP images of other countries as well as Korean ones. In addition, we presented the modified YOLOv2 model for an efficient LPCR module that performs character segmentation and recognition simultaneously in a segmentation-free end-to-end manner. The proposed model is 1.7 times faster than the original YOLOv2 model, and the required GPU memory was also reduced by half.

Conflicts of Interest:
The authors declare no conflict of interest.