Intelligent Image Super‑Resolution for Vehicle License Plate in Surveillance Applications

Abstract: Vehicle license plate images are often low resolution and blurry because of the large distance and relative motion between the vision sensor and the vehicle, making license plate identification arduous. The extensive use of expensive, high-quality vision sensors is uneconomical in most cases; thus, images are initially captured and then translated from low resolution to high resolution. For this purpose, several techniques, such as bilinear and bicubic interpolation, the super-resolution convolutional neural network, and the super-resolution generative adversarial network (SRGAN), have been developed over time to upgrade low-quality images. However, most studies in this area pertain to the conversion of low-resolution images to super-resolution images, and little attention has been paid to motion deblurring. This work extends SRGAN by adding an intelligent motion-deblurring method (termed SRGAN-LP), which helps to enhance the image resolution and remove motion blur from the given images. A comprehensive, new domain-specific dataset was developed to achieve improved results. Moreover, while maintaining higher quantitative and qualitative scores in comparison to the ground-truth images, this study upscales the provided low-resolution image four times and removes the motion blur to a reasonable extent, making it suitable for surveillance applications.


Introduction
Navigant Research [1] suggests that the number of vehicles in the world will grow to two billion by 2035. This huge increase in the number of vehicles poses a significant challenge to humans in managing them manually. In this regard, smart cities require a significant focus on managing the flow of vehicles intelligently [2]. Different vision sensors, position-identification sensors, and many more applications are used to realize the concept of vehicles communicating autonomously, that is, traffic-flow or smart parking management. Several identification tags are used to achieve this; however, vehicle license plates are the most traditional and unique elements used for the correct identification of a vehicle's type and model year.
Vehicles are uniquely identified based on an important component known as the license plate. Finding a stolen car, tracking a trouble-making vehicle, smart parking management, and automatic toll collection all rely on vehicle license plates. For the smooth execution of such tasks, the correct identification of the vehicle license plate is indispensable. However, in some cases, the captured license plate images develop some sort of perturbation owing to low lighting conditions, low resolution, and motion blur, making this process difficult. Therefore, several image super-resolution (SR) techniques have been developed over time to overcome these challenges.
Image SR is a technique used to reconstruct high-resolution (HR) images from provided low-resolution (LR) counterparts. The application domain of SR is vast: it can be used in remote sensing [3], hyperspectral SR [4,5], and medical imaging [6][7][8]. However, in some cases, the images acquired by different imaging devices, such as surveillance cameras, cell phones, X-rays, MRI, and CT scans, are of low resolution. These images are often blurred and contain noise due to relative motion, illumination variation, distance variation, and low-quality imaging devices. Applications such as restoration [9], surveillance, and medical imaging systems [10,11] require HR images for recognition and diagnosis, respectively. Although some applications, such as Blu-ray movies, video conferencing, and web videos, are often produced in HR, to preserve server storage and bandwidth, they are often stored in LR.
To transform an LR image into an HR image, several techniques are available that can be classified into two broad categories: traditional image-processing techniques and convolutional neural network (CNN)-based SR algorithms [12]. Traditional methods, such as bilinear and bicubic interpolation, are computationally inexpensive and easy to deploy; however, they have a few limitations that make them inefficient in certain circumstances. One basic limitation of these methods is that they generate overly smooth textures in reconstructed images. In addition, these methods typically fail to reconstruct the original content of an image. Modern techniques, in contrast, are usually based on deep learning, specifically CNNs. These techniques iteratively enhance image quality by minimizing the loss between the original image and the reconstructed image, and numerous optimization techniques are available to help CNN models reduce this loss.
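For concreteness, the bilinear baseline mentioned above can be sketched with NumPy alone; the function name and implementation details are ours, not from the paper, and the sketch handles a single grayscale channel:

```python
import numpy as np

def bilinear_upscale(img, scale=4):
    """Classic bilinear interpolation for a 2-D (grayscale) image: each
    output pixel is a distance-weighted average of its four nearest
    input pixels. Cheap, but it yields the overly smooth textures
    described in the text."""
    h, w = img.shape
    ys = (np.arange(h * scale) + 0.5) / scale - 0.5  # sample positions in input space
    xs = (np.arange(w * scale) + 0.5) / scale - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    wy = np.clip(ys - y0, 0.0, 1.0)[:, None]  # vertical blend weights
    wx = np.clip(xs - x0, 0.0, 1.0)[None, :]  # horizontal blend weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy
```

Because every output pixel is a convex combination of input pixels, no high-frequency detail can be created, which is exactly the limitation the text describes.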
The proposed super-resolution generative adversarial network for license plates (SRGAN-LP) is based on one of the most promising techniques for image SR, known as the super-resolution generative adversarial network (SRGAN) [13]. The original SRGAN architecture comprises three components: a deep generator consisting of several residual blocks, a discriminator, and a novel objective called perceptual loss for realistic image reconstruction. However, our solution is largely aimed at the identification of digits on the license plate of a vehicle, rather than realistic image generation. Therefore, we reduced the size of the original SRGAN generator to a minimum to reduce the computational cost. In addition, we incorporated a motion-deblurring method into the original SRGAN so that the digits and letters are correctly identified. Our extensive experimental results justify these changes to the original architecture.
Our proposed SRGAN-LP method is compared with traditional techniques, such as bilinear and bicubic interpolation, and with the single-image super-resolution method SRCNN [8]. The experimental results show the promising performance of our method. Similarly, the results were compared with the SRGAN trained on the ImageNet dataset. To justify the effectiveness of SRGAN-LP, we conducted comprehensive experiments on two different testing sets. First, we used a testing set drawn from the same synthesized dataset as the training images, and in the second phase, we performed experiments on independent images, that is, images independently collected from vehicles. Considering all these experiments and comparisons, we summarize our contributions to vehicle license plate image SR as follows:

• In light of the usefulness of SRGAN in the current literature, we incorporated motion deblurring into its architecture, thus achieving good-quality HR and deblurred images.
• We reduced the size of the original SRGAN by reducing the number of residual blocks in the generator network from 16 to 8, consequently achieving a lower inference time while preserving the same performance.
• We developed a comprehensive, new domain-specific dataset that originally contains 3112 images of different regions and color patterns. Furthermore, we diversified the angles of the images and increased the size of the dataset to 12,388 using different augmentation techniques.
The remainder of this paper is organized as follows. Sections 2 and 3 present the related work and the proposed methodology, respectively. The experimental results and evaluations are presented in Section 4. Section 5 concludes the paper with a discussion of future work.

Related Work
As image SR and deblurring are applied to tackle various challenges in real-world scenarios, the related work is divided into two parts: Section 2.1 focuses on super resolution and deblurring, and Section 2.2 covers the existing literature on intelligent vehicle license plate recognition.

Image Super Resolution and Deblurring
Image SR and deblurring [14] have remained hot research areas in the computer vision community. Earlier approaches relied on pure image-processing techniques, applying sharpening filters followed by interpolation-based methods such as bicubic and bilinear interpolation [15]. These methods remained benchmarks for a considerable period; however, they exhibit the persistent problem of generating overly smooth textures in reconstructed images. With the emergence of CNNs and their promising results in other fields, researchers applied them in the SR domain as well. In this regard, a breakthrough approach, SRCNN [8], applied convolutional layers to enhance an LR image, and its results were very impressive when first published. Succeeding SRCNN, a very deep convolutional network named VDSR was proposed in [16], where 16 convolutional layers were added with the implementation of residual learning. VDSR produced better results than SRCNN. Both SRCNN and VDSR aimed to increase the peak signal-to-noise ratio (PSNR) between the recovered SR image and the HR image by reducing the mean square error (MSE) between them. Although CNN-based methods performed much better than traditional methods, the invention of generative adversarial networks (GANs), with their remarkable results, shifted the SR domain largely toward GAN-based methods.
The idea of the GAN was first coined by Goodfellow et al. [17], who trained a generative model and a discriminative model simultaneously through an adversarial process. Based on the GAN, Ledig et al. proposed a method called SRGAN [13]. The SRGAN framework is capable of inferring photorealistic natural images for 4x upscaling factors. A new GAN-based methodology was proposed by Mao et al. [18]: least squares GANs (LSGANs), in which a least squares loss function is used for the discriminator; it was found that LSGANs can generate higher-quality images than regular GANs. In addition, LSGANs remain more stable during the learning process than regular GANs. Lim et al. [19] developed an enhanced deep SR network called EDSR. This improvement was achieved by removing unnecessary modules from conventional residual networks; they found that the proposed EDSR was better optimized for generating SR images than the original GAN. A novel approach for synthesizing HR photorealistic images from semantic label maps, using conditional generative adversarial networks (conditional GANs), was proposed by Wang et al. [20]; it generated visually appealing 2048 × 1024 results with a novel adversarial loss along with new multi-scale generator and discriminator architectures.
However, less attention has been paid to deblurring in the SR arena. Earlier works relied primarily on Laplacian filters for sharpening; however, sharpening an image alone does not usually guarantee the reconstruction of its original content. A comparatively recent work by Kupyn et al. [21] proposed DeblurGAN, an end-to-end learned method for motion deblurring. The DeblurGAN training process involves a conditional GAN and a content loss. They showed that the proposed DeblurGAN was five times faster than the DeepDeblur [22] model while remaining competitive in terms of the structural similarity measure and visual appearance. Similarly, Nah et al. [23] presented an averaging-based technique; however, it lacks generalization capability owing to the lack of diversity in datasets generated using averaging.

License Plate Super Resolution and Deblurring
There are two different methods for license plate recognition (LPR): segmentation-based [24] and non-segmentation-based [25]. Segmentation-based techniques mainly trace back to traditional machine learning, whereas non-segmentation-based techniques largely subsume recent deep learning approaches, including CNNs, for the identification or reconstruction of license plate images. Segmentation-based methods first divide the license plate into segments of characters, which are then recognized using projection-based [26] and connected-component-based classifiers [27]. In contrast, a non-segmentation-based method was first proposed by Shi et al. [28], where a deep CNN was applied for feature extraction directly, without a sliding window, and a bidirectional long short-term memory network was used for sequence labeling. The literature reveals that non-segmentation-based methods are promising for license plate image-quality enhancement.

Proposed Methodology
The goal of image SR is to obtain an HR image from a provided LR image, as shown in Figure 1, which depicts the proposed SRGAN-LP. Our aim is to train a generator that predicts a high-resolution image I_H from the provided low-resolution image I_L with minimum loss. To perform this process, we construct a generator network G, which is a deep CNN model with parameters θ_G. For all N training images, we optimize θ_G as given in Equation (1). A visual overview of the generator network G is depicted in Figure 2, and details of the input and output parameters of the proposed method are given in Table 1.
The I_L images are obtained by downsampling with a bicubic kernel with a scale factor of δ = 4 and applying motion blur β to the I_H images with the number of channels C, as shown in Equation (3). After the images are converted into I_L, they are input to G.
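Written out, the degradation model and the generator objective described above can be sketched as follows; this is a reconstruction from the surrounding definitions, since the typeset forms of Equations (1) and (3) are not preserved in this text, with ℓ_S denoting the perceptual loss defined later:

```latex
% LR image formation (cf. Equation (3)): motion blur, then downsampling by delta
I_L = \left( I_H * \beta \right) \downarrow_{\delta}, \qquad
I_H \in \mathbb{R}^{W_H \times H_H \times C},\; \delta = 4,\; C = 3

% Generator training objective (cf. Equation (1)), averaged over N training pairs
\hat{\theta}_G = \arg\min_{\theta_G} \frac{1}{N} \sum_{n=1}^{N}
  \ell_S\!\left( G_{\theta_G}\!\left( I_L^{(n)} \right),\, I_H^{(n)} \right)
```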

Adversarial Loss Function
For the adversarial loss, the variable I_H is an input to the discriminator (D), which compares the I_S and I_H images to discriminate between them. For this reason, I_H images are also given to D directly from the dataset.
The aim of D is to calculate the adversarial loss. Equation (4) represents D and shows how it is parameterized by θ_D. In Equation (4), the adversarial loss is calculated, which later contributes to the perceptual loss calculation. The architecture of the discriminator network D is illustrated in Figure 3. Equation (5) is used to calculate the adversarial loss in terms of the probabilities returned by D. This adversarial loss is then combined with another loss to obtain the final objective function of SRGAN.
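As a sketch, consistent with the original SRGAN formulation that this work builds on (the exact typeset forms of Equations (4) and (5) are not preserved in this text), the discriminator is trained in a min-max game, and the generator's adversarial term is the negative log-probability that D accepts a generated image:

```latex
% Adversarial min-max objective (cf. Equation (4))
\min_{\theta_G} \max_{\theta_D} \;
  \mathbb{E}_{I_H}\!\left[ \log D_{\theta_D}(I_H) \right] +
  \mathbb{E}_{I_L}\!\left[ \log \left( 1 - D_{\theta_D}\!\left( G_{\theta_G}(I_L) \right) \right) \right]

% Generator-side adversarial loss (cf. Equation (5)), summed over N samples
\ell_{adv}^{S} = \sum_{n=1}^{N} -\log D_{\theta_D}\!\left( G_{\theta_G}\!\left( I_L^{(n)} \right) \right)
```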

Content Loss Function
VGG19 was used to calculate the content loss in this architecture. The content loss is the pixel-wise MSE between the generated image I_S and the original high-resolution image I_H of the dataset, which can be calculated using Equation (6).
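A hedged reconstruction of Equation (6): writing φ for a VGG19 feature map (taking φ as the identity gives the plain pixel-wise variant described above), the MSE form over a map of width W and height H is:

```latex
% Content loss (cf. Equation (6)): MSE between (feature maps of) I_H and I_S
\ell_{X}^{S} = \frac{1}{W H} \sum_{x=1}^{W} \sum_{y=1}^{H}
  \left( \phi(I_H)_{x,y} - \phi\!\left( G_{\theta_G}(I_L) \right)_{x,y} \right)^{2}
```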

Perceptual Loss Function
Perceptual loss is a weighted combination of the content loss and the adversarial loss, which tends to reconstruct the original content of an image. Previously, SR problems were commonly based on the MSE loss function alone; however, in the proposed SRGAN-LP, the MSE-based content loss combined with the adversarial loss is used to push G to reconstruct the original content of the image.
The perceptual loss l_S is the weighted sum of the content loss l_X^S and an adversarial loss component, according to Equation (7). After the perceptual loss calculation, backpropagation occurs and optimizes the G network to learn the distribution more efficiently. The process shown in Figure 1 continues until the G network starts generating images that are more realistic and have recognizable digits.
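The weighted combination can be written as follows; the 10^-3 weight is the value used in the original SRGAN and is an assumption here, since this text does not preserve the weight used in Equation (7):

```latex
% Perceptual loss (cf. Equation (7)): content loss plus weighted adversarial loss
\ell_{S} = \ell_{X}^{S} + 10^{-3}\, \ell_{adv}^{S}
```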

Results and Discussion
We conducted extensive experimentation and testing to evaluate the performance of the proposed SRGAN-LP using various evaluation techniques. For this purpose, we collected a large-scale dataset, as discussed in Section 4.1. Similarly, Section 4.2 briefly describes the experimental setup, followed by a comprehensive evaluation of the results in Section 4.3.

Dataset Acquisition
Data works as the fuel for deep learning models; however, collecting a large amount of vehicle license plate data with a uniform spatial resolution and nearly identical lighting conditions is a challenging task. For this purpose, we accessed a license plate repository [11] and downloaded 3700 images with various backgrounds and digit colors as a raw dataset. To increase the number of images and diversify their angles, we used the data augmentation library "Augmentor" [12]. Using "Augmentor", we incorporated diversity into the images by changing their angles with standard techniques such as tilt, skew, and rotation. Subsequently, we synthesized a dataset of 12,388 HR images from this raw dataset. The model was trained on the HR images using a standard spatial resolution of 256 × 256 pixels. We maintained a scale factor (δ) of four for the training. For testing purposes, we segregated 100 images from the synthesized dataset (the synthetic test set) and downloaded another set of 100 images from Google, referred to as the real test set in this section.

Experimental Setup
We trained our proposed SRGAN-LP network on an NVIDIA GTX 1070 GPU with 12 GB memory and 24 GB RAM. For training G, we obtained the low-resolution I_L images by applying a motion blur β of size 16 and a downsampling factor δ of four, thus reducing the sizes of the images from 256 × 256 to 64 × 64. For D, we used the original high-resolution images. Our generator network consists of eight identical residual blocks Λ and two transposed convolution layers. We used the Adam optimization algorithm [29] for our network. For G and D, we maintained learning rates of 10^-5 and 10^-6, respectively. To train the composite model, that is, SRGAN-LP, we used a learning rate of 10^-3. We used the deep learning library Keras [13], with TensorFlow [14] as the backend, for the implementation of this network.
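The degradation pipeline described here (a motion blur β of size 16 followed by 4x downsampling from 256 × 256 to 64 × 64) can be sketched as follows; the horizontal blur direction and plain strided downsampling are simplifying assumptions of ours, since the paper uses a bicubic kernel and does not state the blur direction:

```python
import numpy as np

def motion_blur_kernel(size=16):
    """Normalized horizontal motion-blur kernel; the paper specifies a
    blur of size 16 but not its direction, so horizontal is an assumption."""
    k = np.zeros((size, size))
    k[size // 2, :] = 1.0 / size
    return k

def degrade(hr, blur_size=16, scale=4):
    """Turn an I_H image into an I_L image: motion blur, then downsampling.
    Plain striding stands in for the paper's bicubic kernel for brevity."""
    k = motion_blur_kernel(blur_size)
    pad = blur_size // 2
    h, w, channels = hr.shape
    blurred = np.empty_like(hr, dtype=np.float64)
    for c in range(channels):
        padded = np.pad(hr[:, :, c], pad, mode="edge")
        out = np.zeros((h, w))
        # direct 2-D convolution, dependency-free
        for i in range(blur_size):
            for j in range(blur_size):
                out += k[i, j] * padded[i:i + h, j:j + w]
        blurred[:, :, c] = out
    return blurred[::scale, ::scale, :]

hr = np.random.rand(256, 256, 3)  # an I_H-sized image
lr = degrade(hr)                  # I_L: shape (64, 64, 3)
```

During training, pairs (lr, hr) produced this way would be fed to G and D, respectively.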

Performance Evaluation
Qualitative evaluation often involves human ratings, whereas quantitative evaluation comprises standard image-processing metrics such as the PSNR and the structural similarity index metric (SSIM) [30]. In addition to the qualitative and quantitative evaluations, the proposed SRGAN-LP was analyzed using optical character recognition (OCR) results.

Quantitative Evaluation
We conducted a quantitative evaluation on both of our test sets, that is, the synthetic and real test sets, using the PSNR and SSIM [23,31].
Equation (8) shows the formula for calculating the PSNR between the original and reconstructed images, where f is the original image and g is the reconstructed image obtained using a certain technique. A higher PSNR value indicates better SR results.
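A minimal implementation of the PSNR of Equation (8) (the function name and `max_val` default are ours):

```python
import numpy as np

def psnr(f, g, max_val=255.0):
    """PSNR of Equation (8): 10 * log10(MAX^2 / MSE) between the original
    image f and the reconstructed image g; higher is better."""
    f = np.asarray(f, dtype=np.float64)
    g = np.asarray(g, dtype=np.float64)
    mse = np.mean((f - g) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```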
Similarly, Equation (9) represents the SSIM between the original and reconstructed images. The SSIM is the product of the differences in luminance l, contrast c, and structural similarity s between the original and generated images. The SSIM ranges from 0 to 1, and a score closer to 1 is considered best in the case of SR.
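A single-window sketch of the SSIM of Equation (9); practical implementations following Wang et al. [30] average this over local windows, so this global variant is a simplification, and the constants C1, C2 are the customary defaults rather than values stated in the paper:

```python
import numpy as np

def ssim_global(f, g, max_val=255.0):
    """Single-window SSIM: the luminance, contrast, and structure terms of
    Equation (9), combined in the usual two-factor form with stabilizing
    constants C1 and C2."""
    c1 = (0.01 * max_val) ** 2
    c2 = (0.03 * max_val) ** 2
    f = np.asarray(f, dtype=np.float64)
    g = np.asarray(g, dtype=np.float64)
    mu_f, mu_g = f.mean(), g.mean()
    var_f, var_g = f.var(), g.var()
    cov = ((f - mu_f) * (g - mu_g)).mean()
    return ((2 * mu_f * mu_g + c1) * (2 * cov + c2)) / \
           ((mu_f ** 2 + mu_g ** 2 + c1) * (var_f + var_g + c2))
```

For identical images the expression reduces to 1, the best attainable score.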
Table 2 shows the average PSNR and SSIM values for the evaluation conducted on the synthetic test set. The higher PSNR and SSIM values show the effectiveness of the proposed SRGAN-LP on the synthetic test set. Table 3 shows the average PSNR and SSIM scores for the evaluation conducted on the real test set. These tables reveal the effectiveness of the proposed method in comparison to baseline techniques such as bilinear and bicubic interpolation and SRCNN [32]. Moreover, the results are also compared with the SRGAN trained on the ImageNet dataset. For further assessment, qualitative results are discussed in the next subsection.

Evaluation Using Inference Time
The deep learning inference time is the time consumed by a deep learning model for a single prediction; in the context of image reconstruction, it is the time required for a model to reconstruct a new image. The inference time depends on the number of model parameters. In Figure 4a,b, the size of the bubbles represents the number of parameters of each model. In the experiments, it was evident that models with a larger parameter space had a higher inference time, whereas models with fewer parameters had a lower inference time. Traditional, parameter-free methods have remarkably low inference times. Low-parameter models with higher reconstruction scores can be deployed easily on resource-constrained devices.
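Inference time in this sense can be measured with a simple wall-clock helper; this is a generic sketch, not the paper's measurement code, and the warm-up runs (to absorb one-off setup costs) are our addition:

```python
import time

def mean_inference_time(model_fn, sample, runs=10, warmup=2):
    """Average wall-clock time of a single prediction of model_fn on
    sample, with a short warm-up so one-off setup costs do not skew
    the measurement."""
    for _ in range(warmup):
        model_fn(sample)
    t0 = time.perf_counter()
    for _ in range(runs):
        model_fn(sample)
    return (time.perf_counter() - t0) / runs
```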

Qualitative Evaluation
In contrast to quantitative image assessment, which tends to assess image quality more technically, qualitative analysis involves human expertise. In qualitative analysis, the reconstructed images are presented to human raters, who provide their opinions and assess the quality of the reconstructed images. The mean opinion score (MOS) is one of the most widely used techniques for qualitative image assessment.
Figure 5 shows the visual quality of the reconstructed images of the synthetic test set. These images were presented to human raters to assess the quality of the images and, more importantly, to identify the digits highlighted by the yellow bounding boxes.
Similarly, Figures 6 and 7 show the visual quality of the reconstructed images of the real test set. The same images of the real test set were presented to the human raters, and their opinions on the quality of the reconstructed images are presented in Figure 4.


Evaluation Using OCR
To further consolidate our experimental results, we subjected both test sets (synthetic and real) to OCR. For this purpose, we used the publicly available OCR service "platerecognizer" [33]. The accuracy of the OCR is based on the number of characters present in an image versus the number of correctly recognized characters. For instance, an image in the test set originally contains the characters GX6933; however, the OCR predicts a different value, such as GX693, as shown in Figure 8. This misrecognition negatively contributes to the average accuracy of the OCR in recognizing license plate digits. Equation (10) represents the global error rate of the OCR for a given image, where n_e is the number of errors committed and n_c is the number of characters present in the image. Using this Equation, we illustrate the performance of the OCR on the different reconstruction techniques used in the experiments.
Table 4 shows the average accuracy over all the images in both test sets. The OCR's accuracy depends significantly on the position of the license plate in the image, and both test sets contained images that made accurate character recognition difficult for the OCR. However, the higher accuracy of the proposed method compared with the other methods verifies the effectiveness of our work.
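The character-level accuracy of Equation (10) can be sketched as 1 − n_e/n_c; counting errors via the Levenshtein edit distance is our assumption, since the paper does not name the error metric:

```python
def ocr_accuracy(truth, predicted):
    """Character-level OCR accuracy per Equation (10): 1 - n_e / n_c,
    where n_c is the number of ground-truth characters and n_e the
    number of errors, counted here as the Levenshtein distance
    (an assumption; the paper does not specify the error count)."""
    m, n = len(truth), len(predicted)
    # classic dynamic-programming edit distance
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if truth[i - 1] == predicted[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    n_e = d[m][n]
    return 1.0 - n_e / len(truth)

print(ocr_accuracy("GX6933", "GX693"))  # one missed character -> 1 - 1/6
```

On the GX6933 example above, one dropped character out of six ground-truth characters gives an accuracy of about 0.83.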

Conclusions
This study aimed to enhance the quality of vehicle license plate images by increasing their resolution and removing motion blur. The manual management of vehicles in a smart surveillance environment is an arduous task; therefore, increasing attention has been paid to the intelligent management of vehicles in such environments. In this regard, license plates are considered unique identification tags for vehicles. However, owing to the high-speed motion of vehicles, motion blur is a common phenomenon in surveillance environments. To tackle this challenge, we proposed SRGAN-LP, which performs deblurring in addition to super resolution. Extensive experimental results indicate that the proposed method outperforms existing methods in producing high-resolution deblurred images.
The results obtained by the proposed method were better in terms of both qualitative and quantitative evaluations compared with existing methods. However, the inference time was relatively high. Achieving real-time performance by reducing the inference time of the proposed SRGAN-LP is suggested as future work. In addition, the evaluation can be extended using other metrics, and more modules can be added to the system, such as vehicle recognition [34], vehicle logo recognition [35], and make/model recognition [36], for better vehicular analysis.

Figure 1 .
Figure 1. Overview of the proposed methodology. I_H and I_L are acquired from the (a) surveillance environment. In the (b) training process, the generator receives I_L, removes blur, and upscales I_L to I_S. The discriminator and VGG-19 calculate the adversarial loss and content loss between the original I_H and the generated I_S. Finally, both losses are added proportionately to form the perceptual loss l_S. Subsequently, l_S updates the generator weights. In the (c) testing process, I_L is directly input to the generator trained in the training process, and I_S is acquired as the resultant HR image.

Figure 2 .
Figure 2. Architecture of the generator network. "k" represents the filter size, "n" the number of filters, and "s" the stride value used in a particular layer. Two types of convolutional blocks are used in the generator: a "residual block" is used for feature extraction, whereas the "UpSampling" block is used for upscaling the feature maps to the HR size. Equation (1) is used to convert I_L to its I_H counterpart. Similarly, in Equation (2), I_H represents the HR image, and W_H and H_H represent the width and height of the HR images, respectively. C (BGR, C = 3) represents the number of channels in the image. HR images were available only during training. I_H images are converted to LR I_L images by applying motion blur (β = 16) and a downsampling operation with a specified scale δ.

Figure 3 .
Figure 3. Architecture of the discriminator network. "k" represents the filter size, "n" the number of filters, and "s" the stride value used in a particular layer. Two types of convolutional blocks are used in the discriminator: one comprises convolution, batch normalization, and leaky ReLU layers, whereas the other lacks the batch normalization layer.

Figure 4 .
Figure 4. (a) Inference time of each reconstruction method on the x-axis and mean opinion score (MOS) on the y-axis for the real test set; (b) illustrates the same for the synthetic test set. The results in both (a) and (b) suggest that the proposed SRGAN-LP is the most convincing to the human raters, achieving an MOS of 4.5 on the real test set and 5.0 on the synthetic test set.


Figure 5 .
Figure 5. Visual quality, along with the corresponding PSNR/SSIM, of the images reconstructed with the different techniques used in the experiments.

Figure 6 .
Figure 6. (a) Ground-truth image (original image); (b) motion blur applied to the whole image (distorted image); (c) visual quality of the reconstructed images along with the corresponding PSNR/SSIM scores.

Figure 7 .
Figure 7. (a) Ground-truth image (original image); (b) motion blur applied to the whole image (distorted image); (c) visual quality of the reconstructed images along with the corresponding PSNR/SSIM scores.

Figure 8 .
Figure 8. (a) Visual results of the OCR for the real test set; (b) results for the synthetic test set. The green and red dots indicate correct and incorrect predictions, respectively.

Table 1 .
Description of input and output parameters used in the proposed model.

Table 2 .
Average PSNR (dB) and SSIM of 100 reconstructed test images for the synthetic test set. Bold scores show the best results, and underlined values represent the second-best results.

Table 3 .
Average PSNR (dB) and SSIM of 100 reconstructed test images for the real test set. Bold scores show the best results, and underlined values represent the second-best results.

Table 4 .
Average accuracy calculated for synthetic test set and real test set.