A Deep Learning Based Approach for Localization and Recognition of Pakistani Vehicle License Plates

License plate localization is the process of finding the license plate area and drawing a bounding box around it, while recognition is the process of identifying the text within the bounding box. Current state-of-the-art license plate localization and recognition approaches require license plates of standard size, style, font, and color. Unfortunately, license plates in Pakistan are non-standard and vary in all of these characteristics. This paper presents a deep-learning-based approach to localize and recognize Pakistani license plates with non-uniform and non-standardized sizes, fonts, and styles. We developed a new Pakistani license plate dataset (PLPD) to train and evaluate the proposed model. We conducted extensive experiments to compare the accuracy of the proposed approach with existing techniques. The results show that the proposed method outperforms the other methods in localizing and recognizing non-standard license plates.


Introduction
Automatic license plate recognition has widespread applications in detecting traffic violations and parking offenses and in supporting decision making for vehicle e-ticketing [1]. A license plate is a crucial token issued by the state authority for vehicle identification and record keeping. Traffic wardens, tax collectors, and other stakeholders use license plates to monitor traffic and keep records accordingly. Modern traffic management systems rely heavily on automatic monitoring systems based on computer vision and machine learning techniques.
These typically require a license plate of standard size, color, font, style, and fixed location for automated localization and recognition. Unfortunately, the license plates currently used in Pakistan do not conform to standard features. In addition to a license plate, people also use different text to mention their profession, tribe, political affiliation, etc. This kind of text looks similar to license plates, which makes localization and recognition more difficult.
For illustration, a few examples of Pakistani license plates are shown in Figure 1. Note that the license plate in Figure 1a is divided vertically into two colors: green and white. The green part contains a monogram and the text "PUNJAB". Similarly, the text in the white part of the license plate consists of two lines, where the first line mixes different font sizes separated by a hyphen.
The license plates given in Figure 1b,c consist of three lines of text with gray and white backgrounds, respectively. The middle line in Figure 1b contains a dot that separates the digits, while the middle line in Figure 1c starts with two English alphabet letters, followed by a small monogram, and then four digits. Similar variations are visible in the rest of Figure 1d-h. In such situations, traditional license plate localization and recognition techniques fail to work [2].

The recent advancements in deep learning have outperformed traditional techniques in many fields of computer vision, such as object detection [3,4], segmentation [5][6][7][8], tracking [9,10], image understanding [11,12], image classification [13][14][15], medical diagnosis [16,17], and more [18,19]. A deep model requires a large amount of data for training to extract rich features and find the hidden nonlinear relationships among different entities. The successful application of deep learning in many fields of computer vision inspired this research.

This paper presents a deep architecture to localize and recognize Pakistani license plates. Our model correctly localizes and recognizes the license plate in the presence of other text and handles variations in color, illumination, size, style, and font. The main contributions of this work are: (1) a new Pakistani license plate dataset (PLPD) with ground-truth annotations for localization and recognition; (2) a deep pipeline that localizes, rectifies, and recognizes non-standard license plates; and (3) extensive experiments comparing the proposed approach with existing techniques.

The rest of the paper is organized as follows: the related work is discussed in Section 2, while Sections 3-5 present the proposed model, experimental details, and conclusion, respectively.

Related Work
License plate localization and recognition is considered a two-step process. Localization means finding the region in an image containing the license plate, while recognition means identifying the text written in it. Both localization and recognition require feature extraction followed by classification. Based on feature extraction, license plate localization and recognition methods can be divided into two classes: (1) hands-on feature engineering and (2) automated feature engineering. Hands-on feature engineering methods use core computer vision approaches to explicitly extract features, whereas automated feature engineering methods use machine learning approaches to implicitly learn them.
Hands-on feature engineering methods: Notable works on Pakistani license plate recognition based on hands-on feature engineering include that of Malik et al. [20], who used connected component analysis (CCA) to localize and recognize standard number plates of the Punjab province, which contain an inherent green region. They used the ratio between different color channels to locate the number plate area and template matching for recognition. Illumination changes, dust, and weather affect the color ratios and thus lead to imperfect detection. Similarly, the aspect ratio between the green and white regions has been exploited in HSV color space to localize Punjab's standard license plates [21].
In many works in this category [22][23][24], the Sobel edge detector is used to localize the license plate and segment characters, followed by template matching for recognition. The Sobel edge detector is very sensitive to noise because it relies on the first derivative, which leads to poor detection. Moreover, the Sobel detector is scale-variant and fails to detect license plates of different sizes.
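For concreteness, the vertical Sobel response that these edge-based methods build on can be sketched in a few lines of numpy. This is an illustrative toy implementation with a synthetic image, not the pipeline of the cited works:

```python
import numpy as np

def sobel_vertical_edges(img):
    """Convolve a grayscale image with the vertical Sobel kernel and
    return the absolute response. A minimal sketch of the edge-based
    localization step; the cited works build full pipelines on top of
    such responses."""
    kx = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * kx)
    return np.abs(out)

# Synthetic image with a sharp vertical edge between columns 3 and 4.
img = np.zeros((8, 8))
img[:, 4:] = 255.0
edges = sobel_vertical_edges(img)
```

Note that any pixel noise enters the response directly through the first derivative, which is exactly the sensitivity criticized above.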
The histogram of vertical and horizontal edges was also employed for license plate localization [25]. A predefined threshold was used to analyze the license plate histogram. However, this approach does not handle scale variations and cannot distinguish the license plate from other text phrases if they exist. Rasheed et al. [26] used the Hough transform (HT) to detect vertical and horizontal edges for localization. The Hough transform requires a manually specified threshold to detect a line of a specific length. Moreover, this approach looks for rectangular regions and fails to distinguish a number plate from other rectangular regions.
Samra et al. [27] applied morphological operations followed by connected component analysis to get license plate proposals. The final license plate region was selected based on enclosed objects using a genetic algorithm (GA). In addition to derivatives and thresholding (prone to noise), it considers number plates as a sequence of fixed length and limits the application to standard license plates only.
In [28], the vertical edges were detected followed by AdaBoost to select the coarser level character-specific extremal regions (ER) for detection. Finally, the histogram of oriented gradient (HoG) was extracted, and the recognition was performed through hybrid discriminative restricted Boltzmann machines (HDRBMs). This approach considers a fixed aspect ratio between the height and width and restricts the application to Chinese standard number plates. Bhutta et al. [29] assumed a region as a license plate where characters lie on a straight line. They perform recognition using an SVM classifier.
Automated feature engineering methods: These methods use deep learning to implicitly learn rich features. The deep-learning-based approaches for license plate recognition include [30,31], which generate license plate region proposals and perform the final selection using a CNN as a binary classifier. Similarly, a CNN trained on the entire character sequence has been used to detect and recognize Malaysian license plates [32].
The techniques above fail to distinguish license plates from other general alphanumeric text if it exists. Moreover, the fixed-width bounding box restricts the application to standard license plates. Zang et al. [33] applied a visual attention model to detect license plates containing blue and yellow regions, with the final classification performed by an SVM. This approach is limited to certain types of number plates and is sensitive to variations in illumination and scale.
Jain et al. [34] generated license plate proposals with a vertical Sobel filter and applied a binary CNN for final verification. However, this approach cannot recover the true plates missed by the noise-sensitive Sobel stage. Laroca et al. [35] used two CNN models to detect vehicles and localize license plates, respectively. This method considers the number plate as a sequence of fixed length (seven characters) and limits the application to standard (Brazilian) license plates only.
The work presented in [36] detects color edges followed by morphological analysis to extract the license plate. CCA and PA (projection analysis) with fixed height and width are used to segment characters, which is followed by recognition with CNN. Zhuang et al. [37] used semantic segmentation and counting refinement for recognition. This method works for fixed-length number plates and fails for varying lengths.
The methods discussed thus far have addressed license plate recognition for standard designs/sizes. However, the task of recognition for non-standard number plates remains a crucial gap. Hence, in this work, we propose a novel model to address the task of number plate recognition for varying sizes, fonts, styles, and designs.

Proposed Model
The proposed model consists of three modules: localization, rectification, and recognition. Figure 2 presents the block diagram of the proposed model. The localization module is responsible for finding the license plate area and drawing a bounding box around it. Typically, license plates appear tilted or sheared in a particular direction due to vehicle motion or camera orientation. The curved and tilted nature of the text reduces recognition accuracy, so the text is rectified before recognition. Finally, the rectified image is passed to the recognition module, which converts the image text into editable text. The following subsections discuss the proposed model in detail.

License Plate Localization
YOLO [38] is a fully convolutional neural network that contains layers with downsampling, upsampling, and skip connections. It takes an RGB image I ∈ 416×416×3 and generates an output map O ∈ 13×13×|f_v|, where |f_v| is the length of the feature vector f_v predicted by the network, and f_v is given by [38]:

f_v = (t_x, t_y, t_w, t_h, t_o).

The feature vector contains the bounding box information, where (t_x, t_y) is the midpoint of the bounding box relative to the cell, while t_w, t_h, and t_o represent the width, height, and type (class) of the bounding box, respectively. Note that the bounding boxes are divided into two classes: license plate and non-license plate. For license plates, t_o = 1, and vice versa for non-license plates. The cell is offset by (c_x, c_y) from the top left corner of the image, and the coordinates of the bounding box are then given by [38]:

b_x = σ(t_x) + c_x,
b_y = σ(t_y) + c_y,
b_w = p_w e^(t_w),
b_h = p_h e^(t_h),

where (b_x, b_y), b_h, and b_w represent the midpoint, height, and width of the resultant bounding box, respectively, σ(·) is the sigmoid function, and p_h and p_w represent the height and width of the anchor box, respectively.
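The box-decoding step above can be written directly in numpy. A minimal sketch (function name and example values are ours; the sigmoid and exponential mapping follow the standard YOLO formulation [38]):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(t, cell, anchor):
    """Decode one YOLO prediction (t_x, t_y, t_w, t_h) into box
    coordinates. `cell` is the grid-cell offset (c_x, c_y) and
    `anchor` the anchor size (p_w, p_h), all in grid units."""
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = anchor
    bx = sigmoid(tx) + cx   # box center x, offset within the cell
    by = sigmoid(ty) + cy   # box center y
    bw = pw * np.exp(tw)    # width, scaled from the anchor
    bh = ph * np.exp(th)    # height
    return bx, by, bw, bh

# Zero raw predictions place the center mid-cell at the anchor size.
bx, by, bw, bh = decode_box((0.0, 0.0, 0.0, 0.0), cell=(6, 6), anchor=(3.0, 1.0))
```

Because σ(t_x) ∈ (0, 1), the predicted center can never leave its grid cell, which is what makes the per-cell prediction stable.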

License Plate Rectification
As mentioned before, Pakistani license plates are non-standard and vary in color, font, style, and orientation. Moreover, license plate images may contain distortion, which further complicates recognition. Thus, rectification is essential to increase recognition accuracy. Rectification could be performed with an affine transformation, but that cannot handle free variations in scale, rotation, and translation due to its geometric constraints. Therefore, we used a trainable and constraint-free model, the multi-object rectified network (MORN) [39]. MORN uses deformable convolutional kernels to extract distinctive, rich features that predict the offsets used to rectify the text. Table 1 provides the model summary [39].

Table 1. MORN model summary (layer type, hyper-parameters, and output size); see [39] for details.

MORN applies ReLU to the batch-normalized output of each convolutional layer except the last. It splits the image into different parts and then estimates the offset of each part. The tanh(·) activation is used to return offset values in the range (−1, 1), representing the position relative to the original position. The relative offset map is resized to the size of the input using bilinear interpolation.
A basic grid in the range [−1, 1] is generated from the input image to remember the positions of the original pixels. Note that the coordinates of the top-left and bottom-right pixels are denoted by (−1, −1) and (1, 1), respectively. Let α represent the relative offset map and β the basic grid; then the new offset map ᾱ is given by:

ᾱ(c, i, j) = α(c, i, j) + β(c, i, j),

where c represents the channel and (i, j) the respective coordinates. The offset map ᾱ(c, i, j) is mapped to the size of the image such that i ∈ [0, W] and j ∈ [0, H], where H and W represent the height and width of an image I. Thus, the pixel value at (i, j) of the rectified image Ī is given by:

Ī(i, j) = I(ᾱ(1, i, j), ᾱ(2, i, j)).

Note that ᾱ(1, i, j) and ᾱ(2, i, j) sample the first and second coordinates from the first and second channels of the offset map, respectively.
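The sampling step Ī(i, j) = I(ᾱ(1, i, j), ᾱ(2, i, j)) can be sketched in numpy. For brevity, this toy version works in pixel (not normalized) coordinates and uses nearest-neighbor sampling, whereas MORN [39] uses bilinear interpolation; the function name and example grid are ours:

```python
import numpy as np

def rectify_nearest(img, offset_map):
    """Resample an image through a two-channel offset map: output
    pixel (i, j) is pulled from the source coordinates stored in
    channels 1 and 2 of the map, clamped to the image bounds."""
    h, w = img.shape
    out = np.zeros_like(img)
    for i in range(h):
        for j in range(w):
            si = min(max(int(round(offset_map[0, i, j])), 0), h - 1)
            sj = min(max(int(round(offset_map[1, i, j])), 0), w - 1)
            out[i, j] = img[si, sj]
    return out

# With the identity grid (each pixel samples itself), the image is
# unchanged; a learned offset map would instead undo tilt and shear.
img = np.arange(16, dtype=float).reshape(4, 4)
ii, jj = np.meshgrid(np.arange(4), np.arange(4), indexing="ij")
identity = np.stack([ii, jj]).astype(float)
rectified = rectify_nearest(img, identity)
```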

License Plate Recognition
Recognition is the process of converting image text into editable text. For recognition, we used an attention sequence-to-sequence network [40], which takes the rectified image as input and predicts the license plate character sequence. The detailed architecture of the recognition network is given in Table 2. The model consists of an encoder and a decoder.
The encoder module uses a stack of convolutional layers (ConvNet) to extract rich and discriminative features. The ConvNet scans the rectified image and generates a feature map of height one through downscaling. The feature map is then split along the row axis and transformed into a feature vector v. Finally, a multi-layer bidirectional long short-term memory (BLSTM) network analyzes the feature vector. The BLSTM bidirectionally captures the long-term dependencies in v and generates a new feature vector v_n of the same length; see [40] for details.
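The shape flow of this encoder step can be illustrated in numpy: a height-1 feature map is split along its width into a sequence of column vectors for the BLSTM. The sizes (C = 256, W = 25) are illustrative assumptions, not the paper's exact values:

```python
import numpy as np

# Shape-flow sketch: the ConvNet collapses the rectified plate image
# to a height-1 feature map of shape (channels, 1, width); splitting
# along the width yields W column vectors, one per BLSTM time step.
C, W = 256, 25
feature_map = np.random.randn(C, 1, W)               # (channels, height, width)
sequence = [feature_map[:, 0, t] for t in range(W)]  # W vectors of length C
```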
The decoder is based on an attention sequence-to-sequence bidirectional model. It processes the encoder output v_n from left to right and from right to left and generates two output sequences, S_lr and S_rl, respectively. The two results are merged on the basis of the highest recognition score.
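The merging step reduces to keeping the higher-scoring of the two decoded hypotheses. A trivial sketch (the strings and scores below are invented for illustration, not real model outputs):

```python
def merge_hypotheses(s_lr, s_rl):
    """Keep the higher-scoring of the left-to-right and right-to-left
    decoder outputs, each given as a (text, score) pair."""
    text_lr, score_lr = s_lr
    text_rl, score_rl = s_rl
    return text_lr if score_lr >= score_rl else text_rl

best = merge_hypotheses(("LEB-6723", 0.94), ("LEB-6728", 0.81))
```

Decoding in both directions lets each hypothesis correct characters the other direction misreads near its starting end.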

Datasets
There is no publicly available dataset for Pakistani license plates. Therefore, one of our main contributions is the creation of a new dataset, called the Pakistani License Plate Dataset (PLPD). The dataset contains 6000 images of license plates that vary in style, font, color, illumination, and view angle. Moreover, we generated ground truth (annotations) for each image. The annotation consists of the bounding box information (height, width, and midpoint) and the license plate text. Finally, the dataset is split into training and validation sets with a ratio of 80% to 20%, respectively.
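The 80%/20% split can be reproduced with a simple seeded shuffle. This is an illustrative helper, not the paper's actual tooling:

```python
import random

def split_dataset(items, train_ratio=0.8, seed=0):
    """Shuffle and split a dataset into training and validation sets.
    Seeding makes the split reproducible across runs."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]

# For the 6000-image PLPD this yields 4800 training / 1200 validation items.
train, val = split_dataset(range(6000))
```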
The performance of the proposed approach is also compared with other methods using two publicly available datasets: artificial Mercosur license plates (AMLP) [41] and the Roboflow license plate dataset (RLPD) [42]. AMLP [41] and the RLPD [42] contain 3839 and 350 images with ground truth annotations, respectively.

Performance Measures
The intersection over union (IoU) is used to evaluate the localization results. It estimates how much the predicted bounding box matches the ground truth and is given by:

IoU = Area(B_p ∩ B_gt) / Area(B_p ∪ B_gt),

where B_p and B_gt denote the predicted and ground-truth bounding boxes, respectively. The IoU returns a score in the range [0, 1]; a higher score indicates better localization and vice versa. Moreover, recall, precision, the F1 score, and accuracy are used to evaluate the recognition results.
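The IoU of two axis-aligned boxes can be computed directly from their corner coordinates. A standard sketch of the metric (the corner-tuple convention and example boxes are ours):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as
    (x1, y1, x2, y2) corner coordinates."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap extents are clamped at zero for disjoint boxes.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Two 10x10 boxes overlapping on half their width: inter = 50, union = 150.
score = iou((0, 0, 10, 10), (5, 0, 15, 10))
```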

Results
The model is trained for 300 epochs. The training and validation losses for localization are shown in Figure 3. In the first iteration, the training loss was 18 and the validation loss was 15.3. Both losses decreased over successive iterations and converged to 0.1127 by epoch 300. Similarly, Figure 4 shows the average intersection over union (IoU) and average recall scores for different iterations during training. The IoU and recall started at 0.2227 and 0.1 and achieved final scores of 0.8 and 0.9, respectively.

Figure 5 shows the localization and recognition results of the proposed approach for different images taken from the PLPD. The given images vary in style, font, and orientation. The license plate in Figure 5a contains an unknown font in which the digits are preceded by a space and two English alphabet characters. In Figure 5b,c, the license plates consist of two lines with variations in terms of hyphens and monograms. Similarly, the license plates in Figure 5d-l contain one line but vary in orientation, the number of alphabet characters, the number of digits, and their separating symbols (-, *).
The license plates in Figure 5d,k,l contain extra text, such as "GOVT OF SINDH", "GOVT OF PAKISTAN", and "ICT Islamabad", respectively. In Figure 5l, the extra phrases "CHAKWAL" and "SUZUKI CHAKWAL" can be seen, which look like license plates. However, the proposed pipeline correctly localized and recognized the license plates irrespective of the mentioned variations. In Figure 5, (a,b,g) are non-standard customized plates, while the remaining subfigures show standard plates with different regional designs; in (d,g), the license plate is recognized while the other text is ignored.

The results of the proposed approach were compared with two recent approaches: the deep automatic license plate recognition system (DALPR) [34] and the Korean license plate recognition system using combined neural networks (KLPR) [43]. These methods use deep learning to localize and recognize Indian and Korean license plates, respectively. Figure 6 shows the qualitative comparison of the proposed method with DALPR [34] and KLPR [43]. Figure 6a-d presents the results of DALPR [34], while the images in Figure 6e-l depict the results of KLPR [43] and the proposed method, respectively.
The qualitative results show that the proposed method was able to localize the license plate and recognize the text accurately, and it consistently outperformed the two other methods. Moreover, DALPR [34] failed to localize the license plates in certain cases, such as Figure 6b,d. Figure 7 depicts a complex situation in which the front of the car contains the license plate "JPK 6546" surrounded by three other number-plate-like tokens: "SBA 1234A" (top), "SBV 966S" (left), and "FBF 1234A" (right). In this situation, DALPR [34] failed to detect the license plate, while KLPR [43] drew two bounding boxes but failed to recognize even a single letter.
Note that the proposed method perfectly detected and recognized the license plate. Similarly, the second and third columns of Figure 7 present results under low light, namely evening (second column) and fog (third column) conditions. In both situations, KLPR detected the license plate with double bounding boxes and failed at recognition, while both DALPR and the proposed method worked well in the tested low-light situations. Table 3 compares the localization results in terms of the intersection over union (IoU) and shows that the proposed approach outperformed the other methods and achieved the best IoU scores. To evaluate the recognition performance, the proposed method was compared with DALPR [34] and KLPR [43] using three datasets: the Pakistani license plate dataset (PLPD), the artificial Mercosur license plates (AMLP) dataset [41], and the Roboflow license plate dataset (RLPD) [42].
Tables 4-6 present the quantitative analysis of the recognition results in terms of accuracy, recall, precision, and F1 score. These tables make clear that the proposed method outperformed the other methods and achieved the best scores. Note that DALPR [34] failed on the AMLP [41] and RLPD [42] datasets because it is designed to handle license plates of fixed length, standard color, and standard format. The proposed method is more general and robust to variations in license plate style and orientation.

Conclusions
In this paper, we presented a deep-learning-based approach for the localization and recognition of Pakistani license plates, which vary in font and style. The proposed method comprises three steps: localization, rectification, and recognition. The localization step detects the license plate in the image and extracts the region of interest. In some cases, the view angle between the camera and the car affects the license plate's orientation and leads to shearing effects, which make recognition difficult due to the resulting distortion and changes in style. To handle these shearing effects and uniformly align the text, rectification is used. Finally, recognition is performed.
The proposed pipeline was evaluated on a newly developed dataset of Pakistani license plates, the PLPD, which contains 6000 images covering a variety of Pakistani license plates. Extensive experiments were performed to compare the performance of the proposed approach with state-of-the-art deep learning methods. The results show that the proposed approach performs better than the other methods. In the future, this work may be extended to localize and recognize license plates written in other languages, such as Urdu and Pashto.