The proposed framework comprises license plate character segmentation (LPCS) and license plate character recognition (LPCR) modules. The proposed architectures of these modules are discussed in detail in this section.
3.1. License Plate Character Segmentation
Figure 2 shows the complete license plate character segmentation process. As discussed above, we use a part-based approach to accomplish this task. The first objective is to segment the region of interest (ROI) in order to reduce the image processing area and to discard redundant information. In the second part, the proposed character height estimation filter, together with connected component analysis (CCA), is applied to extract the required objects. These blocks are discussed in detail below.
The LP image detected by [
1] is fed to Part I for foreground and background classification. Distinguishing the background from the foreground is one of the most difficult and important steps in object segmentation. The proposed method strongly requires identification of the background/foreground color and polarity, as the subsequent segmentation steps are based on this information. The term polarity refers to bright or dark, based on the intensity of the foreground/background colors. We observed that most countries use a unique color for the background as well as for the foreground. The largest group of color candidates most likely belongs to the background region, while the foreground has the second-largest group; this is the key observation we use for background and foreground classification. We propose a model based on the RGB color space, which is well suited to identifying any color even under varying illumination conditions.
The RGB color space can be visualized as a cube whose corners are black, the three primaries (red, green, blue), the three secondaries (cyan, magenta, yellow), and white. The eight pure-color corners and the color distribution cube are shown in
Figure 3 [
36]. Considering the above-mentioned colors and color distribution, a model is proposed that maps all shades of color onto these 8 pure colors and can determine any background and foreground color, which is strongly required in the case of multinational VLPs. The total candidates for red, green, blue, yellow, cyan and magenta are computed by (1) as per the color definitions given in
Table 1, and the threshold value is determined using the color distribution cube and rigorous experiments on the 3718 images in the test dataset.
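As a minimal sketch of this counting step: since the thresholds of Table 1 and equation (1) are not reproduced here, the snippet below simply assigns each pixel to the nearest of the eight pure RGB-cube corners, which is an assumption about the exact color definitions.

```python
import numpy as np

# The eight pure-color corners of the RGB cube (black, white, primaries, secondaries).
PURE_COLORS = {
    "black":   (0, 0, 0),     "white":   (255, 255, 255),
    "red":     (255, 0, 0),   "green":   (0, 255, 0),
    "blue":    (0, 0, 255),   "yellow":  (255, 255, 0),
    "cyan":    (0, 255, 255), "magenta": (255, 0, 255),
}

def color_candidate_counts(rgb_image):
    """Count, per pure color, how many pixels lie nearest to that RGB-cube corner."""
    pixels = rgb_image.reshape(-1, 3).astype(np.float64)              # (N, 3)
    corners = np.array(list(PURE_COLORS.values()), dtype=np.float64)  # (8, 3)
    # Squared Euclidean distance from every pixel to every corner.
    dists = ((pixels[:, None, :] - corners[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)                                    # index of closest corner
    names = list(PURE_COLORS.keys())
    return {names[i]: int((nearest == i).sum()) for i in range(8)}

# Example: a tiny synthetic plate with a near-white background and a near-black
# "character" region; the background group dominates, as the method expects.
plate = np.zeros((4, 6, 3), dtype=np.uint8)
plate[:] = (250, 250, 250)
plate[1:3, 2:4] = (10, 10, 10)
counts = color_candidate_counts(plate)
```

On a real plate image, the two largest entries of `counts` would serve as the background and foreground candidates for the steps below.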
By using (1), we obtain the color candidate counts of red, green, blue, cyan, magenta and yellow. However, we cannot set any hard limit to separate the groups of black and white pixels, as the illumination level is unknown. The white and black pixel values are therefore extracted together as one group, and the candidate vector is obtained from the counts of all groups.
Then Otsu’s method, based on intra-class variance, is applied to separate them into two classes containing the black and white pixels respectively, as shown in
Figure 4. The larger and smaller of these two classes are then determined as
At this point, the background and foreground colors are known, because the largest candidate group belongs to the background while the second-largest group belongs to the foreground, as shown in
Figure 5a. This information is further used to determine the foreground polarity for post-processing, as it is necessary for adaptive thresholding and for most morphological operations.
The adaptive thresholding technique strongly requires prior knowledge of the foreground polarity to separate the background from the foreground precisely. This information is also needed by the morphological operations to obtain the required results. Algorithm 1 is used to accomplish this task, where max1 and max2 are the two largest candidate groups, representing the background and foreground respectively. The background and foreground polarities depend on the sequence of colors shown in
Figure 5b. For example, Ind-1 has a brighter polarity than Ind-2. CG2(Ind) is determined as
Algorithm 1 Foreground polarity detection process
Input 1: max1 % max1 represents the color pixel count belonging to the LP background
Input 2: max2 % max2 represents the color pixel count belonging to the LP foreground
Output: FP % foreground polarity
If (max1 ∈ CG1 & max2 ∈ CG1)
    If max1(Ind) > max2(Ind)
        FP ← bright
    Else
        FP ← dark
    End
Else if (max1 ∈ CG1 & max2 ∈ CG2) || (max1 ∈ CG2 & max2 ∈ CG1)
    FP ← find{CG2(Ind)}
End
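The branching above can be sketched in a few lines of Python. This is an illustrative reading of Algorithm 1 only: the representation of a group member as a `(group, ind)` pair, with a lower `ind` meaning a brighter entry in the Figure 5b color sequence, is an assumption.

```python
def foreground_polarity(max1, max2):
    """Sketch of Algorithm 1: decide foreground polarity from the two largest
    color-candidate groups. Each argument is a (group, ind) pair: group is
    "CG1" (chromatic colors) or "CG2" (black/white classes), and ind is the
    position in the brightness-ordered color sequence (lower ind = brighter)."""
    g1, ind1 = max1   # background group
    g2, ind2 = max2   # foreground group
    if g1 == "CG1" and g2 == "CG1":
        # Both chromatic: compare positions in the color sequence directly.
        return "bright" if ind1 > ind2 else "dark"
    # One chromatic, one black/white: polarity follows the CG2 member's index.
    cg2_ind = ind1 if g1 == "CG2" else ind2
    return "bright" if cg2_ind == 1 else "dark"
```

For instance, a white background (CG2, Ind-1) with a red foreground (CG1) yields a dark foreground polarity, which matches the intuition for common plates.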
The next step is to convert the RGB image into a binary image. For this purpose, the image is first converted into a gray image using the standard conversion computed by (8), and then real-time adaptive thresholding is applied for binary conversion using the local mean intensity (a first-order statistic), as given in (9), over a neighborhood of size Ns. This thresholding technique is more suitable than Otsu’s method, MET and many other intra-class-variance-based techniques for separating background and foreground, particularly under shadow and illumination variance. Adaptive thresholding requires prior knowledge of the foreground polarity, which was determined in the steps above.
Figure 6 shows the effect of prior knowledge of the foreground polarity. In
Figure 6a, the LP with a dark foreground is well thresholded, while in
Figure 6b, the LP with a bright foreground is well thresholded. For multinational vehicle LPs we must therefore have prior knowledge of the foreground polarities, since multinational VLPs have different background and foreground polarities. After obtaining the binary image, its background polarity is changed to bright, if needed, using (11).
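A minimal local-mean adaptive threshold in the spirit of (9) can be sketched as follows; the integral-image trick and the `fg_dark` flag are implementation choices, not the paper's exact formulation.

```python
import numpy as np

def adaptive_threshold(gray, ns=15, fg_dark=True):
    """Binarize using the local mean over an ns x ns neighborhood (first-order
    statistic). fg_dark encodes the known foreground polarity: if the
    foreground is dark, pixels below their local mean become 255 (object)."""
    pad = ns // 2
    padded = np.pad(gray.astype(np.float64), pad, mode="edge")
    # Integral image: each ns x ns window sum is then an O(1) lookup.
    ii = np.pad(padded.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0)))
    h, w = gray.shape
    ys, xs = np.arange(h)[:, None], np.arange(w)[None, :]
    sums = (ii[ys + ns, xs + ns] - ii[ys, xs + ns]
            - ii[ys + ns, xs] + ii[ys, xs])
    local_mean = sums / (ns * ns)
    g = gray.astype(np.float64)
    binary = (g < local_mean) if fg_dark else (g > local_mean)
    return binary.astype(np.uint8) * 255

# Example: a dark character patch on a bright plate background.
gray = np.full((20, 20), 200, dtype=np.uint8)
gray[8:12, 8:12] = 30
out = adaptive_threshold(gray, ns=7, fg_dark=True)
```

Flipping `fg_dark` for a bright-foreground plate reproduces the polarity-dependent behavior illustrated in Figure 6.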
The next step is to extract the ROI, which contains only the required characters, by discarding the redundant area of the license plate. Let I be a given set of objects of two types; then the required region
can be extracted as
where type 1 and type 2 represent the white and black objects, respectively.
Character height estimation is also one of the most important and difficult tasks in character segmentation for multinational LPs, as the LPs of different countries have different character sizes. It is also used later for skew correction and to eliminate the remaining redundant objects. Up to this step, a large redundant area of the LP has already been discarded by extracting the required background region. To accomplish this task, we propose an adaptive way to estimate the character height, as follows.
A set is maintained to keep the objects of a specific height, where the maximum height among all detected bounding boxes in the extracted region of the license plate serves as the reference. The targeted object is selected based on this height criterion. Each subsequent bounding-box object is compared with the current maximum-height object; if its height satisfies the criterion, it is considered a required object. If it does not fulfil the criterion, the previous maximum-height bounding box is discarded and the more recent one is taken as the maximum-height object for further processing. This process continues until all required objects are extracted. Following this approach, all larger and smaller objects are eliminated automatically and only the required objects, i.e., the license plate characters, remain.
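The height filter can be sketched as below. For simplicity this sketch uses the median height of all CCA bounding boxes as the reference and a relative tolerance `tol`; both are assumptions, since the paper instead tracks a running maximum with pairwise comparisons.

```python
def filter_by_height(boxes, tol=0.15):
    """Keep only bounding boxes whose height is close to the dominant character
    height. boxes is a list of (x, y, w, h) tuples from connected-component
    analysis. Using the median as the reference rejects both oversized objects
    (borders, frames) and undersized ones (bolts, dirt) in one pass."""
    if not boxes:
        return []
    heights = sorted(b[3] for b in boxes)
    h_ref = heights[len(heights) // 2]            # median height
    return [b for b in boxes if abs(b[3] - h_ref) <= tol * h_ref]

# Example: three character-sized boxes, one small bolt, one tall border artifact.
boxes = [(10, 5, 8, 20), (20, 5, 8, 21), (30, 5, 8, 19), (40, 30, 3, 4), (0, 0, 2, 60)]
chars = filter_by_height(boxes)
```

Only the three character-sized boxes survive; the bolt and the border artifact are eliminated automatically, mirroring the behavior described above.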
After finding the heights of the required objects, this information is used for skew correction by means of (14). The skew detection and correction problem is thus addressed in an efficient and accurate way, helping to enhance system performance at the segmentation as well as the recognition stage. In (14), the y-axis values of the right-most and left-most detected objects, together with the x-axis values of those objects, are used to estimate the skew angle.
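In the spirit of (14), a skew angle can be estimated from the left-most and right-most character boxes; using their top-left corners as reference points is an assumption, as the exact points in (14) are not reproduced here.

```python
import math

def skew_angle(boxes):
    """Estimate plate skew as the angle of the line joining the top-left
    corners of the left-most and right-most detected boxes.
    boxes is a list of (x, y, w, h) tuples."""
    left = min(boxes, key=lambda b: b[0])
    right = max(boxes, key=lambda b: b[0])
    xl, yl = left[0], left[1]
    xr, yr = right[0], right[1]
    return math.degrees(math.atan2(yr - yl, xr - xl))

# Characters drifting upward from left to right give a negative angle
# (image y grows downward); rotating by -angle would deskew the plate.
angle = skew_angle([(5, 30, 8, 20), (40, 25, 8, 20), (75, 20, 8, 20)])
```

The plate image would then be rotated by the negative of this angle before the final character extraction.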
The border-touching characters are separated using the information on the upper and lower boundaries of the objects in the set, as shown in Figure 8c. The final step is to remove the remaining redundant information and to obtain bounding boxes on the required objects, i.e., the LP characters; this task is done using (15),
which yields the final image that contains only the required LP characters.
The output of the LPCS processes discussed above can be seen in
Figure 7 and
Figure 8. To demonstrate the effectiveness of the proposed method,
Figure 9 presents some sample LP images that are blurred, noisy, shadowed, or affected by various illumination conditions. The proposed method is not capable of handling LPs with multi-color backgrounds/foregrounds, as shown in
Figure 10. As discussed above, most countries use a unique color for the background and for the foreground, and the method assumes that the largest color group belongs to the background while the second-largest belongs to the foreground; the rest of the processing is based on this assumption. This is why the method fails to segment the characters of such license plates.
3.2. License Plate Character Recognition
We propose a deep learning (DL) framework to accomplish the license plate character classification task, as DL has recently driven major advances in artificial intelligence. Deep learning is powered by neural networks. Convolutional neural networks (ConvNets or CNNs) are a category of neural networks that have proven very effective in areas such as image recognition and classification. They are made up of neurons with learnable weights and biases. Each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity. The whole network expresses a single differentiable score function, from the raw image pixels at one end to class scores at the other, and has a loss function on the last (fully connected) layer.
Figure 11 shows the proposed structure for character recognition of multinational vehicle LPs. First, the segmented image is decomposed into red, green and blue channels and passed through a polarity matching module so that the data are processed uniformly, eliminating the impact of the foreground polarity, which varies across multinational VLPs. To hierarchically learn features, these separated channel images are fed to CNNs to obtain output vectors, which are then concatenated to acquire enhanced image feature information. Before sending this output feature vector to the classifier, we introduce a data normalization module, which improves the output scores. Finally, the normalized feature vector is fed to the classifier to obtain the target labels.
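The front-end of this pipeline can be sketched as follows. Treating polarity matching as a simple intensity inversion of bright-foreground plates, and z-score normalization as the data normalization module, are both assumptions about the modules described above.

```python
import numpy as np

def prepare_channels(rgb_char, fg_bright):
    """Split a segmented character image into R, G, B channel images and apply
    polarity matching: bright-foreground plates are inverted so that every
    character reaches the CNNs with the same (dark-on-bright) polarity."""
    channels = [rgb_char[:, :, c].astype(np.float64) for c in range(3)]
    if fg_bright:
        channels = [255.0 - ch for ch in channels]   # flip polarity
    return channels

def fuse_and_normalize(feature_vectors):
    """Concatenate the per-channel CNN feature vectors and z-score normalize
    the fused vector before it is passed to the classifier."""
    fused = np.concatenate(feature_vectors)
    return (fused - fused.mean()) / (fused.std() + 1e-8)

# Example: a uniform dummy character image and two dummy CNN feature vectors.
img = np.full((4, 4, 3), 200, dtype=np.uint8)
chs = prepare_channels(img, fg_bright=True)
fused = fuse_and_normalize([np.array([1.0, 2.0]), np.array([3.0, 4.0])])
```

In the full system, each element of `chs` would be fed to its own CNN branch, and `fused` would go to the classifier described below.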
A pre-trained network is used as a starting point to learn the new task. Fine-tuning a network with transfer learning is usually much faster than training a network with randomly initialized weights from scratch, and learned features can be transferred quickly to a new task using a smaller number of training images. Some of the well-known pre-trained networks (AlexNet [
37], VGG-16 [
38], GoogLeNet [
39], ResNet-18 [
40], Inception v3 [
41]) have been trained on over a million images, can classify images into 1000 object categories, and have learned rich feature representations for a wide range of images. Speed is one of the most important parameters, as we are targeting real-time applications. With this constraint in mind, we choose the AlexNet CNN as our base network for transfer learning, as it is the most time-efficient of these five well-known CNNs, as shown in
Figure 12. The test time is measured over the whole test database to compare the speed of the networks for our particular application. The test database for the recognition part contains 21,717 license plate character images, extracted at the segmentation stage from 3718 license plate images from eight different countries.
In deep neural networks, higher-level convolutional layers have rich feature representations but deficient spatial information. In contrast, shallow layers preserve spatial information at the cost of less expressive features. Generally, the features from the last convolutional layer are passed to the FC layers and used for classification. Max-pooling is a downsampling strategy in convolutional neural networks, and spatial information may be lost during this downsampling; aggregating features from earlier layers offers a chance to recover it. Vanishing gradients are another problem that may occur in deep networks. For recognition and detection tasks, performance can be enhanced and information loss reduced by using the collective feature information of different convolutional layers, as suggested in [
42,
43,
44,
45]. In addition, it is noted that the convergence time is also improved by this concept of deep layer aggregation. Therefore, we propose an improved CNN along the lines of AlexNet, as shown in
Figure 13; the parametric details are given in
Table 2. In the improved CNN, the features of convolutional layers 4 and 5 are merged, and one extra max-pooling layer is introduced to match the feature dimensions of the two layers.
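The aggregation step can be sketched with plain arrays. The AlexNet-like shapes below (a 384-channel 13x13 conv4 map and a 256-channel 6x6 pooled conv5 map) are illustrative assumptions; the paper's exact layer sizes are in Table 2.

```python
import numpy as np

def maxpool2x2(fmap):
    """2x2 max-pooling with stride 2 over a (C, H, W) feature map; odd
    trailing rows/columns are trimmed first."""
    c, h, w = fmap.shape
    trimmed = fmap[:, :h - h % 2, :w - w % 2]
    return trimmed.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def merge_conv4_conv5(conv4, conv5):
    """Sketch of the improved CNN's aggregation: conv4 is passed through an
    extra max-pooling layer so its spatial size matches the pooled conv5 map,
    then the two are concatenated along the channel axis."""
    pooled4 = maxpool2x2(conv4)
    assert pooled4.shape[1:] == conv5.shape[1:], "spatial dims must match"
    return np.concatenate([pooled4, conv5], axis=0)

# conv4: (384, 13, 13) -> pooled to (384, 6, 6); conv5 already pooled to (256, 6, 6).
merged = merge_conv4_conv5(np.zeros((384, 13, 13)), np.zeros((256, 6, 6)))
```

The merged map then feeds the FC layers, so the classifier sees both the richer conv5 features and the spatially finer conv4 features.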
The next most important step is to choose the classifier that accepts the feature vector from the CNN feature learning module and generates the output labels. In [
46,
47,
48], the authors claim that the SVM is a strong and fast classifier for real-time classification applications, and great attention has been paid to the fusion of neural networks and SVMs [
49,
50]. For this reason, the same approach is used in our proposed system. For the support vector machine algorithm, the kernel and the hinge loss function described in (16) and (17) are used [
51].
where G(xj, xk) is element (j, k) of the Gram matrix, and xj and xk are p-dimensional vectors representing observations j and k in X.
where f(x) = xβ + b, β is a vector of p coefficients, x is an observation from the p predictor variables, and b is the scalar bias.
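The two definitions above can be made concrete in a few lines; a linear kernel is used for the Gram matrix here as an illustrative default (an RBF or polynomial kernel could be substituted).

```python
import numpy as np

def gram_matrix(X, kernel=lambda a, b: float(a @ b)):
    """Gram matrix in the sense of (16): G[j, k] = kernel(x_j, x_k) for the
    rows of X, where each row is a p-dimensional observation."""
    n = X.shape[0]
    return np.array([[kernel(X[j], X[k]) for k in range(n)] for j in range(n)])

def hinge_loss(X, y, beta, b):
    """Mean hinge loss in the sense of (17): max(0, 1 - y * f(x)) averaged
    over the observations, with f(x) = x @ beta + b."""
    margins = y * (X @ beta + b)
    return float(np.maximum(0.0, 1.0 - margins).mean())

# Example: two 2-dimensional observations with labels +1 and -1.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
G = gram_matrix(X)
loss = hinge_loss(X, np.array([1.0, -1.0]), np.array([2.0, 0.0]), 0.0)
```

The first observation is classified with a margin above 1 (zero loss), while the second sits on the decision boundary and contributes a loss of 1, giving a mean of 0.5.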