Real-Time Vehicle Make and Model Recognition System

A Vehicle Make and Model Recognition (VMMR) system can provide great value for vehicle monitoring and identification by using a vehicle's appearance in addition to conventional recognition of its attached license plate. A real-time VMMR system is an important component of many applications, such as automatic vehicle surveillance, traffic management, driver assistance systems, traffic behavior analysis, and traffic monitoring. A VMMR system faces a unique set of challenges. A few of these are image acquisition; variations in illumination and weather; occlusions, shadows, and reflections; the large variety of vehicles; inter-class and intra-class similarities; and the addition and deletion of vehicle models over time. In this work, we present a unique and robust real-time VMMR system that can handle the challenges described above and recognize vehicles with high accuracy. We extract image features from vehicle images and create feature vectors to represent the dataset. We use two classification algorithms, Random Forest (RF) and Support Vector Machine (SVM). We test and evaluate the proposed VMMR system on a realistic dataset whose vehicle images reflect real-world situations. The proposed VMMR system recognizes vehicles on the basis of make, model, and generation (manufacturing years), whereas existing VMMR systems identify only the make and model. Comparison with existing VMMR research demonstrates superior performance of the proposed system in terms of recognition accuracy and processing speed.


Introduction
The transportation of goods and people is a vital activity in the contemporary world. Transportation contributes to economic prosperity and quality of life. It also has adverse effects such as pollution, resource consumption, fatigue due to driving and traffic congestion, and personal safety risks due to accidents. Projecting the global vehicle count is an inexact process, but studies have shown an exponential increase. The estimated current global vehicle count is over 1.2 billion and, according to studies, this number will cross 2 billion in 2035 [1] or in 2040 [2]. Due to the increasing number of vehicles, automated vehicle analysis is an important task in many applications.
The taxonomy of vehicle analysis is depicted in Figure 1. Vehicle analysis starts with vehicle detection. Once the vehicle is detected, we can classify it based on its class (car, bus, truck, etc.), make (Toyota, Honda, Ford, etc.), color (white, black, red, grey, etc.), or make and model (Toyota Corolla, Honda Accord, Ford Fusion, etc.). Autonomous vehicles and driver assistance, surveillance, traffic management, and law enforcement are a few of the applications that benefit from automatic vehicle analysis. It is infeasible for humans to monitor, observe, and analyze the ever-increasing number of vehicles manually, especially in urban environments. In contrast to a human operator, a computer vision application can monitor traffic for long periods without fatigue. The associated cost of computer applications is lower, and they can be scaled to achieve the desired performance/cost ratio. Automatic License Plate Recognition (LPR) systems are common computer vision applications that are widely deployed across the world. LPR is a well-understood problem with compelling recognition accuracy rates. LPR systems are installed in many countries for purposes such as law enforcement, electronic toll collection, crime deterrence, and traffic control. LPR systems identify a vehicle based on its attached license plate. However, when two license plates are swapped, the LPR system will still recognize both license plates but is inherently incapable of recognizing the true identity of either vehicle. License plates can be easily forged, occluded, and damaged. Three examples where it is nearly impossible to recognize the identity of the vehicle are given in Figure 2.
In the absence of an augmenting system that links license plate numbers to a vehicle make and model, current LPR systems remain vulnerable to many malicious attacks. In many police activities, such as responding to a hit-and-run accident, an AMBER Alert, or a hot pursuit, the vehicle make and model are typically available regardless of the lighting conditions. The license plate number might also be recognized by an eyewitness, but sometimes it is not observed or is only partially observed. A Vehicle Make and Model Recognition (VMMR) system provides great value in terms of vehicle monitoring and identification based on the appearance of the vehicle instead of the attached license plate.
Authorities can query the VMMR system based on the vehicle's description or a partial license plate number to find all similar vehicles in a specified area during a particular time. Hence, LPR and VMMR systems can be used to complement each other.
The VMMR problem can be treated as a multi-class image classification problem, where each class represents a specific make and model. However, VMMR involves more numerous and more diverse challenges than many other classification problems. A few of the challenges are listed below [5]:
1. Image acquisition in an outdoor environment.
2. Varying and uncontrolled illumination conditions.
3. Varying and uncontrolled weather conditions.
4. Occlusion, shadows, and reflections in captured images.
5. A wide variety of available vehicle appearances.
6. Visual similarities between different models of different manufacturers.
7. Visual similarities between different models of the same manufacturer.
8. Tiny differences depending on the generation (group of consecutive manufacturing years).
The vehicle images used in our work reflect real-world situations, as they are captured in diverse weather conditions, with different lighting exposures, with partial occlusion (e.g., by pedestrians), and from different viewing angles. The underlying goal is to explore the ability of supervised learning to solve the applied computer vision problem of identifying the make, model, and manufacturing years of vehicles under the stringent constraints of the problem environment. The proposed VMMR system classifies vehicle images based on make, model, and manufacturing years, whereas existing VMMR systems identify only the make and model. A vehicle model typically keeps the same design for about five years before it is modified. We use the term generation to describe a vehicle model that has the same physical appearance but is manufactured over one or more years. This article is organized as follows: Section 2 discusses the related work. The detailed system design, along with feature extraction, machine learning techniques, and VMMR datasets, is discussed in Section 3. The efficiency and performance of the proposed VMMR system are discussed in Section 4. Section 5 concludes the paper and provides directions for future work.

Vehicle Detection
Vehicle detection is the basis for vehicle classification problems. Vehicle detection confirms the presence of a vehicle in an image and extracts the region of interest to eliminate the background scene. In some cases, it is not effective to use the complete vehicle as input to the classifier, and only the desired region (taillights, front lights, bumper, license plate, etc.) is extracted and used. Eliminating the background and the unwanted portions of the vehicle enhances vehicle classification performance. Huang et al. use background subtraction to extract the moving objects and apply image processing to discard unwanted image regions [6]; they then train a deep belief network to detect the vehicles. Lu et al. use the YCbCr color space to model the background frame and the Choquet integral to fuse texture features with color features [7]. An adaptive selective background maintenance model is used to handle complex conditions and variations. Faro et al. use luminosity sensors to detect sudden variations in illumination without affecting the time performance; a background subtraction technique differentiates the vehicles from the background and a segmentation scheme is applied to eliminate occlusion [8]. Chen et al. compute Speeded-Up Robust Features (SURF) for the original and mirrored images and compute similarities between the SURF features to find horizontal symmetry [9]. A center line is determined; every set of symmetrical SURF points and its center line represents a possible vehicle candidate. The shadow region is used to filter out weak candidates. A comprehensive survey of a wide range of vehicle detection techniques can be found in [10].

Vehicle Type Recognition
Vehicle Type Recognition (VTR) classifies vehicles into broad categories such as car, bus, van, truck, and bike; the exact make and model of the vehicle are not identified in VTR. An automated VTR system is helpful in applications such as urban traffic studies and analysis and electronic toll collection. Wang et al. use geometrical information to construct features and adopt simple Euclidean distance-based matching to categorize vehicles into three types [11]. Dong et al. propose a two-level semi-supervised Convolutional Neural Network (CNN) to learn local and global features and use softmax regression to categorize vehicles into six classes [12]. Fu et al. propose a VTR system based on hierarchical multi-SVMs that can handle complex traffic scenes and partial occlusion [13]. Irhebhude et al. combine a local binary pattern histogram, Histogram of Oriented Gradients (HOG), and region features and use correlation-based feature selection to select discriminative features [14]. They use a support vector machine (SVM) to classify vehicles into four categories.

Vehicle Make and Model Recognition
Classical VMMR research classifies vehicles based on make and model only. Classical systems use local features to represent the vehicle's region of interest and, in some cases, require these features to be converted into a global feature representation. The Scale Invariant Feature Transform (SIFT) [15], SURF [16], and HOG [17] feature extraction techniques are used by many researchers. The Nearest Neighbors Classifier (NNC), Artificial Neural Networks (ANN), and the Support Vector Machine (SVM) are the most widely used classifiers for VMMR systems.
Boukerch et al. present and evaluate a real-time VMMR system in [18]. SVM is used both as a single multi-class classifier and as an ensemble of multi-class classifiers. In this approach, the authors build a dictionary of SURF features for global representation. They evaluate two dictionary-building approaches, a single dictionary and a modular dictionary, and report an accuracy rate of 94.5% with a processing speed of 7.4 images per second. Noppakun Boonsim and Simant Prakoonwit propose a one-class classifier-based approach for limited lighting conditions [19]. The proposed approach uses a one-class SVM, a decision tree, and k-Nearest Neighbors (KNN) for classification, and a majority vote of the three is used for the final prediction. They use rear-view images to evaluate their proposed system. A grid-based method is used for shape features, and aspect ratios of different attributes of the taillights and license plate are used to represent geometrical features. A genetic algorithm is used for feature subset selection, which improves the accuracy slightly from 93.4% to 93.8%.
Edge-based features are explored in [20][21][22][23][24]. In these approaches, dependence on edges can cause the system to fail under occlusion. Petrovic et al. concatenate raw pixels, Sobel edges, edge orientation, Harris corner response, normalized gradient, and other image features to build a feature vector and apply principal component analysis to reduce its dimensionality [20]. The Nearest Neighbors method is used to classify the vehicle makes and models. Pearce et al. use KNN and Naïve Bayes for classification and use Canny edges, Harris corners, and Square Mapped Gradient (SMG) to construct the feature vector [21]. They propose to concatenate Locally Normalized Harris Strengths (LNHS) or SMG for global representation. The authors use a small and simplistic dataset to evaluate the proposed system. Vajas et al. [22] also use concatenated SMG for global representation, and Clady et al. [23] use concatenated oriented contour points from Sobel edges. Both Vajas and Clady use Nearest Neighbors as the classifier for their proposed VMMR systems. Munroe et al. use Canny edges and classify using several techniques such as KNN, ANN, C4.5, and decision trees [24].
SIFT-based VMMR systems are proposed in [25][26][27][28]. Psyllos et al. use a two-step approach [28]. They use phase congruency to identify the vehicle logo and then SIFT features to identify the specific model. Probabilistic Neural Networks are used for classification. The authors test the proposed approach on simple, non-occluded images, and different viewpoints and variations in illumination are not considered. Even so, a low accuracy rate of 54% is reported. Dlagnek uses SIFT and a brute-force matching algorithm in his work [25]. The exhaustive matching used in this work is very time-consuming. Baran et al. use SIFT, SURF, and HOG features and define dictionaries for global feature representation [26]. They use a multi-class SVM with very large dictionaries to represent the input images. Fraz et al. extract SIFT features and form a lexicon comprising all the features from the training dataset as words [27]. A Fisher-encoded representation is used to compute the lexicon for the SIFT image features. The Fisher encoding scheme is computationally expensive, and the authors report a processing time of about 0.4 s per image. Jang et al. use SURF features and a bag-of-words model for global feature representation [29]. The authors create a dataset using multiple toy cars and use a structured matching technique for classification.
A global feature representation based on a grid pattern is proposed in [9,30]. Hsieh et al. divide the input image into a grid and compute SURF and HOG for each block independently [30]. The authors train an ensemble of SVMs and combine the results to determine the final decision. Chen et al. [9] compute HOG features for the grid-based pattern and concatenate them for global representation. By testing our system with their dataset, we show that our system performs well in terms of recognition accuracy and processing speed. The grid-based schemes assume a fixed camera and are prone to failure when the camera height, pitch, or yaw changes, resulting in vehicle views for which the system might not be trained.

Dataset
Pearce and Pears [21] create a dataset of 177 images from 21 vehicle classes; each vehicle class consists of five or more images. Eighty-five more images are added from other uncommon vehicle classes (53 classes), each class having one or two images. The testing dataset is created by applying a 'leave-one-out' scheme over the 177 images. Jang and Turk [29] use 20 toy cars to create their dataset. They capture images from 16 different viewpoints for each toy car. The training dataset is created with 2650 images, and the proposed VMMR system is tested with 801 images. Psyllos et al. [28] create a training dataset with 10 classes, each class having five images. An unknown class with five images is also added to the training dataset. The testing dataset is created with the same pattern, except that each class contains 10 images. Jonathan Boyle and James Ferryman [31] create a dataset of side-view vehicle images. This dataset comprises more than 10,000 images with 86 make/model categories. The authors use 50% of the images of each class, with an upper limit of 200 images, for training, while the rest of the images are used for testing. Baran et al. [26] create their dataset by downloading images from the internet or capturing images outdoors. The training dataset has 17 vehicle classes and 80 images for each class; thus, 1360 images are available for the training process. The testing dataset consists of 2499 images, which have degraded quality and lower resolution compared to the training dataset. There is no occlusion in the images.
We adopt a realistic and publicly available dataset, the National Taiwan Ocean University Make and Model Recognition (NTOU-MMR) dataset [9]. The NTOU-MMR dataset is used in multiple studies, such as [9,18]. Chen et al. [9] propose a VMMR system using symmetrical SURF; they created the dataset under the Vision Based Intelligent Environment (VBIE) project [32] in 2014 and named it the NTOU-MMR dataset. The dataset can be accessed at [33]. Jabbar et al. [18] use the NTOU-MMR dataset in their work. Hence, we can compare our VMMR performance with Chen et al. [9] and Jabbar et al. [18]. The dataset is divided into testing and training datasets, and vehicles are divided into twenty-nine classes based on make and model. Vehicles belonging to six manufacturers are available in the NTOU-MMR dataset: Toyota, Ford, Mitsubishi, Honda, Suzuki, and Nissan. A few sample images are shown in Figure 3 to illustrate the variability of the dataset. The NTOU-MMR dataset has the following characteristics, which motivated us to use it to train and evaluate our VMMR system. We believe the five listed characteristics bring the NTOU-MMR dataset closer to real-life situations than other available datasets:

• The dataset contains images of stationary and moving vehicles with a speed of up to 65 km/h.
• The dataset images contain vehicles at several viewing angles, ranging from +20 degrees to −20 degrees relative to a view directly from the front of the vehicle.
• The dataset images are captured during both daytime and nighttime.
• The dataset is created under varying weather conditions: sunny, rainy, and cloudy.
• Some of the vehicles are partially occluded by an irrelevant object such as a pedestrian.

Vehicle classes are defined based on make and model in this dataset; we added generation to fit the dataset to our objective. We have divided the dataset into classes based on the make, model, and generation of vehicles, so the dataset is now divided into thirty-five classes. A detailed description of the dataset is given in Table 1, which lists the number of images available for the training and testing processes. The NTOU-MMR dataset has a few problems, such as the small number of training images for a few categories. There are no testing images available for the Toyota RAV4 in the original dataset. We use the dataset as it is available, except for the change made in the number of classes.

Hardware and Software Platform
We used an Intel® Core™ i7 processor (3.4 GHz) with 16 GB of RAM to perform all our experiments. The VMMR system is implemented in MATLAB® R2012a on a 64-bit Microsoft Windows 7 operating system.

Methodology
We develop a real-time VMMR system based on machine learning and computer vision techniques. The input to the system can be either images or videos. In the case of videos, frames can be extracted at regular intervals. The vehicle detection module verifies the existence of a vehicle in the current image and also localizes the vehicle within the image. If a vehicle exists in the image, it is processed further; otherwise, the image is discarded. We define the region of interest (ROI) as the part of the vehicle in the image that provides discriminative and prominent features, that is, features that are easily distinguishable between different vehicles. We use frontal vehicle images to design the VMMR system and use the bumper, front lights, and bonnet area as the ROI, as shown in Figure 3. The ROI extraction module removes the background, as well as any part of the vehicle that is not helpful during classification and can degrade classification performance. We use the technique proposed by Chen et al. [9] for vehicle detection and ROI extraction.
The next step is to extract features from the input image. Features provide an image representation that is more suitable for computational models such as machine learning and pattern recognition applications. The features are used to represent a scene or the uniqueness of an object. Once the prominent interest points are determined, feature descriptors are computed for the interest points. The feature descriptor provides a robust and invariant representation of the interest points. Various feature extraction techniques and descriptors are available in the literature. We use HOG and GIST image features in this work. Feature extraction techniques may extract a variable number of features from an image. Global feature representation is the process of combining all the extracted feature points to represent an entire image. It generates an image feature vector that represents all images with the same dimensionality and the same pattern. Lastly, the VMMR system uses supervised learning to perform image classification. The classifier is trained using image feature vectors generated for the training dataset, and new incoming observations are then categorized using the trained classifier. The classifier's performance can be measured in terms of correctly identified, incorrectly identified, and missed observations. The testing dataset contains new, unseen observations, which are used to determine the classifier's performance in terms of a successful recognition rate.
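The processing chain described above can be sketched as a small skeleton. This is only an illustration of the control flow, not the authors' MATLAB implementation: the stage callables `detect`, `extract`, and `classify`, the list-of-rows image type, and the box layout are all hypothetical placeholders.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Image = List[List[float]]          # grayscale image stored as rows of pixel values
Box = Tuple[int, int, int, int]    # hypothetical ROI layout: (top, left, height, width)


def sample_frames(num_frames: int, step: int) -> List[int]:
    """Indices of the video frames taken at a regular interval."""
    return list(range(0, num_frames, step))


def crop(image: Image, box: Box) -> Image:
    """Keep only the ROI, discarding background and unhelpful vehicle parts."""
    top, left, height, width = box
    return [row[left:left + width] for row in image[top:top + height]]


def recognize(image: Image,
              detect: Callable[[Image], Optional[Box]],
              extract: Callable[[Image], List[float]],
              classify: Callable[[List[float]], str]) -> Optional[str]:
    """Detect -> ROI -> feature vector -> class label; None if no vehicle is found."""
    box = detect(image)
    if box is None:                # no vehicle: the image is discarded
        return None
    return classify(extract(crop(image, box)))


def recognition_rate(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of test observations whose predicted label matches the true label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)
```

Passing the stages as callables keeps the skeleton independent of any particular detector, descriptor, or classifier, so the same loop can be reused for each feature and classifier combination.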

Feature Extraction and Representation
We divide image features into two categories for VMMR problems: local features and global features. A local feature is defined on the basis of a prominent point or patch of the image, so an image can have a variable number of local features. A global feature is computed either on the entire image or on every part of the image, so every image in the dataset has the same number of global features. Global feature representation techniques are applied to combine local features into an image feature vector with the same dimensionality and pattern for every image. In the case of global features, all the features are simply concatenated to create an image feature vector. VMMR systems based on local features are reported in our previous works [5,34]. We found no significant performance gain (in recognition rate or processing speed) from using only local features. Our experimentation reveals that the global feature representation step (required for local features) decreases the processing speed. In this work, we use HOG and GIST features, which use entire images and not just prominent local points.

Histogram of Oriented Gradients
The HOG feature, introduced by Dalal and Triggs [17] for robust human detection, has been widely used for object recognition. Every object or shape in an image is composed of a collection of lines (edges); thus, we can describe objects within an image by using the distribution of gradient orientations (directions). HOG divides the image into small connected regions known as cells (grouped into overlapping blocks in the standard formulation); the gradient directions are computed for every pixel in a cell and a histogram of the gradient directions is created. The histograms of all cells are concatenated to generate the final HOG descriptor. Normalization of the feature vector results in better invariance to changes in illumination and shadowing. The histogram representation lessens the impact of noise. We use different window sizes to create the HOG feature descriptor in our work. We do not combine multiple windows to construct a bigger overlapping block. We therefore create a HOG descriptor that is smaller than the standard HOG descriptor, which improves the processing speed. The standard HOG creates a feature descriptor of 3780 elements for a 128 × 64 image, whereas a HOG descriptor without overlapping has 1152 elements for the same image and the same configuration. Not all images in the NTOU-MMR dataset have the same dimensions; instead of using fixed-size cells, we divide every image into an equal number of cells. This technique lets us create an image feature vector using simple concatenation, without applying a global feature representation technique.
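The two descriptor sizes quoted above, and the idea of dividing every image into the same number of cells, can be checked with a short sketch. It assumes the usual Dalal and Triggs settings (8 × 8 pixel cells, 2 × 2 cell blocks with a one-cell stride, 9 orientation bins); the per-cell L2 normalization and the function names are illustrative assumptions, not the paper's exact configuration.

```python
import math
from typing import List

Image = List[List[float]]  # grayscale image stored as rows of pixel values


def descriptor_length_standard(height: int, width: int,
                               cell: int = 8, block: int = 2, bins: int = 9) -> int:
    """Overlapping blocks with a one-cell stride (assumed standard HOG layout)."""
    cells_y, cells_x = height // cell, width // cell
    blocks = (cells_y - block + 1) * (cells_x - block + 1)
    return blocks * block * block * bins


def descriptor_length_no_overlap(height: int, width: int,
                                 cell: int = 8, bins: int = 9) -> int:
    """One histogram per cell, no overlapping blocks."""
    return (height // cell) * (width // cell) * bins


def hog_fixed_grid(image: Image, cells_y: int, cells_x: int, bins: int = 9) -> List[float]:
    """HOG-like descriptor over a fixed cells_y x cells_x grid, whatever the image size."""
    height, width = len(image), len(image[0])
    hist = [[[0.0] * bins for _ in range(cells_x)] for _ in range(cells_y)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = image[y][x + 1] - image[y][x - 1]      # central differences
            gy = image[y + 1][x] - image[y - 1][x]
            angle = math.degrees(math.atan2(gy, gx)) % 180.0   # unsigned orientation
            b = min(int(angle / (180.0 / bins)), bins - 1)
            # The cell index scales with the image size, so the grid is always the same.
            hist[y * cells_y // height][x * cells_x // width][b] += math.hypot(gx, gy)
    descriptor: List[float] = []
    for row in hist:
        for cell_hist in row:
            norm = math.sqrt(sum(v * v for v in cell_hist)) + 1e-6  # assumed L2 norm per cell
            descriptor.extend(v / norm for v in cell_hist)
    return descriptor
```

For a 128 × 64 image, the first function gives 15 × 7 blocks × 4 cells × 9 bins = 3780 and the second gives 16 × 8 cells × 9 bins = 1152, matching the figures in the text. Because `hog_fixed_grid` fixes the number of cells instead of their pixel size, images of different dimensions yield vectors of identical length that can be concatenated directly.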

Gist Feature Descriptor
Humans are capable of classifying an object or scene at a glance without considering the details present in an image. For example, after viewing an image of tall buildings, trees, or an ocean, we can instantly recognize the scene without thinking about the details or the existence of other objects. The GIST of a scene [35] refers to the information content gathered at a glance. The GIST feature descriptor is "a low dimensional representation of the scene, which does not require any form of segmentation" [36]. The GIST descriptor was initially proposed for scene classification. SIFT and SURF focus on individual prominent points, and the HOG feature descriptor is computed on individual windows (patches) and concatenated later, whereas the GIST descriptor focuses on the shape of an entire image as a single object and calculates the feature vector. The GIST descriptor ignores the presence of local objects and their relationships. Therefore, GIST provides a holistic representation of a scene. We create the GIST descriptor with four scales and eight orientations, creating thirty-two transformed images. A Gabor transform is applied to these thirty-two images to create feature maps, and each feature map is divided into a 4 × 4 grid (16 blocks), which generates a 512-dimensional feature vector (16 averaged values × 32 feature maps).
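The pooling step that produces the 512-dimensional vector can be illustrated as follows. The sketch covers only the grid averaging and concatenation; the Gabor filtering that produces the thirty-two feature maps is omitted, so the maps are taken as given inputs, and the function names are hypothetical.

```python
from typing import List

FeatureMap = List[List[float]]  # one filter response, stored as rows of values


def grid_average(feature_map: FeatureMap, grid: int = 4) -> List[float]:
    """Average a feature map over a grid x grid layout (needs at least grid rows and columns)."""
    height, width = len(feature_map), len(feature_map[0])
    sums = [[0.0] * grid for _ in range(grid)]
    counts = [[0] * grid for _ in range(grid)]
    for y in range(height):
        for x in range(width):
            gy, gx = y * grid // height, x * grid // width
            sums[gy][gx] += feature_map[y][x]
            counts[gy][gx] += 1
    return [sums[i][j] / counts[i][j] for i in range(grid) for j in range(grid)]


def gist_from_feature_maps(feature_maps: List[FeatureMap], grid: int = 4) -> List[float]:
    """Concatenate the grid averages of all feature maps into one descriptor."""
    descriptor: List[float] = []
    for feature_map in feature_maps:
        descriptor.extend(grid_average(feature_map, grid))
    return descriptor
```

With 4 scales × 8 orientations = 32 feature maps and 16 averaged values per map, the descriptor length is 32 × 16 = 512 regardless of the input image size.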

Classification
Designing a VMMR system requires identifying a specific vehicle in terms of its manufacturer, model, and generation. After feature extraction and image feature vector construction, the next step is to train the classifier that is used to recognize new incoming observations. During the training phase, classifiers learn the intra-class similarities (multiple vehicles belonging to the same category) and inter-class differences (vehicles belonging to different categories) and build a model that is used later to recognize unseen vehicles. Among the many available classifiers, none has been found to perform optimally across all types of applications [37]. Data scaling, the presence of outliers and noise, redundant attributes, overfitting, and underfitting are a few of the factors that affect the performance of most classifiers. We use SVM and Random Forest (RF) classification techniques in this work and comment on the effect of image imperfections.

Support Vector Machine
The SVM [38,39] is a supervised learning method and an efficient binary classifier. The SVM uses a subset of the training observations, known as Support Vectors (SVs), to represent the optimal separation between two classes. SVM is robust to overfitting, especially for datasets with high dimensionality (many features). SVM can handle nonlinearly separable data efficiently by using nonlinear kernel functions. Cover's theorem [40] states that data that are not linearly separable are more likely to become linearly separable when mapped nonlinearly into a higher-dimensional space. More details on the characteristics of kernel functions and their construction can be found in [41]. SVM is a memory-intensive algorithm, and selecting a kernel can be tricky.
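For reference, the optimal separation mentioned above is commonly written as the soft-margin problem below. This is the standard textbook formulation and is not quoted from this paper; the symbols $w$, $b$, $\xi_i$, $x_i$, $y_i$, and $n$ are notation introduced here for illustration, and $C$ is the regularization parameter varied in the experiments.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i \\
\text{subject to}\quad & y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1,\dots,n
\end{aligned}
```

Here $y_i \in \{-1, +1\}$ is the class label of observation $x_i$ and the margin width is $2/\lVert w\rVert$. A larger $C$ penalizes the slack variables $\xi_i$ more heavily, which generally pushes the solution toward a larger $\lVert w\rVert$ and therefore a narrower margin. The support vectors are the training observations for which the constraint is active.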
An ensemble of binary classifiers is used to create a multi-class SVM classifier. One-versus-rest [42] and one-versus-one [43] are two approaches used for multi-class SVM classification. The results of the binary SVMs can be combined in different ways for classification, such as majority voting, least-squares-error weighted outputs, and double-layer hierarchical combination. We use the one-versus-one approach in our work. Hsu and Lin compare both approaches in [44] and conclude that they have comparable performance, except that one-versus-one requires less training time than one-versus-rest.
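A minimal sketch of one-versus-one classification with majority voting is given below. The binary decision function is a hypothetical stand-in (a one-dimensional nearest-centroid rule) so that the example is self-contained; in the actual system each pairwise decision would come from a trained binary SVM, and the tie-breaking rule here is an assumption.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Dict, Sequence


def num_pairwise_classifiers(num_classes: int) -> int:
    """One binary classifier per unordered pair of classes."""
    return num_classes * (num_classes - 1) // 2


def one_vs_one_predict(x: float, classes: Sequence[str],
                       decide: Callable[[str, str, float], str]) -> str:
    """Let every pairwise classifier vote and return the class with the most votes."""
    votes: Counter = Counter()
    for a, b in combinations(classes, 2):
        votes[decide(a, b, x)] += 1
    # Ties are broken in favor of the class listed first (an arbitrary assumption).
    return max(classes, key=lambda c: votes[c])


def centroid_decider(centroids: Dict[str, float]) -> Callable[[str, str, float], str]:
    """Toy binary rule standing in for a trained binary SVM: pick the nearer centroid."""
    def decide(a: str, b: str, x: float) -> str:
        return a if abs(x - centroids[a]) <= abs(x - centroids[b]) else b
    return decide
```

For the thirty-five classes used in this work, `num_pairwise_classifiers(35)` gives 595 binary classifiers. Each of them is trained on the data of only two classes.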

Random Forest
RF classification is an ensemble learning approach proposed by Leo Breiman [45]. Weak binary decision trees are used to create the ensemble in RF classification. RF constructs a multitude of decision trees during the training process, and the final class is determined by the mode (majority vote) of the trees' outputs during the testing process. Ho [46] initially introduced the idea of the RF by using the Random Subspace method [47]. Breiman combined the creation of random subsets of the training data, known as bagging, with Ho's idea of randomly subsampling the training features to build the decision trees. The observations in datasets often have missing values; the features may have unavailable, corrupt, or invalid values. RF classifiers produce good results on missing data [48]. RF classification does not require tree pruning and can overcome the decision trees' problem of overfitting [48]. Increasing the number of decision trees reduces the overfitting problem but, on the other hand, also increases the training and testing time. RF classification is easily scalable and can model nonlinear decision boundaries naturally due to its hierarchical structure.
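The three ingredients named above (bagging, random feature subsampling, and majority voting) can be sketched in a few lines. This is an illustration under stated assumptions rather than a full forest: tree growing is omitted, the helper names are hypothetical, and the square-root rule for the number of attributes is the setting reported later in this paper for the RF-VMMR system.

```python
import math
import random
from collections import Counter
from typing import List, Sequence


def bootstrap_indices(num_samples: int, rng: random.Random) -> List[int]:
    """Bagging: draw num_samples training indices with replacement."""
    return [rng.randrange(num_samples) for _ in range(num_samples)]


def random_feature_subset(num_features: int, rng: random.Random) -> List[int]:
    """Pick about sqrt(num_features) distinct attribute indices."""
    k = max(1, int(math.sqrt(num_features)))
    return sorted(rng.sample(range(num_features), k))


def majority_vote(tree_predictions: Sequence[str]) -> str:
    """Final class is the mode of the per-tree predictions (first seen wins a tie)."""
    return Counter(tree_predictions).most_common(1)[0][0]
```

For example, with the 512-dimensional GIST vector, `random_feature_subset` selects 22 attributes, since the integer part of the square root of 512 is 22.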

Results and Discussion
The VMMR system uses two machine learning approaches: SVM and RF. Both image features, HOG and GIST, are used with each classifier, giving four combinations. The experiments are performed multiple times and the averaged results are presented here. We first discuss the computational time required for feature extraction. Then, we discuss the VMMR results for each combination of machine learning approach and feature extraction technique in terms of recognition rate and processing speed. Lastly, we compare our results with other VMMR research. The recognition rates are computed for VMMR only; the performance of vehicle detection and ROI extraction is not included in the recognition rates.
The computational time is an important factor for any real-time application. The measured computational times for GIST and HOG (with different configurations) are provided in Table 2. The total time required for computing each configuration is given in seconds per 100 images. As we increase the number of blocks in HOG, the computational time increases. The training and testing datasets undergo the same process for HOG and GIST features; hence, the computational time required is the same for the training and testing phases.
The RF is trained with configurations of 100, 150, 200, 250, 300, and 350 decision trees. We use the term RF-VMMR to refer to the VMMR system with RF classification. The other two important parameters used in RF training are the number of randomly selected (with replacement) samples used to grow each decision tree and the number of randomly selected attributes to consider for each decision tree; we select the values of these two parameters based on experimental analysis. The same parameters are used during the testing phase. We observe that the recognition rate decreases if we reduce the size of the training subset for RF. Hence, we use the whole training dataset to grow the decision trees. Similarly, we observe that RF performs better when the number of selected attributes equals the square root of the total number of attributes for our RF-VMMR system. The RF-VMMR recognition rates are shown in Figure 5. As depicted in Figure 5a,b, increasing the number of decision trees in the RF algorithm increases the recognition rate initially, but, beyond a certain threshold, a further increase in the number of decision trees negatively affects the recognition performance. Although this threshold is not fixed for all variations of the dataset representation, in our dataset the recognition rate decreased after 300 decision trees in most cases. This pattern, in which the recognition rate increases and then decreases as the number of decision trees grows, can be seen for every feature extraction technique and variation. Although there are small variations, the overall recognition rate follows the same pattern.
For the SVM, we use a linear kernel in this work. A linear kernel is a good option when the number of features is greater than the number of observations [49]. When applying the VMMR system to the NTOU-MMR dataset, we use 2750 training observations, and the number of features varies from 512 to 5000 for each observation depending on the feature extraction technique and configuration employed. Hsu et al. reported that mapping the data into a higher-dimensional space does not improve the performance when the number of features is large [49]. The linear kernel also results in faster training than other kernels [49]. The regularization parameter C is used to control the margin of the hyperplane separating two data classes. A larger value of C means a smaller margin between the two classes. We train the SVM algorithm with C = 2, 4, 6, 8, 10, and 12. We use the term SVM-VMMR to refer to the VMMR system with SVM classification. The recognition rates for the SVM-VMMR system are shown in Figure 7. As depicted in Figure 7a,b, increasing the number of blocks increases the recognition rate up to a certain point, after which the recognition rate decreases.
HOG achieves its best recognition rate of 94.43% with 33 × 6 blocks in the case of RF-VMMR (350 decision trees) and 97.89% with 33 × 9 blocks in the case of SVM-VMMR (C = 10). GIST achieves its best recognition rate of 94.53% in the case of RF-VMMR (300 decision trees) and 97.20% in the case of SVM-VMMR (C = 10). We can conclude that GIST and HOG perform similarly in terms of recognition rate for both RF-VMMR and SVM-VMMR.
All the parameters are determined using the training dataset. However, we report all the results (obtained on the testing dataset) to illustrate the effect of varying other parameters such as the number of trees, C, and the feature extraction configuration. The training of a VMMR system can be performed offline; hence, it does not impose any timing constraint on the final system, and the training time is not considered here. The processing speed (images per second) of the recognition process is provided in Table 3 for RF-VMMR and in Table 4 for SVM-VMMR. The processing speeds are computed from the accumulated time of the feature extraction and representation phase and the recognition phase. In both tables, the first column gives the feature extraction technique and the second column gives its configuration. The remaining six columns provide the processing speed for the different RF and SVM configurations. With HOG features, the SVM-VMMR system processes 13.9 frames per second. The processing speed drops to 10.1 frames per second when the computation time of the vehicle detection module is included. The processing speed for GIST is almost the same for SVM-VMMR and RF-VMMR, whereas, for HOG features, RF-VMMR is almost twice as fast as SVM-VMMR. The number of decision trees for RF and the value of C for SVM have very little effect on the computation time of the testing phase, as can be seen in Tables 3 and 4.
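The relation between per-stage time and processing speed can be made concrete with a short calculation. This is a sketch under the assumption that the stages run sequentially, so their per-image times add; the implied detection time is derived from the two reported speeds and is not a figure reported by the authors. The function name `images_per_second` is hypothetical.

```python
def images_per_second(*stage_seconds_per_image):
    """Throughput of a sequential pipeline from its per-image stage times."""
    return 1.0 / sum(stage_seconds_per_image)


# Reported speeds for SVM-VMMR with HOG features.
recognition_fps = 13.9     # feature extraction plus classification
with_detection_fps = 10.1  # the same, plus the vehicle detection module

recognition_s = 1.0 / recognition_fps
# Implied (not reported) per-image cost of the detection module,
# assuming the stages are executed one after another.
detection_s = 1.0 / with_detection_fps - recognition_s

print(f"recognition: {recognition_s * 1000:.1f} ms per image")
print(f"detection (implied): {detection_s * 1000:.1f} ms per image")
print(f"combined: {images_per_second(recognition_s, detection_s):.1f} images/s")
```

Under this assumption, recognition takes roughly 72 ms per image and detection adds roughly 27 ms.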
The difference between the processing speeds is due to the feature extraction and representation techniques. The processing speed of RF-VMMR with HOG features (33 × 6 blocks) is 35.7 images per second at a recognition rate of 94.43%, whereas the processing speed of SVM-VMMR with HOG features (33 × 9 blocks) is 13.9 images per second at a recognition rate of 97.89%. With GIST features, both the RF-VMMR and SVM-VMMR systems yield recognition rates similar to those obtained with HOG features. In addition, the processing speed is lower with GIST features than with HOG features. Feature extraction accounts for most of the computational time, and it can easily be parallelized to increase the processing speed of the system. Because the time required for feature extraction is very high compared to the RF testing process, increasing the number of decision trees does not affect the processing speed much, whereas the feature extraction configuration does. The value of C can affect the training time, but, once the system is trained, every new observation goes through the same type of computation. Hence, the processing speed remains the same for all values of C for a given feature extraction configuration.
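Because the features of each image are computed independently, the extraction step maps naturally onto a worker pool. The sketch below uses a placeholder descriptor (block mean intensities, which is neither HOG nor GIST) and Python's standard thread pool; the function names, the block layout, and the worker count are hypothetical. In CPython, threads only speed up CPU-bound work when the extractor runs in native code that releases the interpreter lock, so a process pool may be the better fit for a pure-Python extractor.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def block_means(image, blocks_x, blocks_y):
    """Placeholder descriptor: mean intensity per block (not HOG or GIST).

    Assumes the image has at least blocks_y rows and blocks_x columns.
    """
    height, width = len(image), len(image[0])
    features = []
    for by in range(blocks_y):
        for bx in range(blocks_x):
            rows = range(by * height // blocks_y, (by + 1) * height // blocks_y)
            cols = range(bx * width // blocks_x, (bx + 1) * width // blocks_x)
            values = [image[r][c] for r in rows for c in cols]
            features.append(sum(values) / len(values))
    return features


def extract_all(images, blocks_x, blocks_y, workers=4):
    """Extract features for every image; results keep the input order."""
    extractor = partial(block_means, blocks_x=blocks_x, blocks_y=blocks_y)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extractor, images))


# Tiny 4 x 4 synthetic "image" with pixel values 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
print(extract_all([image, image], blocks_x=2, blocks_y=2))
```

The pool returns one feature vector per image in input order, so the classifier stage can consume the results without any bookkeeping.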
The proposed VMMR systems are compared with nine other VMMR approaches in Table 5 with respect to recognition rate and processing speed. Our VMMR systems outperform the other VMMR systems in terms of both recognition rate and processing speed. The results for our proposed VMMR systems given in Table 5 are the best outcomes among all the feature extraction and machine learning algorithm variations. Chen et al. [9] and Jabbar et al. [18] also use the NTOU-MMR [33] dataset to test their work. Our SVM-VMMR system outperforms both of their systems in terms of recognition rate and processing speed. Most previous approaches use local features such as SIFT, SURF, edges, and corners to represent the images, which may be the reason for their poorer performance.

Conclusions and Future Work
This work presents a real-time VMMR system with better performance than existing VMMR systems in terms of recognition rate and processing speed. The publicly available NTOU-MMR dataset, which is based on realistic assumptions, is used in this work. The dataset is modified to include a vehicle's generation information along with its make and model. We use HOG and GIST to represent the images and SVM and RF to classify the vehicles. Our experimental analysis shows that the system is suitable for real-time applications while achieving a higher recognition rate. The proposed system works well in challenging situations where vehicles are partially occluded, partially out of the image frame, or poorly visible due to low lighting. This system can provide great value in terms of vehicle monitoring and identification based on vehicle appearance instead of the vehicles' attached license plate. Existing VMMR research focuses on recognizing only the make and model of a vehicle. We have included generation as another parameter. Thus, our VMMR system recognizes a vehicle and provides information about its make, model, and generation.
Although the proposed VMMR system outperforms the previous systems, it can be further enhanced. Image feature vectors have a large number of features (dimensions). Dimensionality reduction techniques can be explored to reduce this number. A better and larger publicly available dataset with more vehicle types would benefit research in this area. Deep learning techniques can also be explored with a bigger dataset.
Computer vision techniques are used to express images in fewer attributes that characterize vehicles. Machine learning techniques are used to classify the vehicles. The overall architecture of the vehicle classification system is given in Figure 4. The VMMR system is divided into two subsystems: a training subsystem and a testing/classification subsystem. The training subsystem trains the VMMR engine using a subset of the available dataset, whereas the classification subsystem recognizes the make and model of vehicles in new images never used for training. Vehicle detection, Region of Interest (ROI) extraction, and feature extraction are common to both the training and testing tasks. The global feature representation and classification components differ between the training process and the testing process. The global feature representation module may generate a model for encoding the images' features, depending on the applied technique. Similarly, the classifier module produces a model as a result of the training process, which is then used by the testing module to predict the outcome for newly examined images. Hence, the arrows from the training process to the testing process represent the use in the testing process of models created during the training process. The proposed system works without the global feature representation component. The extracted features are fed directly into the classifier. Omitting the global feature representation component improves the processing speed of the VMMR system without degrading the recognition accuracy.
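The division into a shared front end, a training subsystem, and a classification subsystem can be sketched as follows. All names are hypothetical, and a nearest-centroid classifier stands in for RF or SVM purely to keep the example self-contained. Features go straight from extraction to the classifier, mirroring the omission of the global feature representation component.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Image = List[List[float]]
Features = List[float]
Box = Tuple[int, int, int, int]  # (top, left, bottom, right)


@dataclass
class FrontEnd:
    """Stages shared by training and testing: detection, ROI, features."""
    detect_vehicle: Callable[[Image], Box]
    extract_features: Callable[[Image], Features]

    def __call__(self, image: Image) -> Features:
        top, left, bottom, right = self.detect_vehicle(image)
        roi = [row[left:right] for row in image[top:bottom]]
        return self.extract_features(roi)


def train(front_end: FrontEnd, images: Sequence[Image],
          labels: Sequence[str]) -> Dict[str, Features]:
    """Training subsystem: returns a model (here, one centroid per class)."""
    sums: Dict[str, Features] = {}
    counts: Dict[str, int] = {}
    for image, label in zip(images, labels):
        features = front_end(image)
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def classify(front_end: FrontEnd, model: Dict[str, Features], image: Image) -> str:
    """Testing subsystem: applies the trained model to an unseen image."""
    features = front_end(image)

    def squared_distance(centroid: Features) -> float:
        return sum((a - b) ** 2 for a, b in zip(features, centroid))

    return min(model, key=lambda label: squared_distance(model[label]))


# Stand-in stages: treat the whole image as the ROI and flatten its pixels.
front_end = FrontEnd(
    detect_vehicle=lambda img: (0, 0, len(img), len(img[0])),
    extract_features=lambda roi: [v for row in roi for v in row],
)
model = train(front_end, [[[0.0, 0.0]], [[1.0, 1.0]]], ["class_a", "class_b"])
print(classify(front_end, model, [[0.9, 0.8]]))
```

The only object passed from training to testing is `model`, which corresponds to the arrows between the two subsystems in the architecture description.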
In Figure 5a,b, the vertical axis represents the RF-VMMR recognition rate in percent. The horizontal axis shows the number of decision trees used for RF training for GIST features in Figure 5a and the number of blocks used to construct HOG image features in Figure 5b. The recognition rates (vehicles correctly recognized) for the different RF configurations (number of decision trees) are shown using differently styled lines for HOG in Figure 5b. The confusion matrices for RF-VMMR are shown in Figure 6a,b.
In Figure 7a,b, the vertical axis represents the SVM-VMMR recognition rate in percent. The horizontal axis shows the value of the parameter C used for SVM training for GIST features in Figure 7a and the number of blocks used to construct HOG image features in Figure 7b. The recognition rate for each SVM configuration (value of C) is shown using differently styled lines for HOG in Figure 7b. The confusion matrices for SVM-VMMR are shown in Figure 8a,b.

Table 2. Computation time for feature extraction per 100 images.

Table 5. Comparison of our work with others in terms of recognition rate and processing speed.