Vehicle Detection with Occlusion Handling, Tracking, and OC-SVM Classification: A High Performance Vision-Based System

This paper presents a high-performance vision-based system with a single static camera for traffic surveillance, covering moving-vehicle detection with occlusion handling, tracking, counting, and One-Class Support Vector Machine (OC-SVM) classification. In this approach, moving objects are first segmented from the background using the adaptive Gaussian Mixture Model (GMM). After that, several geometric features are extracted, such as vehicle area, height, width, centroid, and bounding box. Since occlusion is present, an algorithm was implemented to reduce its effects. Tracking is performed with an adaptive Kalman filter. Finally, the selected geometric features, namely the estimated area, height, and width, are used by different classifiers to sort vehicles into three classes: small, midsize, and large. Extensive experimental results on eight real traffic videos with more than 4000 ground-truth vehicles have shown that the improved system can run in real time under an occlusion index of 0.312, and can classify vehicles with a global detection rate (recall), precision, and F-measure of up to 98.190%, and an F-measure of up to 99.051% for midsize vehicles.


Introduction
The main goal of Intelligent Transportation Systems (ITS) for an Internet of Things (IoT) Smart City is to improve safety, efficiency, and coordination in transport infrastructure and vehicles by applying information and communication technologies. To this end, it is necessary to have systems capable of collecting road information and monitoring traffic.
Video cameras are a good choice for these tasks, because they are non-intrusive, easy to install, and of moderate cost. In addition, advances in analytical techniques for processing video data, together with increased computing power, may now provide added value to cameras by automatically extracting relevant traffic information, such as volume, density, and vehicle velocity.
According to the type of sensors (active or passive) and their location, different approaches for detecting and classifying vehicles have been developed, such as on-road cameras [1–4], rear- and forward-looking onboard cameras [5], low-altitude airborne vision platforms [6,7], and non-camera road sensors [8–10].
• Only reference [20] uses a greater number of Ground Truth (GT) points in the detection than we do, but it used only 3326 for the classification. Therefore, our work shows the greatest number of GT points for classification.

• The detection rate (DR), or recall, of 100% reported in [37] was achieved in a restricted scenario with only nine GT vehicles in 1000 frames, so it is not a valid comparison.

• Most papers do not provide information about videos that can be downloaded and tested; or the videos are too short, or do not allow easy replication.
• Background models are addressed in the following highly cited articles [34,40–42], but all are based on the assumption that background pixel values show higher frequencies and less variance than any foreground pixel. However, occlusion is not handled in these papers.
• Background-foreground algorithms transform input videos or photos, with or without occlusion handling, into an output space that is used for the classification stage.

• The output space delivered by the detection stage is the set of points or vectors modelling the moving vehicles.
• It is important to keep a low-dimensional output space for the detection algorithms and/or to use low-computational-complexity features to improve the performance of these real-time systems.
• In [36] the occlusion is classified visually into partial and full, and convex regions were employed, reporting an improvement in the detection. However, a metric for the occlusion was not presented.
• In [39] the occlusion-handling algorithm is based on SVM, using 11 videos for training and another three for the detection of occlusion. Although this technique is novel, it uses images as elements of the input space for the SVM classifier. Therefore, it has a greater computational complexity than techniques that use elements of lower complexity than images.
• All occlusion-management algorithms should be tested with long-duration, high-frame-rate videos; 135-s videos and frame rates of 8 fps are relatively low.
• Vehicle ROI extraction based on GMM to reduce computational complexity is achieved in some works, such as [43].
• In our work, assumptions such as (1) processing in the pixel domain, (2) tracking and decision at the frame level, (3) the use of low-computational-complexity features, and (4) processing of pixels only in certain regions with high variability are kept to reduce the computational complexity, because these assumptions are crucial for a necessary future parallelization of these algorithms.

• Our work has the largest number of different scenarios for detection and the largest number of frames. In addition, traffic load and other metrics are given.
In the literature, many features have been selected and extracted [7,9,10,32,44–46], such as wavelength, mean, variance, peak, valley, acreage, acoustic signals, Histogram of Oriented Gradients (HOG) features, vehicle length, Grey-Level Co-occurrence Matrix features, low-level features, area, width, height, centroid, and bounding box. In the classification stage, these features are employed to classify the vehicles into several classes; the most used are small, medium, and large. Since 2006, SVM has been used for vehicle classification with other input spaces and in different scenarios, such as static images [47], vehicles on road ramps [10], visual surveillance from low-altitude airborne platforms [7], on-road cameras [32], static side-road cameras [48], and laser intensity images without vehicle occlusion [46]. In this work, we focus on traffic surveillance with only a vision camera as the sensor; the scenarios are multilane roads with a relatively high traffic load, under different weather conditions and a variable occlusion index (see [49]). Table 2 shows important aspects of the related works in vehicle classification, including our results, where TPR is the True Positive Rate or recall, TNR is the True Negative Rate, and FNR is the False Negative Rate. From Table 2 and the literature mentioned here, it can be seen that:
• Several systems used other sensors in addition to the video camera, so different input spaces were created. Consequently, the use of a single static camera helps to maintain a low-cost hardware system, and we have demonstrated that it is possible to have a high-performance system.

• The test scenarios used in this work are richer than those presented in related papers.

• For traffic monitoring in a Smart City IoT with a static camera located on the roadside, our system showed the highest performance, and we calculated more performance metrics.
Motivation: For an IoT Smart City, and particularly for video-based traffic surveillance, the motivation is to provide a very high-performance vision-based system that improves the detection rate of moving vehicles through geometric features and occlusion-handling algorithms; to measure the occlusion with a metric called here the VOI (Vehicle Occlusion Index); and to use novel classifiers.


The Proposed System
In this paper, we present a system to detect, track, and classify vehicles from video sequences, with a higher performance than related methods in the literature. Figure 1 shows the block diagram of the system. In the training, the models for each class of vehicles are generated, for this, a training video is used. With the models, the classification is performed using OC-SVM.

System Initialization
The tasks related to the system initialization (see Figure 2) are the following:
• Manual selection of the Region of Interest (ROI), which is the set of all pixels where moving objects or vehicles can be detected, tracked, and classified. This concept helps to reduce the whole processing time.
• Manual setting of the lane-dividing lines, detection line, and classification line.


Vehicle Detection
Different techniques can be employed for vehicle detection, e.g., pixel-domain or photo-domain techniques. Vehicle models are built from different sets of features, which can be geometric, based on secondary sensors, or derived by certain mathematical transformations. We work in the pixel domain because we observed that several algorithms achieve a high performance there, and because it is useful for a necessary future parallelization of the algorithms.
Although background modelling is not a target of this work, having a reliable background model is a very important issue for the detection of moving objects such as vehicles. This problem has been addressed and modelled by different authors. Stauffer and Grimson [40] developed the adaptive GMM, while Power and Schoonees [50] revealed important practical details of this model. Mandellos, Keramitsoglou and Kiranoudis [41], and Huang [34] developed background models. Nevertheless, all of them rely on the assumption that background pixel values show higher frequencies and less variance than any foreground pixels. The algorithm in [41] behaves for the background like a GMM model, improving the foreground only when working in the Luv color space, which means that its computational complexity is three times that obtained in gray scale. And, as the Huang algorithm does not show a high performance, we selected the Stauffer-Grimson algorithm.
To select a background-foreground algorithm, we assume: (1) processing in the pixel domain, (2) tracking and decision at the frame level, and (3) the use of techniques that reduce the computational complexity, e.g., low-complexity features and processing of pixels only in certain regions with high variability. These issues are crucial for a necessary future parallelization of detection algorithms.
Let V(τ) be a video of duration τ containing M ground-truth vehicles. It can be considered as a sequence of K images or frames indexed by k = 1, 2, ..., K. Each frame at time k can be seen as a matrix I_k of size (m × n), where each element is a pixel value x_k(i, j); for the gray-space, x_k(i, j) ∈ R^1_G with R^1_G ⊂ R^1, and similarly for a 3D color-space. In this work, we use only the grayscale, and then the image at frame k is expressed as:

I_k = {x_k(i, j) | x_k(i, j) ∈ R^1_G}

and the background as BG_k ⊂ I_k, which satisfies some mathematical background criteria. Based on the aforementioned assumptions, the adaptive GMM [40] was selected to segment the vehicles from the background mask. Each pixel in the image is modeled through a mixture of Z Gaussian distributions. The probability that a certain pixel has a value x_k at time k can be written as:

P(x_k) = Σ_{z=1}^{Z} ω_{z,k} · η(x_k, μ_{z,k}, Σ_{z,k}),

where ω_{z,k} is an estimate of the weight of the z-th Gaussian in the mixture at time k and η is an n-dimensional Gaussian probability density function, with mean value μ and covariance matrix Σ:

η(x, μ, Σ) = (2π)^{−n/2} |Σ|^{−1/2} exp(−(1/2)(x − μ)^T Σ^{−1}(x − μ)).

Each pixel value x_k(i, j) at position (i, j) and frame k that does not match the background BG_k is used to construct the foreground B_k:

B_k = {x_k(i, j) | x_k(i, j) ∉ BG_k}.

After that, a connected-components analysis is performed to group those pixels that model possible vehicles embedded in the input video; these groups are called blobs in the literature:

blob = {x_k(i, j) | pixel (i, j) is connected to another pixel of the blob}, blob ⊂ I_k.

If a frame k contains L groups of possible vehicles, or blobs, blob_l^k, then:

B_k = {blob_l^k | l = 1, ..., L}.

Note that the variable l is used to index a possible vehicle and the index k its temporal behavior or frame. Then, for the video V(τ), l = {1, 2, ..., N}, where N = M in the ideal case. Any blob is denoted by blob, a specific blob indexed by l is denoted as blob_l, and its temporal instances as blob_l^k.
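As an illustration of the segmentation step, the following sketch implements a simplified, single-Gaussian-per-pixel variant of the adaptive background model (the full adaptive GMM of [40] keeps Z weighted Gaussians per pixel). The class name and parameter values (`alpha`, `match_sigmas`, `init_var`) are illustrative assumptions, not the paper's settings:

```python
import numpy as np

class AdaptiveBackground:
    """Simplified per-pixel adaptive Gaussian background model.

    A sketch of the idea behind the adaptive GMM, reduced to a single
    Gaussian per pixel for clarity; parameter values are illustrative.
    """

    def __init__(self, shape, alpha=0.05, match_sigmas=2.5, init_var=15.0**2):
        self.mu = np.zeros(shape, dtype=np.float64)   # per-pixel mean
        self.var = np.full(shape, init_var)           # per-pixel variance
        self.alpha = alpha                            # learning rate
        self.match_sigmas = match_sigmas              # match threshold (std devs)
        self.initialized = False

    def apply(self, frame):
        """Return a boolean foreground mask B_k and update the model."""
        frame = frame.astype(np.float64)
        if not self.initialized:
            self.mu = frame.copy()
            self.initialized = True
        # A pixel matches the background if it lies within match_sigmas
        # standard deviations of its learned mean.
        diff = frame - self.mu
        foreground = diff**2 > (self.match_sigmas**2) * self.var
        # Update only matched (background) pixels, as in the adaptive GMM.
        bg = ~foreground
        self.mu[bg] += self.alpha * diff[bg]
        self.var[bg] += self.alpha * (diff[bg]**2 - self.var[bg])
        return foreground
```

Pixels that do not match the learned model form the foreground mask B_k, which is then passed to the connected-components analysis.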


Feature Extraction
In our case, the blobs are extracted from the foreground mask, and binary morphological operations (erosion and dilation) are performed to reduce noise and enhance the geometry and shape of the objects. Next, blob analysis is used to extract geometric features such as area (the sum of the connected pixels, or spatial occupancy), height, width, and centroid of the bounding box; see Figure 3. Finally, if we select d features as explained in Section 3.6, each blob is mapped to a new point or vector x ∈ R^d_F, where R^d_F ⊂ R^d is a new space, before occlusion handling, where the vehicle models live. It is important to observe the following notation: any moving vehicle is referred to as x ∈ R^d_F, its temporal instances at time or frame k as x_k ∈ R^d_F, a specific vehicle indexed by l as x_l ∈ R^d_F, and its temporal instances as x_l^k ∈ R^d_F.
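The blob analysis described above can be sketched as follows; this is a minimal pure-Python labeling routine (8-connectivity), not the paper's implementation, and the dictionary keys are our own naming:

```python
import numpy as np
from collections import deque

def extract_blob_features(mask):
    """Label 8-connected foreground regions (blobs) in a binary mask and
    return, per blob, the geometric features used in the paper: area,
    bounding-box width/height, and centroid."""
    mask = np.asarray(mask, dtype=bool)
    visited = np.zeros_like(mask)
    features = []
    rows, cols = mask.shape
    for i in range(rows):
        for j in range(cols):
            if mask[i, j] and not visited[i, j]:
                # BFS over the 8-neighbourhood to collect one blob.
                queue, pixels = deque([(i, j)]), []
                visited[i, j] = True
                while queue:
                    r, c = queue.popleft()
                    pixels.append((r, c))
                    for dr in (-1, 0, 1):
                        for dc in (-1, 0, 1):
                            rr, cc = r + dr, c + dc
                            if 0 <= rr < rows and 0 <= cc < cols \
                               and mask[rr, cc] and not visited[rr, cc]:
                                visited[rr, cc] = True
                                queue.append((rr, cc))
                rs = [p[0] for p in pixels]
                cs = [p[1] for p in pixels]
                features.append({
                    "area": len(pixels),                  # spatial occupancy
                    "width": max(cs) - min(cs) + 1,       # bounding-box width
                    "height": max(rs) - min(rs) + 1,      # bounding-box height
                    "centroid": (sum(cs) / len(pixels),   # (x_c, y_c)
                                 sum(rs) / len(pixels)),
                })
    return features
```

A production system would use an optimized connected-components routine, but the extracted features are the same.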


Occlusion Handling
Due to camera position and height, occlusion occurs, and several errors are generated during the detection stage. The major task of any occlusion-handling algorithm in these scenarios is to minimize the effects of the occlusion caused by large vehicles, due to the high variance of their feature values. Therefore, we propose a simple algorithm to reduce these occlusion effects. This algorithm is based on the following assumptions:
1. The width of a vehicle cannot be greater than the width of one lane, except when it is a large vehicle that is completely inside the ROI (due to perspective effects), i.e.:

if (w_b > Th_1 · w_lane) and (ā < Th_m) → Occlusion (3)

2. The width of a vehicle that is before the detection line cannot be greater than the width of two lanes, even if it is a large vehicle, i.e.:

if (w_b > Th_2 · w_lane) and (blob is before D) → Occlusion (4)

where w_b is the vehicle width (bounding-box width), w_lane is the lane width, ā is the normalized area, D is the detection line, and Th_1, Th_2, and Th_m are thresholds with values 1.22, 2.27, and 0.12, respectively. The threshold values were selected using a training video with occluded vehicles; the values that increased the detection rate were chosen. If at least one case is fulfilled (Figure 4a,c), then we use the lane-dividing lines to separate vehicles traveling side by side that are detected as a single object. For each blob blob_l^k of B_k:
Step 1: Find L_j and L_{j+1} for c_{m,k} (Figure 5).
Step 2: Estimate the lane width at point c_{m,k} as w_lane = x_{L_{j+1}}(y_c) − x_{L_j}(y_c), where x_{L_j}(y_c) is the abscissa of the point on the j-th lane-dividing line with y_c as the ordinate (Figure 5).
Step 3: Compute the normalized area ā.
Step 4: Check whether there is occlusion using Equations (3) and (4). If at least one case is fulfilled, then draw the separating lane-dividing line through the blob, using the distances d(c_{m,k}, L_j) and d(c_{m,k}, L_{j+1}).
Step 5: If all blobs have been analyzed and at least one lane-dividing line has been drawn, then extract the features, update the space B'_k, and end the algorithm. Otherwise, go to Step 1.
The algorithm for occlusion handling assumes a static camera and a previous initialization of the system, i.e., the lane-dividing lines must be defined. If the camera changes its position, it is considered another scenario, and the initialization of the system is required again. The vehicles are detected in an area of approximately 5380 ft²; once an object is detected, the algorithm for handling occlusions begins to work.
Challenge: The challenge of any occlusion-handling algorithm in these scenarios is to minimize the effects of occlusion caused by large vehicles due to the high variance of their feature values, delivering a uniform space, which will be the input space for the classification stage.
At this point, we will have the new vehicle space B'_k.
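The two occlusion conditions can be sketched as a predicate. Note that the exact form of the comparisons is an assumption: we read Th_1 and Th_2 as multiples of the lane width w_lane, and Th_m as a bound on the normalized area ā:

```python
def check_occlusion(w_b, w_lane, a_norm, before_detection_line,
                    th1=1.22, th2=2.27, th_m=0.12):
    """Occlusion test following Equations (3) and (4) above.

    w_b: bounding-box width; w_lane: estimated lane width at the blob
    centroid; a_norm: normalized area ā; before_detection_line: whether
    the blob is still before the detection line D.
    """
    # Case 1: wider than ~1.22 lanes but with a small normalized area,
    # i.e., too wide to be a single non-large vehicle.
    if w_b > th1 * w_lane and a_norm < th_m:
        return True
    # Case 2: before the detection line and wider than ~2.27 lanes,
    # which even a large vehicle cannot be.
    if w_b > th2 * w_lane and before_detection_line:
        return True
    return False
```

When the predicate fires, the lane-dividing lines are used to split the blob, as in Steps 1-5 above.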


Vehicle Occlusion Index
Occlusion is an open issue in this area. Some authors classify it into total and partial, and some measurements based on the area are given. For vehicle traffic surveillance, under the assumption that the detection algorithm performs well, it is important to know how frequent the occlusion is and how well the occlusion algorithm performs its function. As occlusion occurs in short time intervals, the measurements should be made over the same intervals. For these purposes, we introduce here a Vehicle Occlusion Index (VOI).
The VOI index is defined as the ratio of the number of new vehicles detected using the occlusion algorithm to the total number of new vehicles detected during a time interval:

VOI_τ = (number of new vehicles detected by the occlusion algorithm) / (total number of new vehicles detected),

where τ is the time interval. VOI_τ = 0 indicates that no new vehicles were detected by the algorithm, or that occlusion was not present in the time interval, while VOI_τ = 1 indicates that all the new vehicles detected in the interval were detected by the occlusion algorithm (and were tracked and counted too). The VOI versus time is a measure of the frequency with which occlusion is present. Table 3 gives the average VOI index for the studied videos, while Section 5 discusses the results of the occlusion-handling algorithm and the VOI index.
Occlusion-handling algorithms and occlusion metrics should be studied taking into account the techniques or methods used (e.g., convex regions, SVM classifiers, and geometric feature spaces), computational complexity, and classic performance metrics. In addition, they should be tested with long-duration videos and high frame rates, and should be compared with each other.
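The VOI computation for one interval reduces to a guarded ratio; the convention for an empty interval below is our assumption:

```python
def vehicle_occlusion_index(new_by_occlusion_alg, total_new_detected):
    """VOI over a time interval τ: ratio of new vehicles detected via the
    occlusion-handling algorithm to all new vehicles detected."""
    if total_new_detected == 0:
        return 0.0  # no new vehicles in the interval; treated as no occlusion
    return new_by_occlusion_alg / total_new_detected
```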

Vehicle Tracking
The Kalman filter (KF) is an efficient and well-known recursive filter that estimates the internal state of a linear dynamic system from a series of Gaussian noisy measurements. In mathematical terms, a linear discrete-time dynamical system embodies the following pair of equations [51]:
(1) Process equation:

x_{k+1} = F x_k + ω_k,

where x is the state vector, F is the transition matrix, and ω is the process noise; the subscript k denotes the discrete time instant. The process noise is assumed to be additive, white, and Gaussian, with zero mean and covariance matrix defined by Q = E[ω_k ω_k^T], where the superscript T denotes matrix transposition.
(2) Measurement equation:

z_k = H x_k + v_k,

where z is the measurement vector, H is the measurement matrix, and v is the measurement noise, which is assumed to be additive, white, and Gaussian, with zero mean and covariance matrix defined by R = E[v_k v_k^T].
Since the time of the frame interval is very short, it is assumed that the moving object has constant velocity within a frame interval. The state in frame k can be represented by the vector x_k = [x_c,k, y_c,k, v_x,k, v_y,k]^T, where x_c,k and y_c,k are the centroid coordinates and v_x,k and v_y,k are the velocity components. The measurement vector of the system can be represented as z_k = [x_c,k, y_c,k]^T. For the whole video, frame by frame, the blobs blob_l^k, represented as vectors x_l^k ∈ B_k, are tracked by the corresponding Kalman filters, resulting in vehicle tracking sequences Ts(x) = {x_1, x_2, ..., x_k} as the output space, where x represents any moving vehicle and the x_i are its instances.
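The constant-velocity model above translates directly into matrices and a predict-update cycle; the noise levels `q` and `r` here are illustrative, not the paper's tuned values:

```python
import numpy as np

def make_cv_kalman(dt=1.0, q=1e-2, r=1.0):
    """Matrices of the constant-velocity tracking model:
    state x = [x_c, y_c, v_x, v_y]^T, measurement z = [x_c, y_c]^T."""
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)  # transition matrix
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)  # measurement matrix
    Q = q * np.eye(4)                          # process-noise covariance
    R = r * np.eye(2)                          # measurement-noise covariance
    return F, H, Q, R

def kalman_step(x, P, z, F, H, Q, R):
    """One predict-update cycle of the discrete Kalman filter."""
    # Predict
    x = F @ x
    P = F @ P @ F.T + Q
    # Update
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)             # Kalman gain
    x = x + K @ (z - H @ x)
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P
```

One such filter is attached to each tracked blob; its state sequence forms the tracking sequence Ts(x).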

Feature Selection and Environment for Classification
The detection stage delivers the whole space of tracked objects, i.e., detected vehicles or moving objects, to the classification stage. Thus, all object-tracking sequences, Ts(x), belong to the input space of the classification stage. As each sequence Ts(x) includes geometric and kinematic features and their temporal behaviors, it is necessary to decide where and/or when the instances are taken for classification. To each moving vehicle x there corresponds a temporal sequence Ts(x) = {x_1, x_2, ..., x_c, ..., x_k}, where x_c should be a well-defined instance of its class.
As these moving objects or vehicles are detected at different points of the ROI, the behaviors of the features are highly variable, and the most significant geometric feature, the area, is not sufficient for a good classification (see Section 4.3). Studying other geometric features, such as the width and height of the bounding box, we observed that these showed a lower variance than the area (spatial occupancy). In particular, these three features presented a very high variance for large vehicles, but a relatively low variance for midsize and small vehicles; see Figure 6.

As a class is a subspace of the input space, and inside each class there are several points, each with several instances, it is necessary to reduce these intra-class differences. Therefore, we propose for classification:

1. Instead of a 1D geometric feature space, the use of a 3D geometric feature space, R³ ⊂ R^d. Then, for the detected vehicles or blobs, the input points x ∈ R³, x = (Area, Width, Width/Height), are used.
2. Classification is performed on a specific line of the ROI, called here the classification line, to reduce intra-class differences of the space of tracking sequences Ts(x) (see Figure 7).
3. Reduction in the variation of the feature values of any input point by using the average of the feature values of the last three instances, detected at the k-th frame after the classification line, and projecting them onto the classification line, i.e., Proj(x).
Challenge: The challenge is to find and select significant and/or invariant features for a very high detection rate and precision under different weather conditions and for several scenarios.
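Point 3 of the list above (averaging the last three instances) can be sketched as follows; the projection onto the classification line depends on scene geometry and is omitted, so this helper is only a hypothetical illustration:

```python
def classification_point(tracking_sequence):
    """Build the classifier input from a tracking sequence Ts(x): average
    the 3-D feature vectors (Area, Width, Width/Height) of the last three
    instances detected after the classification line."""
    last = tracking_sequence[-3:]          # up to three most recent instances
    n = len(last)
    return tuple(sum(inst[i] for inst in last) / n for i in range(3))
```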


Vehicle Classification
Classification is carried out here based on input space and classifiers:
1. 1D feature input space and thresholds.
2. 3D feature input space and K-means.
3. 3D feature input space and SVM.
4. 3D feature input space and OC-SVM.
For case 1, once the estimated area has been computed, the vehicles are classified. The decision rule for classification is defined as: small if Area_e < Th_s; midsize if Th_s ≤ Area_e < Th_m; large if Area_e ≥ Th_m, where Th_s and Th_m are the thresholds for every class, with values of 0.12 and 1.2, respectively.
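A minimal sketch of the case-1 threshold rule, assuming the decision is made by comparing the estimated area against Th_s = 0.12 and Th_m = 1.2 (the exact boundary handling is an assumption, not stated in the paper):

```python
TH_S, TH_M = 0.12, 1.2  # class thresholds from the paper

def classify_by_area(area_est):
    """Case 1: 1-D threshold rule on the estimated area.
    Assumed form: small below Th_s, large at or above Th_m."""
    if area_est < TH_S:
        return "small"
    elif area_est < TH_M:
        return "midsize"
    return "large"
```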
For cases 2, 3, and 4, the vehicles are represented by vectors x ∈ R 3 , which will be classified through K-means, SVM, and OC-SVM. In the classification employing the OC-SVM algorithm, a model for each class was defined. OC-SVM allows considering different behaviors of the detected blobs belonging to the same class.
OC-SVM [52-55] maps the input data x_1, ..., x_N ∈ A into a high-dimensional space F (via a kernel k(x, y)) and finds the maximal-margin hyperplane that best separates the training data from the origin. To do this, the following quadratic program must be solved [52]:

min_{w, ξ, b} (1/2)‖w‖^2 + (1/(υN)) Σ_i ξ_i − b

subject to (w · ϕ(x_i)) ≥ b − ξ_i; ξ_i ≥ 0, υ ∈ (0, 1],

where w is the normal vector, ϕ is a map function A → F, b is the bias, ξ_i are nonzero slack variables, υ is the outlier control parameter, and k(x, y) = ⟨ϕ(x), ϕ(y)⟩. The program is solved through the kernel function and Lagrange multipliers α_i, and the solution returns a decision function of:

f(x) = sgn(Σ_i α_i k(x_i, x) − b),

where w = Σ_i α_i ϕ(x_i) and Σ_i α_i = 1. The kernel function used in this paper is the RBF, k(x, y) = e^{−η‖x−y‖^2}. Challenge: The challenge in the classification is to find mathematical classifiers of the hypothesis set that allow mapping every point of the input space to the corresponding classes of the output space with minimal error.
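The paper's classifiers were trained with LIBSVM in MATLAB; the following pure-Python sketch only illustrates the shape of the resulting decision function f(x) = sgn(Σ α_i k(x_i, x) − b) with the RBF kernel. The support vector, multipliers, bias, and η below are made-up values for illustration, not a trained model:

```python
import math

def rbf_kernel(x, y, eta):
    """RBF kernel k(x, y) = exp(-eta * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-eta * sq_dist)

def ocsvm_decision(x, support_vectors, alphas, b, eta):
    """f(x) = sgn(sum_i alpha_i * k(x_i, x) - b): +1 inside the class, -1 outside."""
    score = sum(a * rbf_kernel(sv, x, eta)
                for a, sv in zip(alphas, support_vectors))
    return 1 if score - b >= 0 else -1

# Hypothetical one-class model centered near the point (1.0, 2.0, 0.8)
svs = [(1.0, 2.0, 0.8)]
alphas = [1.0]        # multipliers sum to 1 in the nu-formulation
eta, b = 1.0, 0.5
```

With these values, a test point at the support vector scores k = 1 and is accepted (+1), while a far-away point scores nearly 0 and is rejected (-1). In the paper, one such model is trained per vehicle class, and a blob is assigned to the class whose model accepts it.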

Video Processing: Test Environment
In this work, the performance of the proposed system was tested on real traffic videos: three videos, V1, V2, and V3, recorded in Guadalajara, Mexico; two videos (V4, V5) obtained from the GRAM Road-Traffic Monitoring (GRAM-RTM) dataset [56,57] (video V4 corresponds to video M-30, and video V5 to video M-30-HD); and three videos (V6, V7, and V8) recorded on Britain's M6 motorway (see [58]).
The resolution of all videos was reduced to 420 × 240 pixels at 25 frames per second, and downsampling was performed to decrease the computation time. The camera's field of view was directly ahead of the vehicles. Videos V1, V2, and V3 were recorded with a cell phone at a height of 19.5 ft above the road; these videos contain double-trailer traffic, which is not present in the other videos, and show considerable camera vibration. All image frames were visually inspected to provide the ground truth (GT) dataset for evaluation purposes. Table 3 shows the number of frames in each video, the traffic load, and the place and weather conditions. In total, the test set comprises more than 61 min of video, 4111 ground truth vehicles, three places in different countries under different weather conditions, a traffic load of up to 1.32 vehicles/s with peaks from 2 to 4 vehicles/s (see Figure 8), and a vehicle occlusion index (VOI) from 0.00 to 0.312. The system was implemented in MATLAB and tested on an Intel Core i7 PC with a 3.40 GHz CPU and 16 GB RAM. The same metrics are used to characterize the system performance in the different stages, i.e.:

DR (Recall) = TP/(TP + FN), Precision = TP/(TP + FP), F-measure = 2 · Precision · Recall/(Precision + Recall),

where TP, FP, and FN have different interpretations depending on the stage where they are used. In the detection stage:
• GT in the video is the ground truth or input space;
• TP is the number of vehicles successfully detected;
• FP is the number of false vehicles detected as vehicles;
• FN is the number of vehicles not detected;
• GT' is the output space, or the set of all points detected as moving vehicles; hence GT' is greater than GT.
In the classification stage, for the classes S (small), M (midsize), and L (large): any point x ∈ FN(class i) will be classified into another class j, j ≠ i, and will then be counted as FP(class j); consequently FN(class i) = Σ_{j≠i} FP_i(class j), where FP_i(class j) are the elements of class i classified as belonging to class j. Each class i therefore has its associated metrics, e.g., DR(class i), Precision(class i), and F-measure(class i), which generally take different numerical values from one class to another (see Table 4, class S, M, or L of any video). For the classifier over all classes, however, Σ_i FN(class i) = Σ_i FP(class i) (28), and from Equations (23) and (24) it follows that the metrics of Equations (19)-(21), although with different physical meanings, are numerically equal to each other (see Table 5, for all classes of any video). The most significant metrics are the detection rate (recall) for the detection stage and the F-measure for the classification stage, because they work on the complete input space for these scenarios, i.e., the space including TP, FP, and FN (see Equations (19)-(21)). Table 4 shows the experimental results of the detection stage using the occlusion algorithm. Without the occlusion-handling algorithm, the detection stage has a detection rate of 83.793% (see Table A1); with it, the detection rate improves by 11.423 percentage points, to 95.216%. During the detection stage of these videos, a very strong correlation was found between the F-measure and the measured VOI index.
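The metrics of Equations (19)-(21) and the equality argument above can be sketched as follows; the counts used are illustrative, not results from the paper. When the summed FP equals the summed FN, as happens for the full multi-class classifier, recall, precision, and F-measure coincide:

```python
def detection_metrics(tp, fp, fn):
    """Recall (detection rate), precision, and F-measure, Eqs. (19)-(21)."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure

# Illustrative counts: every FN of one class reappears as an FP of another,
# so total FP == total FN and all three metrics agree (here, 0.95).
r, p, f = detection_metrics(tp=950, fp=50, fn=50)
```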

Vehicle Detection Results
False positives (FP) are produced by various conditions: camera locations with high vibration, the camera angle, certain morphological operations embedded in the detection algorithm, and the fact that the occlusion algorithm divides large blobs into two or more smaller ones, some of which are not vehicles, i.e., FP. In particular, videos V1, V2, and V3 were recorded in Mexico, where very large vehicles may transit, and the camera locations showed high vibration. Videos V4 and V5 were recorded in Madrid, Spain, showing a VOI index equal to 0 and the lowest FP numbers, while V6, V7, and V8, with a VOI index close to 0.2, showed results considered normal. These results show that it is necessary to improve the implemented occlusion-handling algorithm, using other methods such as the convexity of the blobs and techniques such as K-means and SVM.

Vehicle Classification Results
The LIBSVM library [59] was used to implement the OC-SVM and SVM classification with an RBF kernel. Additionally, for comparison purposes, the K-means algorithm was implemented. Figure 9 shows one example for every vehicle class. Table 5 shows the experimental results of the classification stage (with occlusion handling in the detection stage) using OC-SVM and the three selected features (area, width, relHW), where S, M, and L denote small, midsize, and large vehicles, respectively. Table 6 shows the experimental results of videos V6, V7, and V8 in the classification stage (with occlusion handling in the detection stage) using the thresholds, K-means, SVM, and OC-SVM with the same three features. Experimental results show that the performance of the classifiers increases when using the three geometric features, and that the SVM and OC-SVM classifiers perform better than K-means. Using a single geometric feature, e.g., the area, the recall and particularly the F-measure were 77.322%; using the 3D feature input space and OC-SVM, the F-measure reached 98.190%.

Test Environment
The test environment comprises eight videos with 4111 manually labelled ground truth vehicles and a duration of more than 61 min, three places in different countries under different weather conditions, a mean traffic load of up to 1.32 vehicles/s with traffic load peaks from 2 to 4 vehicles/s (see Figure 8), and a vehicle occlusion index of up to 0.312. The system performs well and in real time under all these scenarios.

Occlusion Handling Algorithm and VOI-Index
As multiple vehicles may be detected as one due to perspective effects or shadows, an algorithm to reduce this occlusion was implemented. This algorithm improves the detection rate from 83.793% to 95.216% (see details in Table A1). False positives are produced by various conditions: camera locations with high vibration, the camera angle, certain morphological operations embedded in the detection algorithm, and the fact that the occlusion algorithm divides large blobs into two or more smaller ones, some of which are not vehicles (see Section 4.2 for details about videos V1-V7). From Tables 3 and 4, we can conclude that a VOI index of 0 does not mean that the number of FN is 0; rather, it indicates that the algorithm for detecting moving vehicles should be improved.

Clustering Analysis
Clustering analysis, e.g., K-means, SVM, OC-SVM, was employed to classify the vehicles into three classes: small, midsize, and large. The use of these algorithms in the classification stage allows considering all variations in the geometric vehicle features observed in the training data.

SVM and OC-SVM
SVM and OC-SVM were the classifiers with the best performance; OC-SVM achieved a global recall and F-measure of up to 98.525%, and an F-measure of 99.211% for midsize vehicles in video V6. The authors consider that the performance differences between SVM and OC-SVM are due to the parameters selected. In this work, the values of the parameters C and η used to evaluate the SVM classifier were {1, 5, 36} and {0.5, 0.65, 0.95}, respectively; the parameter values for evaluating OC-SVM, i.e., η and υ, were {1, 10.5, 15} and {0.001, 0.01, 0.1}, respectively. The misclassification cases were due to unsolved occlusions in the detection stage, particularly where vehicles move bumper-to-bumper. In future work, we will consider improving detection with a more efficient occlusion algorithm and other methods for background formation.
Behavior under variations in the perspective view can be observed in videos V2 and V3: although the camera position changed by 20 ft, only the models generated from video V2 were used in the classification stage of both videos, indicating that the algorithm is robust to a certain lateral displacement of the camera. In the K-means algorithm, the value of K was set to 3. Due to the short length of the training data for small vehicles, the K-means centroids may be biased; thus, the mean of each geometric feature was computed beforehand, and this information was passed as the initialization to the K-means algorithm.
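The paper does not show its K-means implementation; as a hedged illustration of the initialization described above, here is a minimal Lloyd's-algorithm sketch in Python that accepts caller-supplied starting centroids (e.g., the precomputed per-class feature means) instead of a random start:

```python
def kmeans(points, centroids, iters=20):
    """Lloyd's algorithm with caller-supplied initial centroids.
    points: list of equal-length tuples; centroids: mutable list of tuples."""
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance
        clusters = [[] for _ in centroids]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: move each centroid to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(v) / len(cl) for v in zip(*cl))
    return centroids, clusters
```

Seeding the centroids at the per-class means, as the paper does, keeps the small-vehicle cluster from being absorbed into a neighboring one when its training data are scarce.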

3-D Geometric Feature Space
With the use of the Area, Width, and Width/Height ratio of the bounding box, the classification performance improved with respect to using only one feature, the area (see Table 6). The geometric features are extracted directly from the detected blobs; therefore, the computational cost is lower than that of other features proposed in the state of the art, such as the grey-level co-occurrence matrix, texture coarseness, or Histogram of Oriented Gradients.

Real Time Processing
The average time to process one image frame in our system is less than 30 ms, which shows that our approach can run in real time for videos at 25 fps, with an average traffic load of 1.32 vehicles per second and peaks of 4 vehicles per second. In general, the higher the traffic load, particularly with large vehicles, the higher the measured congestion and the vehicle occlusion index.
In this paper, a high-performance computer vision system is proposed for vehicle detection, tracking, and OC-SVM classification, which has the following advantages:

1. For the GMM-based detection stage, the system does not require sample training or camera calibration.
2. Except for the ROI, the lane-dividing lines, the detection line, and the classification line, it requires no other initialization.
3. A proposed simple algorithm reduces occlusions, particularly in those cases where vehicles move side by side.
4. The use of OC-SVM and a 3D geometric feature space for the classification stage.

Conclusions
A very high-performance vision system with a single static camera, suitable for an IoT Smart City, for front-and rear-view moving vehicle detection, tracking, counting, and classification was achieved, implemented, and tested. The number and quality of employed metrics outperforms those used in most comparable papers.
The vehicle occlusion index defined here is a measure of how frequent the occlusion is, and how well the occlusion-handling algorithm performs its function. Our results support that the lower the VOI-Index, the better the performance of the algorithms for detection and classification.
Experimental results showed that our system performs in real time with an average traffic flow of 1.32 vehicles per second and traffic load peaks from 2 to 4 vehicles/s on a three-lane road, with a mean processing time of about 75% of the interval between two consecutive frames. The best classifiers were the SVM family: OC-SVM with an RBF kernel successfully classified the vehicles with high performance, e.g., recall, precision, and F-measure of up to 98.190%, and up to 99.051% for the midsize class.
The high performance of this system is due to the use of a 3D geometric feature space with side-occlusion handling as the output space of the detection stage (and the input feature space for classification), the use of OC-SVM with an RBF kernel in the classification stage, and the fact that classification is performed on a specific line of the ROI to reduce intra-class differences of the input space.
Finally, an extensive test environment is available for researchers. It has eight videos with 4111 manually labelled ground truth vehicles and a duration of more than 61 min, three places in different countries and under different weather conditions, a mean traffic load of up to 1.32 vehicles/s with traffic load peaks from 2 to 4 vehicles/s (see Figure 8), and a vehicle occlusion index of up to 0.312.
Open issues remaining after this study include:

• Developing algorithms for background formation with different color spaces; background updating is crucial for the different stages of traffic surveillance.
• Developing algorithms for automatic detection of the ROI and the lane-dividing lines.
• Improving algorithms for occlusion caused by high traffic loads, particularly for large vehicles, to increase the detection rate, decrease the variance of the values of points belonging to the input space for tracking and classification, and characterize the occlusion by metrics.
• Due to the number of features associated with this problem and the variance of intra-class and inter-class feature values, the determination of the optimal number of classes for classification remains an open issue.

Appendix A
Links to the video processing files uploaded to YouTube.

Appendix B
Table A1 shows the comparison between the results with and without the occlusion algorithm in the detection stage for videos V6, V7, and V8. Table A2 shows the confusion matrices obtained in the classification stage of videos V6, V7, and V8; (a-d) are the confusion matrices of the threshold, K-means, SVM, and OC-SVM methods, respectively.