Vehicle Counting Based on Vehicle Detection and Tracking from Aerial Videos

Vehicle counting from an unmanned aerial vehicle (UAV) is becoming a popular research topic in traffic monitoring. A camera mounted on a UAV can be regarded as a visual sensor for collecting aerial videos. Compared with traditional sensors, a UAV can be flexibly deployed to the areas that need to be monitored and can provide a wider field of view. In this paper, a novel framework for vehicle counting based on aerial videos is proposed. In our framework, the moving-object detector handles two situations: static background and moving background. For static background, a pixel-level video foreground detector that continuously updates its background model is used to detect vehicles. For moving background, image-registration is employed to estimate the camera motion, which allows the vehicles to be detected in a reference coordinate system. In addition, to cope with changes in the scale and shape of vehicles in images, we employ an online-learning tracker that updates the samples used for training. Finally, we design a multi-object management module that efficiently analyzes and validates the status of the tracked vehicles using multi-threading. Our method was tested on aerial videos of real highway scenes containing both fixed and moving backgrounds. The experimental results show that the proposed method achieves more than 90% and 85% vehicle counting accuracy on fixed-background and moving-background videos, respectively.


Introduction
With the rapid development of intelligent video analysis, traffic monitoring has become a key technique for collecting information about traffic conditions. Installing traditional sensors such as magnetometer detectors, loop detectors, ultrasonic sensors, and surveillance video cameras may damage the road surface [1][2][3][4]. Moreover, because many of these sensors need to be installed in urban areas, the installation cost is high. Among them, surveillance video cameras are commonly used in traffic monitoring [5][6][7] and provide a video stream for vehicle detection and counting. However, surveillance video cameras face many challenges, such as occlusion, shadows, and a limited view. To address these problems, Lin [8] resolved the occlusion problem with occlusion detection and queue detection. Wang [9] detected shadows based on shadow characteristics such as lower lightness and lack of texture. Douret [10] used a multi-camera method to cover large areas and avoid occlusion. In [11], two omnidirectional cameras are mounted on a vehicle, and binocular stereo matching is performed on the rectified images to obtain a dynamic panoramic surround map of the region around the vehicle. Further, many researchers have applied vehicle detection and tracking to vehicle counting. Srijongkon [12] proposed a vehicle counting system based on an ARM/FPGA processor, which uses adaptive background subtraction and shadow elimination to detect moving vehicles and then counts the vehicles on the video screen. Prommool [13] introduced a vehicle counting framework using motion estimation (a combination of block matching and optical flow). In [13], an area box is set at the intersection to determine whether a vehicle passes through. Swamy [14] presented a vehicle detection and counting system based on a color space model, which uses the color distortion and brightness distortion of the image to detect vehicles and then counts them using a pre-defined line.
Seenouvong [15] used a background subtraction method to detect foreground vehicles in a surveillance video sequence and calculated the centroids of objects in a virtual detection zone for counting. Although researchers have improved traditional traffic monitoring methods and applied vehicle detection and tracking to vehicle counting, a traditional surveillance camera still cannot monitor large areas, which is very restrictive for vehicle counting. Besides, the methods in [12][13][14][15] do not use a multi-object management system to confirm the uniqueness of vehicles, which makes them unreliable for counting vehicles in long sequences and affects the efficiency of vehicle counting. In contrast to traditional traffic monitoring sensors, a UAV can be flexibly deployed to the regions that need to be monitored. Moreover, the UAV is a cost-effective platform that can monitor a long continuous stretch of roadway or focus on a specific road segment. In addition, for large-area monitoring, the UAV provides a wider top-view perspective, which enables efficient data acquisition for intelligent video analysis.
In recent years, several methods for traffic monitoring from aerial video have been presented. Ke [16] developed an approach for vehicle speed detection that extracts interest points from a pair of frames and tracks them in aerial videos with the Kanade-Lucas optical flow algorithm. Shastry [17] proposed a video-registration technique for detecting vehicles that uses the KLT (Kanade-Lucas-Tomasi) feature tracker to automatically estimate traffic flow parameters from airborne videos. Cao [18] proposed a framework for UAV-based vehicle tracking using KLT features and a particle filter. The methods in [16][17][18] use the KLT feature tracker, which detects moving objects by extracting the optical flow of interest points and can be used with a moving background. The interest points efficiently capture the features of the region of interest, which reduces the amount of computation. However, because of the complexity of the scenes in aerial videos, some background points may be extracted as interest points, which introduces noise into the subsequent tracker. Pouzet [19] proposed a real-time image-registration method dedicated to small moving-object detection from a UAV. The techniques in [17,19] both rely on image-registration, which segments the moving vehicles by transforming the previous frame to the current frame. Image-registration allows images to be compared in a reference frame, so the scene can be analyzed in a reference coordinate system. Freis [20] described a background-subtraction-based vehicle-tracking algorithm for vehicle speed estimation using aerial images taken from a UAV. Chen [21] proposed a vehicle detection method for UAVs that integrates the Scale-Invariant Feature Transform (SIFT) and the Implicit Shape Model (ISM). Guvenc [22] presented a review of object detection and tracking from UAVs.
Shi [23] proposed a moving vehicle detection method for wide area motion imagery, which constructs a cascade of support vector machine classifiers to classify objects and can extract road context. Further, LaLonde [24] proposed a cluster network for small object detection in wide area motion imagery, which combines appearance and motion information. However, the methods in [23,24] focus on small object detection in wide area motion imagery, which is captured at very high altitude and is hard to acquire with a general-purpose UAV. For vehicle counting from a UAV, Wang [25] proposed a method that uses the block-sparse RPCA algorithm and low-rank representation. However, in that method the UAV works only in hovering mode and captures static-background images. In addition, without a multi-object tracking and management module, the method cannot determine the direction and uniqueness of each vehicle, which can easily lead to counting errors.
In this paper, a multi-vehicle detection and tracking framework based on a UAV is proposed, which can be used for vehicle counting and can handle both fixed and moving backgrounds. First, the UAV collects the image sequence and transmits it to the detector, which has two parts: one for static background and one for moving background. To confirm the unique identity of each vehicle in a long video sequence, all detected vehicles are tracked by the tracker. To manage the tracked vehicles efficiently and avoid tracking confusion, we design a multi-object management module that manages the tracked vehicles in a unified module and provides the status information of each tracked vehicle for subsequent intelligent analysis. In addition, to improve computational efficiency, we incorporate parallel processing into the multi-object management module. In summary, our method consists of four components: moving-vehicle detection, multi-vehicle tracking, the multi-vehicle management module, and vehicle counting.
The rest of this paper is organized as follows. Section 2 describes the architecture of the vehicle counting system from aerial videos. Section 3.1 focuses on vehicle detection. Sections 3.2 and 4 mainly discuss the tracking algorithm framework and multi-object management module. Section 5 mainly introduces the vehicle counting module. In Section 6, we present the experimental results. Finally, we give a brief conclusion in Section 7.

Architecture of the System
Our framework is based on a UAV platform. As shown in Figure 1, the video stream is captured by a camera mounted on the UAV. The detector handles two situations: static background and moving background. We determine whether the background is moving according to the motion mode of the UAV. For static background, a sample-based background subtraction algorithm is employed, which detects moving vehicles by modeling the background and updates the background model continuously. Updating the background model keeps its parameters suited to the current scene. For moving background, the camera platform moves with the UAV. In this case, image-registration is used to transform the camera coordinates of adjacent frames to a reference coordinate system. Thus, the camera movement is compensated between adjacent frames, so that we can detect vehicles in the reference frame. Images captured by a UAV are characterized by complex backgrounds and variable vehicle shapes, which lead to discontinuous detections and thus affect the accuracy of vehicle counting. To address this problem, an online-learning tracker that updates the samples used for training is used in our framework. Further, considering that a traditional tracker can track only one object, we design an efficient multi-object management module using multi-threading, which assigns the multi-object tracking tasks to parallel blocks and analyzes and validates the status of the tracked vehicles. Finally, the status of the tracked vehicles is used to count the vehicles.

Static Background
Vehicle detection is an essential step in vehicle counting. In this section, we discuss the case in which the UAV works in hovering mode. With a fixed background, we can extract moving vehicles by background modeling. ViBe [26] is employed for vehicle detection in our framework for two reasons. First, the ViBe foreground detection algorithm updates its background model, so the noise points caused by slight variations of brightness can be effectively suppressed. Second, ViBe first selects a certain area of the image for background modeling rather than modeling the entire image, which greatly reduces the computational load.
An overview of the ViBe algorithm is given in Figure 2. The first step of ViBe is to initialize the background. Each background pixel is modeled by a collection of $N$ background sample values, which are randomly selected from the pixel values of its neighbours. To classify the pixel $v(x)$, a difference $D$ between pixel values in the neighbourhood centered at the point $v(x)$ is defined. The value of $D$ for a gray image is defined in Equation (1), and $D$ for an RGB image is defined in Equation (2). In Equation (1), $v_i$ is a background sample value. In Equation (2), the center pixel values $v_r(x)$, $v_g(x)$, and $v_b(x)$ correspond to the three channels, and $v_{ri}$, $v_{gi}$, and $v_{bi}$ are the background sample values of the three channels. We use the gray-scale image as an example to explain the principle of the algorithm. Three parameters for pixel classification are defined: $D_t$ is the pixel-difference threshold, $S$ is the number of background samples whose difference $D$ from $v(x)$ is below $D_t$, and $S_t$ is the minimum value of $S$. If $S > S_t$, the point $v(x)$ is classified as background. To improve the detection performance on moving objects under background changes, a model-update method is employed. When a pixel is judged to be a background point, its background model is updated with probability $1/\phi$, and the model of a neighbouring point is also updated with probability $1/\phi$. Updating the neighbour's sample values takes advantage of the spatial propagation of pixel values, so the background model gradually diffuses outwards. In general, the update process consists of three steps: randomly selecting the sample to replace, randomly deciding whether to update the background model, and randomly deciding whether to update the neighbouring pixels.
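The classification rule and the random update for a gray-scale pixel can be sketched as follows. This is a minimal illustration and not the implementation used in our experiments: it assumes that $D$ is the absolute gray-level difference, and the function names, the default values of `d_t`, `s_t`, and `phi`, and the list-of-lists model layout are hypothetical choices made only for this example.

```python
import random

def is_background(pixel, samples, d_t=20, s_t=2):
    """Classify one gray-scale pixel against its background samples.

    Assumption: D is the absolute difference |v(x) - v_i|. S counts the
    samples with D below d_t, and the pixel is background when S > s_t.
    """
    s = sum(1 for v in samples if abs(pixel - v) < d_t)
    return s > s_t

def update_model(model, x, y, pixel, phi=16, rng=random):
    """Random update for a pixel that was classified as background.

    model[y][x] is the list of samples of pixel (x, y). Each of the two
    updates below happens independently with probability 1/phi.
    """
    h, w = len(model), len(model[0])
    if rng.randrange(phi) == 0:
        # Replace one randomly chosen sample of the pixel itself.
        samples = model[y][x]
        samples[rng.randrange(len(samples))] = pixel
    if rng.randrange(phi) == 0:
        # Propagate the value into one randomly chosen neighbour
        # (coordinates are clamped at the image border).
        ny = min(max(y + rng.choice((-1, 0, 1)), 0), h - 1)
        nx = min(max(x + rng.choice((-1, 0, 1)), 0), w - 1)
        neighbour = model[ny][nx]
        neighbour[rng.randrange(len(neighbour))] = pixel
```

With `phi = 1` both updates always happen, which is convenient for checking the logic; larger values of `phi` make the model adapt more slowly.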

Moving Background
In this section, we discuss the case in which the UAV works in moving mode. An overview is shown in Figure 3. SURF [27] feature points are extracted to describe the features of each frame, and a fast approximate nearest-neighbour search is used to match the feature points. We aim to find a transformation $W$ that warps the image $I_t$ to the image $I_{t+1}$. We assume that $W$ is an eight-parameter transformation, where $m_1, m_2, \ldots, m_8$ are the warping parameters, and that it maps a pixel $(x, y)$ of $I_t$ to the pixel $(x', y')$ of the warped image $I'$. To estimate $W$, we assume that the transformation between frames can be modeled by a homography and use the Random Sample Consensus (RANSAC) algorithm [28].
Figure 3. Moving background detector. SURF feature points are first extracted and used to match two frames. The RANSAC algorithm is employed to estimate the transformation between the two frames. After that, the camera coordinates of adjacent frames are transformed to a reference coordinate system. Then, the image difference method is used to extract the foreground. The final results are processed by a morphological method.
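To make the eight-parameter warp concrete, the sketch below applies a homography to a single pixel. It assumes the common convention in which $m_1, \ldots, m_8$ fill a 3 × 3 matrix row by row and the ninth entry is normalized to 1; this ordering and the function name `warp_point` are assumptions made for illustration, not details confirmed by the text.

```python
def warp_point(m, x, y):
    """Map pixel (x, y) to (x', y') with eight warping parameters.

    Assumed layout (row-major homography, last entry fixed to 1):
        [m1 m2 m3]
        [m4 m5 m6]
        [m7 m8  1]
    """
    m1, m2, m3, m4, m5, m6, m7, m8 = m
    denom = m7 * x + m8 * y + 1.0
    if abs(denom) < 1e-12:
        raise ValueError("point is mapped to infinity by this transformation")
    return ((m1 * x + m2 * y + m3) / denom,
            (m4 * x + m5 * y + m6) / denom)
```

Under this convention, setting $m_7 = m_8 = 0$ reduces the warp to an affine transformation, so pure camera translation corresponds to changing only $m_3$ and $m_6$.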
After estimating the warped image $I'$, we use the image difference method to extract the moving vehicles. Here, $\delta$ denotes the pixel difference value at each point $(x, y)$ between the warped image $I'$ and the reference image, and $\mu$ is the threshold on $\delta$. If $\delta > \mu$, the point $(x, y)$ is classified as a foreground point. To partially suppress the ghost problem, we apply morphological post-processing to the foreground objects: they are dilated and then eroded to suppress noise around the objects.
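The thresholding and the dilate-then-erode post-processing can be sketched on small gray-scale images stored as lists of rows. The sketch assumes that $\delta$ is the absolute difference between corresponding pixels and that a 3 × 3 square structuring element is used; both are illustrative assumptions, as are all function names.

```python
def difference_mask(warped, reference, mu):
    """Mark a pixel as foreground (1) when |warped - reference| > mu."""
    return [[1 if abs(a - b) > mu else 0 for a, b in zip(row_w, row_r)]
            for row_w, row_r in zip(warped, reference)]

def _morph(mask, combine):
    """Apply `combine` (max or min) over each 3x3 window of a binary mask.

    Windows are clipped at the image border, so pixels outside the image
    are simply ignored.
    """
    h, w = len(mask), len(mask[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [mask[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(combine(window))
        out.append(row)
    return out

def dilate(mask):
    return _morph(mask, max)

def erode(mask):
    return _morph(mask, min)

def close_mask(mask):
    """Dilate and then erode, which fills small holes in foreground blobs."""
    return erode(dilate(mask))
```

Dilation followed by erosion is a morphological closing: it fills small holes inside a blob while leaving its outer extent roughly unchanged.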

Vehicle Tracking
Compared with other tracking-by-detection methods, such as TLD [29] and STRUCK [30], KCF [31] is much faster. Because of the complexity of ground conditions in UAV videos, changes in the scale and shape of vehicles affect the performance of the tracker. To address this issue, we employ this online-learning tracker, which treats tracking as a ridge regression problem and trains a detector during tracking. The detector is used to locate the object in the next frame. During training, the inputs are samples and labels, $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. The label $y_i$ is a number in $[0, 1]$ determined by the distance between the object center and the sample center: if the sample is close to the object, $y_i$ tends to 1; otherwise, it tends to 0. The goal of training is to find a function $f(z) = w^T z$, with $z = [z_1, \ldots, z_n]$, that minimizes the squared error over the samples plus a regularization term weighted by $\lambda$, the regularization parameter that controls over-fitting.
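For reference, the regularized least-squares objective described above is commonly written as shown below, using the symbols of the text. The closed-form solution on the right is the standard result for linear ridge regression, where $X$ is assumed to be the matrix whose rows are the samples $x_i$ and $y$ the vector of labels; it is given only as background and is not necessarily the form evaluated by the tracker.

```latex
\min_{w} \sum_{i=1}^{n} \left( f(x_i) - y_i \right)^2 + \lambda \lVert w \rVert^2,
\qquad
w = \left( X^{T} X + \lambda I \right)^{-1} X^{T} y
```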
The KCF tracking process can be divided into the following steps. First, for frame $t$, a classifier is trained using samples selected near the prior position $P_t$; the classifier calculates the response of a small window sample. Then, in frame $t+1$, samples are obtained near the previous position $P_t$, and the response of each sample is evaluated by the classifier trained in frame $t$. The sample with the strongest response gives the predicted position $P_{t+1}$. As shown in Figure 4, in frame $t$, the red dashed box is the initial tracking box, which is expanded by a factor of 2.5 to form the prediction box (blue). The black boxes around the object are sample boxes obtained by cyclically shifting the blue box. We use these sample boxes to train a classifier. In frame $t+1$, we first sample in the predicted area, that is, the blue solid-line box. Then, we use the classifier to calculate the responses of these boxes. The No. 1 box yields the strongest response, so we can predict the displacement of the object. There are two options for extracting the features of the object: the gray feature and the HOG feature [32]. Here, we use the HOG feature.
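The idea of scoring cyclically shifted samples and taking the strongest response can be illustrated with a one-dimensional toy example. This is only a sketch of the sampling principle: it uses a plain linear filter on raw values and an explicit loop over shifts, whereas KCF evaluates all shifts at once in the Fourier domain with kernelized features. The function name and the data are hypothetical.

```python
def strongest_shift(weights, window):
    """Score every cyclic left-shift of `window` with a linear filter.

    Returns (best_shift, responses), where responses[k] is the dot product
    of `weights` with `window` shifted left by k positions. The best shift
    is the estimated displacement of the object to the right.
    """
    window = list(window)
    n = len(window)
    responses = [
        sum(w * v for w, v in zip(weights, window[k:] + window[:k]))
        for k in range(n)
    ]
    best = max(range(n), key=responses.__getitem__)
    return best, responses
```

For example, if the filter responds to a peak at index 1 and the peak in the new window sits at index 3, the strongest response is obtained for a shift of 2, which is the displacement of the object.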

Multi-Vehicle Management Module
Considering that the original KCF tracker can track only one object, we design a multi-vehicle management module using multi-threading, which assigns the multi-object tracking tasks to parallel blocks and efficiently analyzes and validates the status of the tracked vehicles. We assume that the objects detected by the detector are $O_1, O_2, \ldots, O_n$. As shown in Figure 5, the detection results are first all given to the tracker for initialization. We denote the initialized objects as $O^i_1, O^i_2, \ldots, O^i_n$. After that, the vehicles $O_1, O_2, \ldots, O_n$ detected in each frame are passed to the new-object module, which determines whether each one is new. We describe the new-object module with $n = 2$. As shown in Figure 6, in frame $t$, we detect two new blobs $O_{1d}$ and $O_{2d}$, represented by green ellipses. In frame $t+1$, two yellow ellipses represent the two blobs $O_{1t}$ and $O_{2t}$ that have been tracked. In frame $t+2$, by analyzing the overlap between the detection boxes and the tracking boxes, the new-object module identifies a new blob $O_{new}$. In our experiments, $\gamma$ denotes the overlap ratio between a tracking box and a detection box. If $\gamma < 0.1$, the new blob is added to the tracker. We denote the final tracked objects as $O^t_1, O^t_2, \ldots, O^t_m$. Handling multiple objects can be time-consuming. To address this problem, we design a multi-object processing mode that recycles threads. Each object is automatically assigned to a separate thread, and the system allocates separate memory space to each object. If a target disappears, the algorithm automatically reclaims its thread and the corresponding memory space, which are then made available to subsequent new objects. In this way, threads are allocated and reclaimed in parallel, so multiple objects can be handled efficiently.
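The new-object test can be sketched as follows. The text does not state how the overlap ratio $\gamma$ is computed, so this example assumes intersection over union of axis-aligned boxes given as `(x, y, w, h)`; that definition, the function names, and the box format are assumptions made for illustration.

```python
def overlap_ratio(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0  # the boxes do not intersect
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)

def find_new_objects(detections, tracked, gamma_max=0.1):
    """Return the detections whose overlap with every tracked box is below gamma_max."""
    return [d for d in detections
            if all(overlap_ratio(d, t) < gamma_max for t in tracked)]
```

When no object is tracked yet, every detection is returned as new, which matches the initialization step in which all detection results are handed to the tracker.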
In Figure 7, the detection results $O_1, O_2, \ldots, O_n$ are processed by different trackers, each handled by a different thread. Every $S$ threads are grouped into one block, and the whole thread network is composed of multiple thread blocks. Applying multi-threading greatly reduces the processing time. In the multi-vehicle management module, tracker errors are analyzed according to the regression response so that they can be avoided in subsequent frames. We define the regression response as $F_{ti}$, where $i$ is the blob number and $t$ is the frame number. During tracking, the average regression response of blob $i$ can be expressed as $F_i = \frac{1}{N} \sum_{t=1}^{N} F_{ti}$, where $N$ is the current total number of frames. We define the confidence threshold of the blob as $\sigma$.
If $F_i > \sigma$, blob $i$ continues to be tracked. If $F_i \le \sigma$, blob $i$ is reinitialized by the detector. The final tracking results are used to count vehicles. The vehicle counting module is discussed in the next section.
Figure 6. New-object identification. Two blobs (green) are detected in frame $t$. In frame $t+1$, two blobs (yellow) are tracked. Then, a blob (green) is classified as a new blob in frame $t+2$ and is added to the tracker.
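The confidence check on the average regression response can be sketched as a small bookkeeping class that keeps a running sum per blob. The class name, its methods, and the threshold value used in any call are hypothetical; only the rule itself (continue tracking when the average exceeds $\sigma$, reinitialize otherwise) comes from the text.

```python
class BlobConfidence:
    """Running average of the regression responses of one tracked blob."""

    def __init__(self, sigma):
        self.sigma = sigma   # confidence threshold of the blob
        self.total = 0.0     # sum of the responses seen so far
        self.frames = 0      # number of frames seen so far (N)

    def add(self, response):
        """Record the regression response of the current frame."""
        self.total += response
        self.frames += 1

    @property
    def average(self):
        """Average response over all recorded frames (0.0 before any frame)."""
        return self.total / self.frames if self.frames else 0.0

    def needs_reinit(self):
        """True when the average response is at or below the threshold."""
        return self.frames > 0 and self.average <= self.sigma
```

Because the average is taken over all frames seen so far, a single weak response does not immediately trigger reinitialization; the blob is handed back to the detector only when its responses are low on average.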

Vehicle Counting Module
Commonly used vehicle counting methods are based on either regional marks or a virtual test line. The former counts the number of connected areas, while the latter sets up a virtual test line on the road. We define an area that is used to count vehicles: the area below the red line. On the highway, we divide the vehicles into two directions, as shown in Figure 8. Because our method is equipped with multi-vehicle tracking and management modules, there is no need to set up multiple lines in the area to determine whether vehicles are passing. The multi-vehicle management module stores the ID and direction of each vehicle, which can be used to count the vehicles in the region directly. For example, assume that the number of vehicles tracked at frame $t$ is $m$. If a vehicle with a different ID from these is tracked at frame $t+1$, the counter is increased by 1.
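Counting by vehicle ID can be sketched as follows. The example assumes that the management module reports, for each frame, a tuple `(vehicle_id, direction, in_region)` per tracked vehicle; this record layout, the class name, and the direction labels are hypothetical.

```python
class VehicleCounter:
    """Count each vehicle ID once, separately for each direction."""

    def __init__(self):
        self.seen_ids = set()  # IDs that have already been counted
        self.counts = {}       # direction -> number of counted vehicles

    def update(self, tracked):
        """Process one frame of (vehicle_id, direction, in_region) records."""
        for vehicle_id, direction, in_region in tracked:
            if in_region and vehicle_id not in self.seen_ids:
                self.seen_ids.add(vehicle_id)
                self.counts[direction] = self.counts.get(direction, 0) + 1
        return dict(self.counts)
```

Because a vehicle is counted only the first time its ID appears inside the counting area, it is not counted again in later frames, no matter how long it stays in view.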
In summary, the proposed vehicle counting method is based on the multi-object management module, which brings the detectors and trackers together in a unified framework. Each tracker tracks an independent object with no interference between objects, which ensures that the status information of each object is not confused. When the result of a tracker is unreliable, the detector reinitializes the corresponding tracker. To process multiple trackers, we employ multi-threading, which greatly reduces the processing time.
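The structure of dispatching one tracker per thread and reusing threads can be sketched with a thread pool. This is an illustration of the scheduling pattern only, written in Python rather than the C++ of our implementation: each tracker is assumed to be a callable that takes a frame and returns `(box, response)`, and the pool size is arbitrary. In standard CPython builds, threads do not speed up CPU-bound pure-Python code, so the sketch shows the organization rather than the speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

def update_trackers(trackers, frame, max_workers=4):
    """Run every tracker on the same frame using a pool of worker threads.

    trackers: dict mapping a vehicle ID to a callable(frame) that returns
    (box, response). Worker threads are reused across trackers, so a thread
    freed by one tracker serves the next one.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {tid: pool.submit(track, frame)
                   for tid, track in trackers.items()}
        return {tid: future.result() for tid, future in futures.items()}
```

Removing a tracker from the dictionary when its target disappears frees its slot, and a newly detected vehicle can be added under a new ID before the next frame is processed.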

Evaluation
In this section, we provide the results of a real-world evaluation of our method. The method was implemented in C++ with OpenCV. We tested our algorithm on a system with an Intel Core i5-4590 3.30 GHz CPU, 8 GB of memory, and the Windows 10 64-bit operating system.

Dataset
In our experiments, we used a UAV to record videos of real freeway scenes at a resolution of 960 × 540. The data were divided into two groups: one captured at a height of 50 m and the other at a height of 100 m. Two flight modes were used in our experiments: static hovering and linear horizontal flight. Table 1 shows the details of the test videos.

Estimation Results and Performance
For static background, the moving vehicles were detected in each frame using the ViBe algorithm. The parameter settings used in our experiments are listed in Table 2. The first parameter is the number of samples $N$, which is related to the resolution of the image and the average size of the vehicles. Because $N$ affects the background model and its sensitivity, if $N$ is too small, many background points will be mistakenly detected as foreground points, and some noise will be detected as vehicles. On the other hand, if $N$ is too large, the processing speed will be reduced. The minimum match value $S_t$ and the pixel-difference threshold $D_t$ are also related to the model and affect its sensitivity. The last parameter, the update factor $\phi$, determines the updating speed of the background: the larger $\phi$ is, the more slowly the background is updated. An example showing the influence of these parameters is presented in Figure 9. Comparing Figure 9a,b, we note that the smaller value of $N$ resulted in many noise points in the background. Moreover, the larger value of $S_t$ also led to many noise points in the background, as shown in Figure 9c,d. Clearly, the detector parameters affect the detection results. Further, we tested the accuracy of vehicle counting on TEST_VIDEO_1 with different parameter settings. Figure 10 shows the effect of the parameter settings on accuracy: the highest precision is achieved when $N$ is 50 and $S_t$ is 2. Hence, setting proper parameter values is important for accuracy. Figure 11 shows that, after morphological processing, the segmentation results are more complete. For moving background, $H$ denotes the threshold on the response of the determinant of the Hessian matrix, and $D_{min}$ is the distance threshold for matching points.
To analyze the effect of the parameter settings on accuracy, we tested the vehicle counting accuracy on TEST_VIDEO_5 with different parameter settings. Figure 12 shows the effect of $\mu$ and $D_{min}$ on accuracy. Because $\mu$ is the threshold for image segmentation, it strongly affects the segmentation results, which in turn affects the accuracy of vehicle counting. In Figure 12b, the value of $D_{min}$ greatly affects the counting precision, especially when $D_{min}$ is too high. This is because $D_{min}$ controls the search range of matching points, which directly affects the accuracy of image registration. An example of the matching result is shown in Figure 13. In Figure 14, the warped image is overlaid on the reference image. To detect the foreground, we calculated the difference between the warped image and the reference image. The results of vehicle detection are shown in Figure 15. During tracking, the confidence threshold of the blob $\sigma$, the area-expansion factor (padding), and the cell size of the HOG feature need to be set; their values are shown in Table 2. To test the performance of the tracker on multiple objects, we recorded the processing speed when tracking 1-50 objects simultaneously. In Figure 16, the parallel tracker shows a clear advantage over traditional sequential processing in terms of processing speed.
A summary of the estimation results and performance is presented in Table 3. We used eight videos captured at a height of 50 m and eight videos captured at a height of 100 m to test our method. The test videos are described in Section 6.1. As shown in Figure 17, we selected six frames from the test videos, collected at heights of 50 m and 100 m, to show the final results. The test videos were manually examined frame by frame to obtain the ground-truth vehicle counts. The accuracy rate $\varepsilon$ is defined as
$$\varepsilon = \frac{N_{\mathrm{estimated}}}{N_{\mathrm{truth}}} \times 100\%, \qquad (8)$$
where $N_{\mathrm{estimated}}$ and $N_{\mathrm{truth}}$ denote the estimated value and the ground truth, respectively. In Table 3, the average error rate for the height of 50 m is lower than that for the height of 100 m, because at 100 m some small objects are regarded as background by the detector. The accuracy for static background is higher than that for moving background, which indicates that errors in camera motion estimation affect vehicle detection and the final results. From the analyses above, we conclude that our method works well on both fixed-background and moving-background aerial videos, achieving more than 90% and 85% vehicle counting accuracy, respectively.
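As a small worked illustration of Equation (8), the helper below computes the accuracy rate from an estimated count and a ground-truth count. The function name and the counts used in any call are hypothetical and are not taken from Table 3.

```python
def counting_accuracy(n_estimated, n_truth):
    """Accuracy rate of Equation (8), in percent."""
    if n_truth <= 0:
        raise ValueError("the ground-truth count must be positive")
    return 100.0 * n_estimated / n_truth
```

For instance, an estimate of 47 vehicles against a ground truth of 50 gives an accuracy rate of 94%.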

Conclusions
In this paper, an efficient vehicle counting framework based on vehicle detection and tracking from aerial videos is proposed. Our method handles two situations: static background and moving background. For static background, we employ a foreground detector that copes with slight variations of the real scene by updating its model. For moving background, image-registration is used to estimate the camera motion, which allows vehicles to be detected in a reference frame. In addition, to address changes in the shape and scale of vehicles in images, an online-learning tracking method that updates the samples used for training is employed in our framework. In particular, we design a multi-object management module that connects the detector and the tracker efficiently using multi-threading and intelligently analyzes the status of the tracked vehicles. The experimental results on 16 aerial videos show that the proposed method yields more than 90% and 85% accuracy on fixed-background and moving-background videos, respectively.