A Multi-Class Multi-Movement Vehicle Counting Framework for Traffic Analysis in Complex Areas Using CCTV Systems

Abstract: Traffic analysis using computer vision techniques is attracting increasing attention for the development of intelligent transportation systems, and counting traffic volume based on CCTV systems is one of its main applications. However, this task remains challenging, especially in complex areas that involve many vehicle movements. This study investigates how to improve video-based vehicle counting for traffic analysis. Specifically, we propose a comprehensive vehicle counting framework with multiple classes and movements. We first adopt state-of-the-art deep learning methods for vehicle detection and tracking. Then, an appropriate trajectory approach for monitoring the movements of vehicles using distinguished-region tracking is presented in order to improve counting performance. For the experiments, we collected and pre-processed CCTV data at a complex intersection to evaluate the proposed framework. The implementation shows promising results: our method achieves an accuracy of around 80% to 98% across different movements in a very complex scenario with only a single camera view.


Introduction
Traffic flow analysis is an important foundation for urban planning and for the management of Intelligent Transportation Systems (ITS). Recently, advanced technologies in ITS (e.g., connected vehicles, edge computing, and wireless sensor networks) have made huge volumes of traffic data available from a variety of sources to provide smart traffic control [1]. However, while the connected environment is still far from reality and deploying Wireless Sensor Networks (WSNs) involves high costs and transmission problems, analyzing traffic flow from low-cost video surveillance (CCTV) systems is a promising solution [2]. Specifically, by monitoring traffic flow from CCTV, we are able to evaluate and verify the performance of the system. Moreover, various applications can be built on vehicle detection and tracking using machine vision techniques, such as vehicle re-identification (ReID), vehicle classification, and anomaly detection [3]. For vehicle monitoring, a video-based system is able to track the different movements of vehicles with a monocular camera instead of deploying multiple sensors in each direction of the surveillance system (e.g., loop detectors). Consequently, video-based vehicle counting has become a key technique for traffic analysis in complex areas [4,5].
In this paper, we present a practical approach for traffic flow analysis based on data from CCTV systems by proposing a comprehensive vehicle counting framework. Analyzing traffic flow from CCTV systems using computer vision techniques has recently attracted much attention; however, this research field faces several challenges:
• Tracking moving vehicles is difficult because of the high similarity of vehicle features, heavy occlusion, large variation in viewing perspectives, and the low resolution of input videos [6].
• Determining more detailed traffic patterns, such as vehicle types and turning volumes, remains an open research issue, especially in scenarios that include multiple movements (e.g., intersections or roundabouts) [7].
• The scalability of monitoring vehicle movements is a critical problem for turning volume analysis; therefore, a common method that can be applied in various scenarios is required [8].
In order to address the aforementioned problems, the proposed vehicle counting framework follows a tracking-by-detection paradigm that adopts state-of-the-art methods. A distinguished-region approach is then proposed for tracking vehicles to improve counting performance. Specifically, instead of relying on long-range tracking of vehicles, we divide the considered scenario into distinguished sub-regions for vehicle trajectories. The contributions of this paper are summarized as follows:
• An effective vehicle tracking method that avoids ID-switch and occlusion problems, especially under heavy occlusion and different lighting and weather conditions. Specifically, state-of-the-art detection and tracking methods are integrated into our framework.
• A comprehensive vehicle counting framework with Multi-Class Multi-Movement (MCMM) counting for analyzing traffic flow in urban areas.
• We collect, pre-process, and establish CCTV data for a certain urban area in order to evaluate the proposed framework. Specifically, we focus on complex scenarios in which the intersection covers around 12 movements with a single camera angle, which makes monitoring vehicles difficult, as shown in Figure 1.
The remainder of this paper is structured as follows: The literature review of traffic analysis using DL is presented in Section 2, where recent object detection and tracking methods and vehicle counting systems are also reviewed. In Section 3, we propose a video-based MCMM vehicle counting framework for large-scale traffic analysis. The experimental results, evaluated on CCTV data that we collected and pre-processed at a complex area, are presented in Section 4. Conclusions are given in Section 5.

Traffic Analysis Using Deep Learning
The rapid growth of traffic data has become an emergent challenge in ITS, as traditional processing systems are not able to meet the data analytics requirements. Recently, DL has been introduced as a promising approach to deal with the various characteristics of traffic data (e.g., high nonlinearity, time variation, and randomness) [9]. Specifically, different DL models enable different data representations to be learned for different applications. In particular, Figure 2 depicts the applications of DL models for different fundamental tasks in ITS [10]. Four well-known DL models have been widely applied for various applications in ITS: Deep Neural Networks (DNN), Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), and Deep Q-Networks (DQN). With the recent successful development of CNN-based methods, people detection and tracking have achieved significant results [11]. Recently, vehicle detection and tracking have attracted more attention for the development of ITS. Figure 3 depicts the flowchart of vehicle monitoring using video data for a smart traffic control system [12]. Specifically, there are three main steps: (i) vehicle detection and tracking are executed to extract vehicle information (e.g., type, movement, and speed); (ii) traffic analysis tasks (e.g., counting, prediction, and anomaly detection) are then performed to understand the traffic condition at a certain time [13,14]; (iii) finally, dynamic traffic control algorithms based on the traffic condition are applied to optimize traffic flows [15,16].

Moving Object Detection and Tracking Methods
Detection and tracking of moving objects (e.g., people, vehicles, and birds) have been widely applied in many applications (e.g., action recognition, smart traffic control, and industrial inspection) and remain a major challenge in computer vision [17]. Currently, the standard approach for tracking moving objects in a video sequence follows the tracking-by-detection paradigm, in which the set of bounding boxes detected in each frame is the input of the tracking process, which performs data association to build object trajectories, as shown in Figure 4 [18]. Recently, the rapid development of DL models has achieved great success in object detection in terms of extracting features and classifying object types [19]. Technically, object detection methods fall into two categories: (i) single-stage methods perform detection directly over a dense sampling of possible locations, which achieves high detection speed (e.g., SSD [20], YOLO [21], and RetinaNet [22]); (ii) two-stage methods first use a region proposal network to generate regions of interest and then optimize the regression process on the region candidates. Consequently, compared with single-stage methods, this approach achieves higher accuracy but slower detection (e.g., Fast R-CNN [23], Mask R-CNN [24], and R-FCN [25]).
Object tracking is defined as the process of generating the path and trajectory of moving objects across subsequent frames. Depending on the target of the tracking process, there are two categories of tracking methods: Single Object Tracking (SOT) and Multiple Object Tracking (MOT) [26]. In the case of SOT, the tracking process does not rely on detection, since the method tracks a certain object from the beginning; two well-known methods for SOT are the Kalman Filter [27] and the Particle Filter [28]. On the other hand, MOT follows the tracking-by-detection paradigm, in which the tracking methods rely on the output of the detection process in each frame. Currently, there are two state-of-the-art methods for MOT: DeepSORT [29], an extension of the SORT algorithm [30], and TC [31], a method using semantic features (e.g., trajectory smoothness and temporal information) for data association in each single view.

Vehicle Counting System
Vehicle counting is one of the main applications of computer vision for traffic management [32]. Figure 5 depicts the general pipeline of a video-based vehicle counting system [5]. Accordingly, detection and tracking processes are executed to detect and monitor vehicles. Then, virtual lines are set for counting vehicles whenever the centroid of a vehicle passes a line. This concept has been widely applied to both people and vehicle counting [4,33,34]. Recently, many studies have proposed video-based vehicle counting frameworks based on this concept. For instance, Xiang et al. [35] presented a novel framework for vehicle counting using aerial video, in which objects are detected in two cases: a static background for detection and a moving background for estimating vehicle movement. For highway scenarios, Song et al. introduced a counting system that uses YOLOv3 to detect the type and location of vehicles; the ORB algorithm [36] was then adopted for the vehicle trajectories [37]. For analyzing traffic flow in complex areas (e.g., intersections) with different types of vehicles, the authors of [38] proposed a vehicle counting framework using three component processes: object detection, object tracking, and trajectory processing. In particular, the YOLOv3 algorithm was adopted for vehicle detection; then, a matching method based on the detection outputs was proposed for tracking vehicles; finally, an encoding-based trajectory counting algorithm was proposed for counting vehicles. In order to improve counting performance, Liu et al. [39] proposed an adaptive pattern based on virtual loop and detection line methods.
However, referring to previous works, several research issues need to be taken into account to improve video-based vehicle counting systems:
• Integrating an effective vehicle tracking method in order to deal with the ID-switch problem of vehicle tracking in the case of heavy occlusion and different lighting and weather conditions.
• Proposing a counting method that generates semantic regions to deal with the occlusion problem when monitoring and counting vehicles in complex areas that involve complicated directions (e.g., an intersection with 12 conflicting directions/movements).
In this regard, this study proposes a comprehensive framework with multi-class and multi-movement vehicle counting, in which we focus on short-term vehicle tracking based on semantic regions in order to improve the tracking process and, consequently, the counting accuracy. Additionally, we adopt DeepSORT [29], a tracking approach with proven effectiveness for people tracking, to deal with the varying lighting problem. Furthermore, since our method tracks and counts vehicles by taking the determined distinguished regions as input data, the proposed framework works well with different types of scenarios, such as intersections, roundabouts, and highways. More details of our proposed framework are described in the following section.

System Architecture
Let O represent the set of output results, where each result has the following format: ⟨Vdo_ID, Mov_ID, V_ID, Class_ID⟩, where Vdo_ID is the video numeric identifier and Mov_ID denotes the movement identification within video Vdo_ID; V_ID and Class_ID represent the identification and type of the vehicle, respectively. Figure 6 demonstrates the pipeline of our proposed framework. Specifically, we first follow the tracking-by-detection paradigm for monitoring vehicles; then, an appropriate distinguished-region approach that reduces long-range tracking is executed to improve counting performance.
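The output format above can be sketched as a small record type. This is a minimal illustration only; the field names and example values are assumptions, not taken from the original implementation.

```python
from dataclasses import dataclass

# Hypothetical record type matching the output format
# <Vdo_ID, Mov_ID, V_ID, Class_ID>; field names are illustrative.
@dataclass(frozen=True)
class CountRecord:
    vdo_id: int    # video numeric identifier
    mov_id: int    # movement identification within the video
    v_id: int      # vehicle (track) identifier
    class_id: str  # vehicle type, e.g. "car", "bus", "truck", "bike"

record = CountRecord(vdo_id=1, mov_id=5, v_id=42, class_id="car")
print(record)
```

Each counted vehicle thus contributes one such record, and aggregating records by (mov_id, class_id) yields the multi-class multi-movement volumes.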

Vehicle Detection
This process is an essential step for MOT in terms of extracting object information (i.e., location and class) [40]. In the proposed framework, we adopt YOLO trained on the MS-COCO dataset for the vehicle detection process, for the following reasons:
• The method belongs to the single-stage category, which performs the detection process much faster than two-stage methods.
• The latest version of YOLO (YOLOv3) achieves high detection accuracy for MOT by employing 53 convolutional layers, which works well with various dimensions in each frame [41].
• The MS-COCO dataset provides labels and segmentations for over 80 different classes of objects.
In this regard, we are able to detect and track different types of vehicles such as Car, Bus, Truck, and Bike [42].
Specifically, Figure 7 illustrates the vehicle detection process by applying YOLOv3 to the input video. The output of this process is a list of bounding boxes in each frame in the format ⟨class, x, y, w, h, confidence⟩, where class and confidence denote the type (e.g., car, bus, truck, or bike) and the score of the detected vehicle, respectively, and the parameters (x, y, w, h) indicate the position of the bounding box. To optimize detection performance, the outputs of the detection process are filtered as follows:
• The class of the detected object belongs to the vehicle types car, bus, truck, or bike.
• The confidence of the detected object is larger than a certain threshold.
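The filtering step above can be sketched as follows. Detections are assumed to be (class, x, y, w, h, confidence) tuples, and the threshold value of 0.5 is an illustrative assumption, not the paper's actual setting.

```python
# Minimal sketch of the post-detection filtering step.
VEHICLE_CLASSES = {"car", "bus", "truck", "bike"}
CONF_THRESHOLD = 0.5  # assumed threshold, for illustration only

def filter_detections(detections):
    """Keep only confident detections whose class is a vehicle type."""
    return [d for d in detections
            if d[0] in VEHICLE_CLASSES and d[5] > CONF_THRESHOLD]

dets = [("car", 100, 50, 40, 30, 0.91),
        ("person", 20, 10, 15, 40, 0.88),   # dropped: not a vehicle class
        ("truck", 300, 80, 60, 45, 0.30)]   # dropped: low confidence
print(filter_detections(dets))  # only the car detection remains
```

Filtering at this stage keeps non-vehicle objects and spurious low-confidence boxes out of the tracker, which reduces both false counts and unnecessary data-association work.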

Vehicle Tracking
In this study, since the main focus is to extract traffic flow information from independent scenarios, we adopt the DeepSORT method for the vehicle tracking process. Specifically, there are two main processes in DeepSORT: (i) the Hungarian algorithm is applied to associate vehicle appearances across frames; and (ii) the Kalman Filter algorithm is used to predict future positions and update the target state. Therefore, the output of this process for a target vehicle is formatted as ⟨x, y, w, h, ẋ, ẏ, ẇ, ḣ⟩, where (x, y, w, h) and (ẋ, ẏ, ẇ, ḣ) represent the current position and velocity of the target vehicle, respectively. Figure 8 illustrates the vehicle tracking process using DeepSORT. Furthermore, in order to obtain the feature extractor for vehicles, we adopted two popular vehicle datasets, VeRi [43] and CityFlow [31], for training the appearance features instead of using the original feature extractor of DeepSORT, which was trained on a people dataset (i.e., MARS [44]). Consequently, Figure 9 depicts the classification accuracy obtained by training the appearance descriptor using the deep cosine metric [45] on both aforementioned datasets.
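The Kalman prediction step over the eight-dimensional state above can be sketched with a constant-velocity model. This is a simplified illustration of the state layout only (no covariance update); the numeric values are made up.

```python
import numpy as np

# Sketch of the constant-velocity prediction over the DeepSORT-style state
# (x, y, w, h, dx, dy, dw, dh): each position component advances by its velocity.
def predict(state, dt=1.0):
    """Predict the next bounding-box state under a constant-velocity model."""
    F = np.eye(8)
    for i in range(4):       # position/size components += velocity * dt
        F[i, i + 4] = dt
    return F @ state

state = np.array([100., 50., 40., 30., 2., -1., 0., 0.])
print(predict(state))  # x -> 102, y -> 49, box size unchanged
```

In the full filter this prediction is paired with a measurement update from the matched detection; the sketch shows only why velocity terms appear in the track state.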

Distinguished Region Tracking-Based Vehicle Counting
As mentioned above, for counting vehicles, virtual lines are set in each direction to record the traffic volume, as shown in Figure 10. Specifically, in each frame, the current centroid position (p_cur^v) of each detected vehicle is computed. The vehicle is then counted at a certain virtual line if its centroid passes the line, and the movement of the vehicle is determined based on the line through which the vehicle entered the area. Algorithm 1 demonstrates vehicle counting using virtual lines in more detail.
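The line-crossing test at the heart of this scheme can be sketched as follows: a vehicle is counted when its centroid changes sides of a virtual line between consecutive frames. The sign test via the cross product is one common way to implement this; the line and centroid coordinates are illustrative.

```python
# Sketch of the virtual-line counting rule: count a vehicle when its centroid
# crosses the line between consecutive frames.
def side(line, p):
    """Signed side of point p relative to the directed line (cross product)."""
    (x1, y1), (x2, y2) = line
    return (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)

def crossed(line, prev_centroid, cur_centroid):
    """True if the centroid moved to the other side of the virtual line."""
    s1, s2 = side(line, prev_centroid), side(line, cur_centroid)
    return s1 * s2 < 0  # strict sign change => the line was crossed

line = ((0, 100), (200, 100))              # horizontal virtual line at y = 100
print(crossed(line, (50, 90), (52, 110)))  # True: centroid crossed the line
print(crossed(line, (50, 90), (52, 95)))   # False: stayed on the same side
```

The sign of `side` before crossing also tells which direction the vehicle came from, which is how the entering line determines the movement.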
However, the main challenge in monitoring the turning movements of vehicles is that each vehicle must be tracked over a long time range. Therefore, the ID-switch problem, especially in scenarios with heavy occlusion and varying lighting conditions, significantly affects counting performance. In this regard, we define a set of distinguished regions to reduce the range of tracking: instead of using virtual lines, we use distinguished regions for counting the vehicles. We define a set of distinguished regions R in each scenario/camera that covers all the movements for monitoring vehicles; the number of regions depends on the number of turning movements.
For instance, Figure 11 depicts an example of the set of regions for monitoring the movements of vehicles. In particular, instead of using virtual lines, we use distinguished regions for tracking and counting vehicles, which improves vehicle counting performance for the two following reasons:
• Reducing the range of tracking.
• Avoiding occlusion in the case of multiple vehicles passing at the same time.

Conflicted Regions Tracking:
Since different scenarios have different geographies and camera viewing angles, vehicles might move across multiple regions. In some specific cases, a single vehicle would therefore be counted in multiple movements. In order to deal with this issue, we define a list T, the tracking list of vehicles: once a vehicle belongs to the list T, it is not re-tracked when it moves into other regions, as demonstrated in Figure 12. Moreover, another issue with overlapped regions in a given movement is that a vehicle might not be tracked in its original region but in another one because of detection problems (e.g., heavy occlusion); consequently, the movement count for that vehicle would be wrong. For example, in the scenario of Figure 10, the vehicle in Movement 5 would be counted in Movement 1 if it could not be detected in Region 2. In this regard, we define a set of blank regions R_0 between tracking regions to reduce wrong counting in this situation. Specifically, since a blank region captures a vehicle when it is not detected in its original region, the rate of wrong movement counting is reduced, as demonstrated in Figure 13.
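The tracking-list rule can be sketched as a simple first-region-wins assignment: once a vehicle ID enters the list T with a region, later region hits are ignored. The vehicle ID and region names below are illustrative.

```python
# Sketch of the tracking-list rule: the first region a vehicle is seen in is
# recorded in T; passing through other (overlapping) regions does not re-assign it.
def assign_region(track_region, vehicle_id, region_id):
    """Record the first region a vehicle is seen in; ignore later regions."""
    if vehicle_id not in track_region:      # vehicle not yet in list T
        track_region[vehicle_id] = region_id
    return track_region[vehicle_id]

T = {}
print(assign_region(T, 7, "R1"))  # vehicle 7 first seen in R1
print(assign_region(T, 7, "R2"))  # later overlap with R2 is ignored: still R1
```

This guards against double counting when regions overlap; the blank regions R_0 handle the complementary failure, where a vehicle is missed in its original region entirely.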

Multi-Movement Counting Formulation:
The vehicles are counted when they move into the exit area (region) of the scenario. Accordingly, when a vehicle passes the exit region, the corresponding movement is counted based on the information of the source and destination regions. Supposing R_i = {r_i1, r_i2, ..., r_in} is the set of regions in scenario/camera i, the target vehicle v is tracked in region r if v moves into r, which can be formulated as follows:

p_tl^r ≤ p_0^v ≤ p_br^r

where p_0^v represents the centroid point of vehicle v, and p_tl^r and p_br^r are the top-left and bottom-right corners of region r, respectively. Consequently, the set of movements in scenario i, M_i = {mo_i1, mo_i2, ..., mo_ik}, is counted based on the information of the exit region and the tracking region of each vehicle. In particular, Algorithm 2 demonstrates the modification of our proposed method compared with using multiple virtual lines in Algorithm 1.
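The region-containment test and the (source region, exit region) movement lookup can be sketched as follows. The movement table is an illustrative assumption, not the paper's actual MOI definition.

```python
# Sketch of the formulation above: a vehicle is tracked in region r when its
# centroid lies between r's top-left and bottom-right corners, and the
# (source, exit) region pair is mapped to a movement ID when it leaves.
def in_region(centroid, region):
    (tlx, tly), (brx, bry) = region
    return tlx <= centroid[0] <= brx and tly <= centroid[1] <= bry

MOVEMENTS = {("R1", "Exit_N"): 1, ("R2", "Exit_E"): 5}  # assumed mapping

def count_movement(counts, source_region, exit_region):
    """Increment the movement count looked up from the region pair."""
    mov = MOVEMENTS.get((source_region, exit_region))
    if mov is not None:
        counts[mov] = counts.get(mov, 0) + 1
    return counts

print(in_region((50, 60), ((0, 0), (100, 100))))  # True: centroid inside
print(count_movement({}, "R1", "Exit_N"))         # {1: 1}
```

Compared with the virtual-line scheme, the tracker only needs to remember a vehicle's source region until it reaches an exit region, which is the short-term tracking the framework relies on.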

Data Description and Experiment Setup
The dataset contained 39 video clips captured from the same area, where we pre-processed each video to a length of around 10 min. The input videos were classified into three time periods: morning (7 a.m.-9 a.m.), afternoon (12 p.m.-2 p.m.), and night (5 p.m.-7 p.m.), which covered different conditions (e.g., lighting). The resolution of each video was around 1080p at 30 Frames Per Second (FPS). The Regions and Movements Of Interest (ROI and MOI) were also determined as input information. Figure 14 demonstrates the considered area for the implementation, and Table 1 shows the parameters used for the implementation. We took four classes of vehicles into account: Car, Bus, Truck, and Bike.

Experiment Results
Regarding the experiments, we first compared the counting results with the ground truth in several representative videos from each time period. Then, we compared our proposed method with vehicle counting using virtual lines, as presented in Algorithm 1. Finally, we used the counting results to analyze traffic flow over a certain time period. Figure 15 depicts screenshots of the multi-class multi-movement vehicle counting framework for three considered videos belonging to different time periods. The experiments were run on a PC with a Core i7 CPU, 16 GB of RAM, and 32 GB of GPU memory, where the GPU was used for acceleration. As observed from the experiments, the FPS was around 12 to 14, depending on the traffic flow density. Table 2 shows the counting results compared with the ground truth. Specifically, under good lighting conditions and low traffic density, our proposed method was able to count more than 90% of the vehicles correctly. Moreover, it achieved around 88% and 84% in the cases of high traffic density and night conditions, respectively. In more detail, Table 3 shows the accuracy for each movement of the first video. In particular, movements 1, 3, 4, and 8 (Figure 10), which belong to the side opposite the camera, showed lower vehicle monitoring performance than other movements. As observed from the results, another cause of decreased accuracy was high traffic density, since occlusion occurred frequently. Moreover, the similar appearance features of Bus and Truck sometimes caused wrong detections, especially when the view of the detected vehicle changed rapidly (e.g., Movement 2). However, the counting system was able to achieve good results by applying our proposed method. In particular, Figure 16 compares our results with vehicle counting using virtual lines (Algorithm 1).

Traffic Analysis Based on Counting Results
As mentioned above, one of the main applications of vehicle counting is to determine traffic patterns and analyze turning volumes for smart traffic control applications in complex areas. In this regard, we performed several analyses of traffic flow based on the counting results from the proposed framework. Specifically, Figure 17 demonstrates the average traffic turning volume in the morning (from 7 a.m. to 9 a.m.). As shown in the figure, there was a large difference in traffic volume across movements, which could be an important input for dynamic traffic light control to improve traffic flow [15]. Furthermore, Figure 18 shows the asymmetric traffic volumes in the considered area over different time intervals. The traffic patterns make sense, since the time intervals correspond to going to work (Figure 18a) and coming home (Figure 18b), respectively.

Conclusions and Future Work
Recently, with the successful development of DL for computer vision, video-based vehicle counting has become a promising solution for analyzing traffic flow. However, this task remains challenging due to the high similarity of vehicle appearances, heavy occlusion under high traffic density, and large variation in viewing perspectives. In this study, we presented a comprehensive framework for multi-class multi-movement vehicle counting. Specifically, we first adopted state-of-the-art object detection and tracking methods, namely YOLO and DeepSORT, for monitoring vehicles. Furthermore, a distinguished-region tracking approach was proposed in order to improve vehicle tracking. As shown in the experiments, our proposed method achieved high counting performance when evaluated on real data that we collected and pre-processed in a certain area.
From our point of view, several issues could further improve the proposed framework for vehicle counting: (i) training the detection process on an appropriate dataset that distinguishes more types of vehicles (e.g., sedan, SUV, van, pickup truck, main truck, tractor-trailer, and 18-wheeler trucks), since the COCO dataset only provides four vehicle classes (car, bus, truck, and bike); and (ii) automating the determination of distinguished regions for tracking vehicles, which is still a manual process; an optimal approach for generating the regions could improve counting performance. Moreover, tracking and counting vehicles across multiple cameras (the multi-camera tracking problem) using the proposed framework would enable large-scale traffic flow analysis. These are interesting issues that we will take into account in future work.

Conflicts of Interest:
The authors declare no conflict of interest.