Efficient Video-Based Vehicle Queue Length Estimation Using Computer Vision and Deep Learning for an Urban Traffic Scenario

Abstract: In the traffic engineering realm, queue length estimation is considered one of the most critical challenges in Intelligent Transportation Systems (ITS). Queue lengths are important for determining traffic capacity and quality, such that the risk of blockage in any traffic lane can be minimized. Vision-based sensors show huge potential compared to fixed or moving sensors, as their rapid, large-scale deployment offers flexibility for data acquisition. Compared to other sensors, they offer low installation and maintenance costs and also support other traffic surveillance tasks. In this research, a CNN-based approach for estimating vehicle queue length in an urban traffic scenario using low-resolution traffic videos is proposed. The system calculates queue length without knowledge of any camera parameters or onsite calibration information. Estimation in terms of the number of cars is prioritized over queue length in meters, since vehicular delay is the number of waiting cars times the wait time. Therefore, this research estimates queue length based on total vehicle count. However, length in meters is also provided by approximating the average vehicle size as 5 m. The CNN-based approach enables accurate tracking of vehicle positions and computation of queue lengths without the need to install any roadside or in-vehicle sensors. Using a pre-trained 80-class YOLOv4 model, overall accuracies of 73% and 88% were achieved for vehicle-based and pixel-based queue length estimation, respectively. After further fine-tuning of the model on low-resolution traffic images and narrowing the output classes down to the vehicle class only, average accuracies of 83% and 93%, respectively, were achieved, which shows the efficiency and robustness of the proposed approach.


Introduction
The world has seen much growth in urbanization. Half of the population on Earth is already living in urban areas, and it is estimated that, by 2030, 5 billion people will be joining the populations of towns and cities, according to the United Nations Population Fund [1]. The incredible increase in road traffic due to urbanization has led to several transportation and traffic management problems, such as traffic congestion and the burden on traffic surveillance [2,3]. Research conducted under Intelligent Transportation Systems (ITS) can help with reducing the possibility of road accidents, infrastructure planning, signal violation detection, identification of suspicious vehicles, queue length detection, and vehicle classification [4][5][6][7][8][9]. It can mitigate the road issues caused by urbanization and is a way forward for smart cities.
Technological advances in wireless communication and computation power have brought revolutions to ITS. For example, developments in connected vehicles are booming and contributing to the implementation of next-generation ITS. Connectivity in vehicles enables the collection of the data necessary for better traffic management and control [10]. Recent advances in computer vision and deep learning have also enabled better vehicle annotation in traffic surveillance. The rapid, large-scale deployment of surveillance cameras all over the world has shown great potential for collecting vehicular data in a flexible way. ITS technologies such as smart roadside sensors, cameras with enhanced analytics functionalities, and dedicated communication bands are widely adopted around the world [3]. This combination of cameras and sensors significantly reduces the number of traffic accidents and improves traffic flow.
Computer vision is arguably one of the most exciting growing fields in Artificial Intelligence, bringing transformative approaches to image understanding and video analysis in several fields, especially Intelligent Transportation Systems. One of the areas where ITS has a major impact is the queue length estimation of vehicles. Figure 1 shows a vehicle queue at an intersection. Vehicle queue length is defined as the distance from the stop line (the marked line at which vehicles must halt while the signal is red) to the tail of the last vehicle stopped in any lane on the road. Vehicle queue length at a signalized intersection is recognized as a crucial parameter with respect to performance measures such as traffic capacity and quality, vehicle delays, and stops at a signal. It is also important for traffic optimization at a signal in terms of traffic congestion resolution and the prediction of traffic, intersection delay, travel times, etc. [11]. This automatic measurement of traffic parameters can greatly help traffic engineers plan a signalized road intersection to minimize the risk of lane blockage and to provide a better level of service by informing drivers beforehand so that they can choose alternative routes to avoid delay.
Traffic and vehicle queue length information is generally extracted using fixed sensors such as loop detectors, radar, pneumatic road tubes, infrared sensors, vehicle GPS devices, and magnetic sensors, or other moving sensors such as drones. Road sensors are promising due to their advanced technology; however, their installation and maintenance costs are comparatively high, and the installation process causes considerable traffic disruption. Magnetic sensors are used on highways to detect vehicles through changes in the Earth's magnetic field, but their efficiency is easily affected by temperature and humidity. Radar sensors are very accurate and easy to install, but their main disadvantage is high exposure to electromagnetic interference. Compared to these sensors, vision-based sensors use images obtained from surveillance cameras to determine the presence and location of surrounding vehicles. They can detect vehicles across several lanes and can report vehicle presence, flow rate, occupancy, etc. They can also classify vehicles by their lengths and detect the speed of each vehicle type. Vision is the most beneficial option because it can perform all of these tasks along with queue length detection. Vision-based sensors have significant advantages for queue length estimation in complex real-time environments, such as low installation and maintenance costs, simplicity of use, and lower sensitivity to environmental changes like extreme weather conditions.
In this research work, a state-of-the-art CNN-based vehicle detection model is proposed, which performs efficient queue estimation to address urban traffic challenges. The proposed model can estimate the queue lengths of vehicles (particularly cars) without knowledge of any on-site camera calibration information. Using the developed model, a fairly accurate estimation of vehicle queues in dense urban traffic can be made.
The major contributions of the paper are highlighted as follows:

1. A detailed comparison of deep learning architectures, including SSD, ResNet, Inception, and YOLO, with inexpensive training dynamics for the estimation of vehicle queue lengths based on images obtained from low-resolution traffic surveillance cameras.

2. Selection of an application-specific dataset with a far-field fixed camera view for better estimation of queue lengths, while also considering the depth of a queue and camera angles.
The remainder of the paper is organized as follows: Section 2 reviews the literature regarding queue length estimation. Section 3 describes the methodology adopted to estimate the queue length. Section 4 discusses the implementation of the proposed model in detail. Section 5 discusses and demonstrates the effectiveness of the proposed method. Finally, Section 6 concludes the paper and presents a future research direction related to the proposed research.

Literature Review
Several works in the literature present efficient models for queue length estimation. A common approach is to develop a mathematical model of queue length using data collected from sensors installed in vehicles or on the road. The literature on queue length can be categorized into three groups:
• The first is based on fixed-location sensors, i.e., sensors mounted in a stationary position such as a road sign or traffic light.
• The second uses mobile sensors that are in constant motion, i.e., placed in a car or other vehicle.
• The third is based on CNN-based object detection architectures using urban traffic camera feeds.
In works such as [11][12][13], fixed-location detectors such as loop detectors and traffic cameras are installed on the roadside. These sensors gather traffic flow information such as the number of vehicles, speed, and travel time, which is then fed to models that approximate the queue length. In works such as [14][15][16][17], traffic flow information is reported from in-car mobile sensors or vehicle-connected technologies such as GPS sensors, RFID sensors, or custom-made sensors [18]. Different techniques are used to model queue length for both fixed-location and mobile sensor-based methods, with shock wave theory being the prevailing one.
Amini et al. proposed a real-time queue length estimation system based on a stochastic gradient descent technique that learns detector bias and estimates the queue length. The problem was considered at a traffic intersection with biased and noisy vehicle count observations. Blokpoel et al. compared three algorithms for queue length estimation: the first uses GPS data only, the second adds a wave speed algorithm, and the last assumes low penetration rates on the road and combines traditional stop line detection with cooperative detection, which resulted in traffic light delay reductions of up to 33.6% thanks to a better estimate of queue length. RFID detector data-based queue length estimation is proposed in [19]. The authors exploit the queue delay of individual vehicles to measure intersection queue length in actual lengths (meters). The model was tested in real time on a single-approach intersection during the morning peak and had an average estimation error of about 15%. However, the proposed model does not transfer well to intersections with multiple approaches because of the added complexity of shockwave theory-based modeling. The authors in [20] determine the number of queued vehicles in each lane. They used upstream and downstream detectors to develop variant models, and a Kalman filter was implemented to find the total traffic volume in each lane. Their best model achieved an average RMSE of fewer than six vehicles when applied to data from a real junction. In [14], a method based on traffic shockwave polygons in the time frame is proposed by Ramezani et al. The method not only captures the relationship between successive intersections but can also be applied in oversaturated conditions (when the number of vehicles on the road exceeds design capacity). It produced promising results compared to the standard arrival-based queue length estimation process. In [15], a cycle-by-cycle queue length estimation method is proposed for probe vehicle data, which is obtained from vehicles participating in traffic flow. The results were benchmarked against the paper from Ramezani et al. and showed higher accuracy under a low penetration rate.
Ki An et al. devised a numerical model for queue length estimation at a metering roundabout [21]. It helps create gaps in the circulating stream, addressing the problem of excessive queuing and delays caused by unbalanced flow patterns and high demand flow levels. The model estimates queue lengths based on data such as advance detector locations and the time phases of the traffic signal. Drones were used to capture the ground-truth queue lengths, which were used to calibrate and validate the model. Model performance was analyzed using the R2 test and achieved an 83% prediction rate. Tiaprasert et al. collected data from the subset of connected vehicles upstream of the stop line to determine the approximate positions, speeds, and number of vehicles, which then leads to an approximation of queue length. This method requires a certain penetration ratio of connected vehicles (the ratio of connected vehicles to all vehicles in the queue) to achieve a good estimation of queue length. Another study in [22] also uses probe vehicle technology for queue length estimation. Using an optimal Bayes rule model, promising results were produced for lane identification, which is further used for queue length estimation. The numerical results showed that a 40% penetration rate is needed to reach about 90% accuracy.
In [23], a real-time vehicular queue length measurement system for intersections based on camera feeds is proposed. The system first detects the queue with a frame differencing method that analyzes motion in focused areas. When there is no motion, the vehicles in that area are counted using CNN-based SSD networks. The system works with video feeds obtained from stationary cameras. Similarly, a system was proposed in [24] for real-time measurement of vehicle queue parameters utilizing traffic camera feeds. It is based on vehicle presence detection and movement analysis in a set of random traffic videos acquired by a stationary camera, and it can detect queues in lanes of interest. Albiol et al. presented an approach to estimate traffic queue lengths in real time by collecting and evaluating low-level features such as corners to detect the presence of vehicles [25]. The features are then classified as either moving or static. The algorithm only requires the locations of lane masks, the starting point of queues, and other similarly simple setup. Shirazi et al. presented a tracking method for estimating queue lengths and the waiting times of vehicles at junctions [11]. They placed a motion area at a location near the camera where vehicles generally do not stop and applied background subtraction in this area to detect all moving vehicles. The vehicles are continuously tracked until they stop, and the total number of vehicles that cross the motion area helps with the estimation of queue length. The only drawback is that the system is unable to detect queue lengths beyond this motion area. In [11], research work on estimating queue length using camera feeds is carried out. They employed an enhanced optical flow technique to estimate queue length and waiting time in highly cluttered video. However, the outcomes under difficult tracking conditions, such as reduced features for small and poor-quality vehicles, were not satisfactory, and the algorithm would sometimes lose track of vehicles that are stationary. The authors of [26] collected traffic images from custom-installed cameras that imitate actual traffic cameras, which constitute the training and testing sets for their YOLO-based models. The training of their models makes use of transfer learning [27]: the initial parameters of their model are loaded from a model with weights trained on the ImageNet dataset, and further fine-tuning of the network is performed on the traffic images they obtained. In addition to changing the sizes of the convolutional layers of the YOLO network, converting the output of the network from an 80-class classifier into a binary classifier also improves its performance. The final improved model is reported to have achieved an average accuracy of greater than 90% in detecting vehicles in traffic flows of different densities.
In summary, there is limited research on CNN-based object detection architectures for queue length estimation at traffic junctions using urban traffic camera feeds. The reviewed literature produces encouraging results at a single intersection but is not suitable for a network of intersections. In addition, the level of complexity increases when designing adaptive signal control systems. Vehicle tracking in diverse environments, such as low-resolution images, extreme weather, or loss of tracks for long-waiting stationary vehicles, still needs further enhancement.

Research Methodology
The overall proposed queue length estimation pipeline is shown in Figure 2: the object detection model YOLOv4 performs inference on the traffic videos of choice, and the vehicle location output from the model is used to calculate queue lengths with three different methods.

Definition of Queue Length
Queue length is the distance from the stop line to the last vehicle in a queue during one signal cycle. It is measured using two methods: the first uses vehicle count information, and the second uses image pixels as the metric. The pixel-based method normally requires knowledge of camera parameters for calibration purposes. However, in traffic control applications, knowing the number of waiting vehicles is more important than the physical length in meters; therefore, vehicle count is chosen as the variable of prime interest in this research. To simulate the pixel-based queue length (without camera calibration) for each lane at any intersection, the average vehicle size was approximated as 5 m. The ground truth of queue length was established by summing all the visible vehicles in each lane, regardless of the distance between the vehicle and the traffic light. This is a simplified approach to queue length estimation that does not take any physical/pixel distance into account. Since the location of vehicles is defined by bounding boxes, the pixel distance between the front and last vehicle in the queue (shown as red lines in Figure 3) is used to calculate queue length. The pixel distances are calculated using the midpoints of the bounding boxes of the front and back vehicles in the queue. Multi-object tracking is performed with Deep SORT [28]. Deep SORT uses appearance features to track objects over longer periods of occlusion; it relies primarily on appearance metrics, with bounding box distance used as a gating mechanism. Deep SORT tracks multiple detected objects by assigning a tracking ID to each unique vehicle identified (as shown in Figure 4a). A stopped vehicle is identified at the stop line by monitoring the movement of these unique vehicles across consecutive frames: when a vehicle shows no motion in two consecutive frames, it is identified as stopped at the stop line (e.g., Figure 4a). After that, the queue length is calculated (as shown in Figure 4b).
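To make this concrete, the following is a minimal Python sketch of the queue length computation described above, assuming each lane's stopped vehicles are given as bounding boxes in (x1, y1, x2, y2) pixel coordinates ordered from the front to the back of the queue; the function names, the ordering assumption, and the 2-pixel motion tolerance are illustrative and not taken from the paper's codebase.

```python
from math import hypot

AVG_VEHICLE_LENGTH_M = 5.0  # average vehicle length assumed in this paper

def box_midpoint(box):
    """Midpoint of a bounding box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def is_stopped(prev_mid, curr_mid, tol_px=2.0):
    """Treat a tracked vehicle as stopped if its midpoint moves less than
    tol_px pixels between two consecutive frames (tolerance is illustrative)."""
    return hypot(curr_mid[0] - prev_mid[0], curr_mid[1] - prev_mid[1]) < tol_px

def queue_length(stopped_boxes):
    """Return (vehicle count, pixel length, approximate length in meters)
    for one lane's stopped vehicles, ordered from front to back of queue."""
    count = len(stopped_boxes)
    if count == 0:
        return 0, 0.0, 0.0
    front = box_midpoint(stopped_boxes[0])
    back = box_midpoint(stopped_boxes[-1])
    # Pixel distance between the midpoints of the front and last vehicle.
    pixel_len = hypot(back[0] - front[0], back[1] - front[1])
    # Without calibration, meters are approximated from the vehicle count.
    return count, pixel_len, count * AVG_VEHICLE_LENGTH_M
```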

Identification of Suitable Video Streams
Originally, the Toronto traffic cameras were selected as the input video streams to be analyzed. The camera feeds hosted by Toronto Open Data [29] include images from hundreds of traffic cameras in the Toronto city area. Five camera feeds were selected, each with a unique viewing angle of the traffic lanes (see Appendix A, Table A1). Figure 5 shows footage from one of the selected cameras. The purpose of selecting five cameras was, first, to assess the performance of popular object detection models at detecting vehicles in low-resolution images and, second, to choose unique video feeds for queue length detection. However, the camera feeds provided by Toronto Open Data only update the camera image about once per minute, which is not sufficient to form continuous image frames or videos of a red-light/green-light session. Moreover, we found that the cameras change their position many times a day; at times, the cameras do not even face the traffic lanes, so no vehicles are captured. Thus, the nature of the Toronto Open Data camera feeds makes them difficult to use for training a vehicle detection model.
Subsequently, we selected available traffic camera videos featuring traffic intersections in Burlington, Ontario. Compared to the camera footage from Toronto Open Data, the Burlington videos consist of continuous, hour-long frame sequences, which makes them a good option both for training vehicle detection models and for validating real-time queue length analysis. The videos also offer scenes of different depths, i.e., there are videos with vehicle queue lengths ranging from 20 m to 100 m. This offers the opportunity to explore the effectiveness of this research in different queue length distance scenarios.

Selection of Algorithm/Model
A challenge with camera-feed-based queue length detection has been the unavailability of efficient scene analysis systems for video images. With the advancements in computer vision and deep learning, the data retrieval capacity of video analytics systems has improved immensely. One major aspect of queue length detection using camera feeds is the ability to detect and localize vehicles near the traffic signal line. Object detection is one of the fundamental computer vision problems, providing a semantic understanding of sample images. It aims at locating and recognizing the objects in an image and then generating regional proposals with a confidence of existence [30]. This paper uses a Convolutional Neural Network (CNN) based architecture to perform efficient object detection (i.e., vehicle detection) in the given camera feeds.
There are several CNN-based object detection architectures available off the shelf with very high reported accuracies. In [31], the authors implemented three detection architectures, Faster R-CNN, R-FCN, and SSD, to compare accuracy, speed, and memory utilization for a given application. The results reveal that, with fewer proposals, Faster R-CNN has the highest speed and, more importantly, no major loss in accuracy. One of the most popular detection architectures was proposed by the authors in [32]. They showed that You Only Look Once (YOLO), the latest approach to object detection at the time, surpassed these other architectures. It uses only one evaluation to predict the bounding boxes and class probabilities from full images. YOLO has so far received five major upgrades, with YOLOv5 being the latest and offering better performance in terms of both accuracy and speed than the previous versions. YOLO is often used for various vehicle-related recognition tasks and has shown significant improvements in terms of processing time and accuracy [33][34][35][36].
The capabilities of modern object detection models make it possible to approach the queue length estimation task from the perspective of vehicle detection and analysis in traffic videos or images. High accuracy in vehicle detection enhances the plausibility of using camera-based methods to identify queue lengths, as it produces accurate bounding boxes for vehicles.
The inference results of different pre-trained state-of-the-art object detection models were then compared on a set of images from the Toronto traffic cameras. Models tested include YOLOv5, YOLOv4, YOLOv3, SSD, ResNet 101, and Inception V3. Firstly, YOLO models implemented in Keras with a TensorFlow backend were used to detect vehicles in the selected traffic camera images. A pre-trained object detection model can predict 80 COCO classes in each image passed through the model pipeline. However, all non-vehicle classes are irrelevant to this research and were suppressed, i.e., only the prediction classes of interest were displayed in the output images and terminal. In addition, the confidence threshold of predictions was lowered to surface as many vehicle detections as possible. Appendix A, Table A1 shows the output of a pre-trained YOLO model on selected traffic camera images. The other models evaluated were likewise pre-trained on the 80-class COCO dataset and run on the same selected camera images. The models and inference pipeline are available from an official TensorFlow open-source repository, and the same modifications/adjustments made in the pipeline for the YOLO models were applied to the other models as well.
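As an illustration, the class restriction and lowered confidence threshold described above can be expressed as a simple post-processing filter. This is a hedged sketch: the detection tuple layout and the 0.25 threshold are assumptions for illustration, not values from the paper.

```python
# Restrict an 80-class COCO detector's output to vehicle classes only,
# with a lowered confidence threshold. Class names follow the standard
# COCO label set; the threshold value is illustrative.
VEHICLE_CLASSES = {"car", "truck", "bus", "bicycle", "motorcycle"}
CONF_THRESHOLD = 0.25  # lowered from a typical 0.5 to keep more vehicles

def filter_detections(detections):
    """Keep only vehicle-class detections above the confidence threshold.
    Each detection is assumed to be a (class_name, confidence, box) tuple."""
    return [(name, conf, box) for (name, conf, box) in detections
            if name in VEHICLE_CLASSES and conf >= CONF_THRESHOLD]
```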
Appendix A, Table A2 shows a comparison of all these models. The YOLO models clearly performed better than the other object detection models of choice. The SSD model failed to recognize most of the vehicles in the images, while ResNet and Inception performed equal to or less accurately than the YOLO models. Moreover, the time required to perform inference on one image is much longer for the ResNet and Inception models. Since fast inference speed is an important design criterion for real-time queue length detection, YOLO was chosen as the object detection model for the given problem.
Inference was then performed on the images from the Burlington traffic videos using the same 80-class pretrained YOLOv3, YOLOv4, and YOLOv5 models. Good vehicle detection accuracies were obtained under varying camera angles for the different YOLO models, as shown in Appendix A, Tables A3-A5. However, the pre-trained models failed to recognize vehicles at the back of traffic lanes or vehicles reduced to tiny pixel regions in the low-resolution images. They also failed to recognize vehicles that are stacked between queues of vehicles. Therefore, transfer learning was considered as a method to improve the vehicle detection accuracy of the YOLO model. YOLOv4 and YOLOv5 showed similar performance (as shown in the tables mentioned above), and the Deep SORT implementation used was compatible with the YOLOv4 implementation. Therefore, YOLOv4 was ultimately used for the experimentation.
Since the object detection predictions for only vehicle classes are used in the queue length calculation, all vehicle classes were treated as one class when labeling images for transfer learning. Instead of predicting 80 classes, a single-class object detection model simplifies the inference process, as there is no need to hide other classes in the final output. It would also be possible to perform transfer learning on five separate vehicle classes; however, given the unbalanced image data (cars appear far more often than trucks, buses, or motorcycles in traffic images), training a single-class object detection model is more reasonable. As shown in Figure 6, a significant improvement in vehicle detection accuracy is achieved after applying transfer learning with a single vehicle class to the original object detection model.

Generation of Dataset
The datasets for queue length inference and for training the proposed vehicle detection model are sets of traffic videos from the city of Burlington, Ontario, Canada. The dataset videos were obtained in the form of frame sequences (.seq files) with a resolution of 320 × 240 pixels. Citilog software was used to open this set of videos. To convert the videos into a format acceptable to the OpenCV framework, they were re-recorded using screen capture software. The converted format has 25 frames/s, which differs slightly from the frame rates of the original videos; most originals have fewer than 20 frames/s, and some even lower. However, since queue length does not need to be reported at very small time intervals (less than 0.1 s, for instance), using these recorded low-frame-rate videos is acceptable for evaluating queue length at or within every second. Ten videos were selected from the pool, each representing a unique camera angle or lighting condition at an intersection. Some of the cameras are tilted sideways, which gives a tilted view of the driveway, and some videos were captured in winter conditions with snow on the ground. The pool was subsequently narrowed down to three videos corresponding to the longest queue lengths of 25 m, 60 m, and 80 m, respectively.
As a pre-processing step, the camera feeds were analyzed to identify the lanes at the intersection and the stop lines. However, calibration against actual ground measurements was not employed, so most of the reported results are defined in image pixel form. The average vehicle length is approximated as 5 m, which gives an estimated longest queue length. This estimation takes into account the different types of vehicles on the road, including sedans, trucks, buses, etc.; since all vehicle types were merged into a single class for accuracy improvement purposes, 5 m was taken as an average across all vehicle types. Ideally, the selected traffic junction would be calibrated, i.e., the image pixels translated to actual distances on the ground. However, the focus of this research was total vehicle count, so pixel estimation was done without calibration information.
An example snapshot taken from a random video among the pool of selected videos is provided in Figure 7. As shown in the snapshot, the left lane covers the longest area, about 5 cars, which makes 25 m the longest queue length range for that video. The camera viewing angle was generally frontal for all videos, which caused problems regarding vehicle poses, including the occlusion of vehicles in close vicinity. Frequent movements in camera positions added further problems. Table 1 shows a specification comparison of some of the Burlington videos.

Dataset for Transfer Learning
To avoid using identical images in transfer learning, we acquired images from videos that were not used for queue length evaluation. The videos were converted into individual frames, which were then manually filtered to avoid having two or more highly similar images.
The images used in transfer learning have traffic lanes with physical lengths between 30 m and 80 m. The number of vehicles in each lane ranges from 0 to more than 5, which covers all the queue length scenarios in the selected videos. In total, 524 frame samples from a Burlington video not used in the queue length evaluation were manually picked. The frames were then manually labeled as "vehicle", with coordinates marked according to the format specified for training YOLO models. Labeling was done with the open-source tool LabelImg [37]. Figure 8 shows an example of a labeled traffic image, where bounding boxes are drawn manually around the vehicles. The coordinates of the bounding boxes are recorded by the tool and exported in the YOLO/Pascal VOC labeling format, after which a Python script was written to convert them into the format suitable for the YOLO implementation. The YOLOv4 model was then trained with a single output class (i.e., vehicle) on the 524 frame samples. A 9:1 train-validation split ratio was selected to maximize learning without significantly overfitting the training set.
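The conversion script itself is not included in the paper; the following is a minimal sketch of what such a Pascal VOC to YOLO format conversion might look like, assuming LabelImg's standard XML output, a single class ID of 0 for "vehicle", and an illustrative labels/ directory.

```python
import glob
import xml.etree.ElementTree as ET

def voc_to_yolo(xml_path, txt_path, class_id=0):
    """Convert one Pascal VOC XML file (LabelImg output) to YOLO format."""
    root = ET.parse(xml_path).getroot()
    w = float(root.find("size/width").text)
    h = float(root.find("size/height").text)
    lines = []
    for obj in root.iter("object"):
        b = obj.find("bndbox")
        x1, y1 = float(b.find("xmin").text), float(b.find("ymin").text)
        x2, y2 = float(b.find("xmax").text), float(b.find("ymax").text)
        # YOLO format: class cx cy bw bh, all normalized to [0, 1].
        cx, cy = (x1 + x2) / 2 / w, (y1 + y2) / 2 / h
        bw, bh = (x2 - x1) / w, (y2 - y1) / h
        lines.append(f"{class_id} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}")
    with open(txt_path, "w") as f:
        f.write("\n".join(lines))

# Illustrative directory layout: one .txt written next to each .xml label.
for xml_file in glob.glob("labels/*.xml"):
    voc_to_yolo(xml_file, xml_file.replace(".xml", ".txt"))
```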

Queue Length Result Reporting
The final accuracy for queue length estimation in each video segment at every second is calculated using the following methods. Let q_ti be the prediction for the i-th frame within the t-th second, and let n be the number of frames per second:

1. The first frame of every second is sampled, and its inference is taken as the queue length: Q1(t) = q_t1.
2. All frames within a second are inferred, and the average of the queue lengths over all frames is taken: Q2(t) = (1/n) ∑_{i=1}^{n} q_ti.
3. All frames within a second are inferred, and the longest predicted queue length among all frames is chosen: Q3(t) = max(q_t1, q_t2, ..., q_tn).
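For clarity, the three reporting rules reduce to one-liners; this sketch assumes q_t is a non-empty list of the per-frame queue length predictions within second t.

```python
def report_first(q_t):
    """Q1(t): take the prediction of the first frame in the second."""
    return q_t[0]

def report_average(q_t):
    """Q2(t): average the predictions over all frames in the second."""
    return sum(q_t) / len(q_t)

def report_max(q_t):
    """Q3(t): take the longest predicted queue length in the second."""
    return max(q_t)
```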

Hardware and Software Configuration
Inference on the traffic videos was performed on a laptop with an i5-6300HQ CPU, a GTX 960M GPU with 2 GB of VRAM, and 8 GB of RAM, using Python 3.5.2, TensorFlow 2.2.0, and OpenCV 3.4.2. For training the YOLO model, a Google Colaboratory computing platform with a 2-core Intel Xeon CPU, an Nvidia Tesla K80 GPU with 12 GB of VRAM, and 12 GB of RAM was used, with Python 3.6.7 and TensorFlow 2.2.0.

Code Implementation
The queue length inference and model training software pipelines are built on top of an open-source YOLOv4 implementation in TensorFlow. The original code provides ways to make inferences on images or videos with different kinds of object detection models, as well as to train an object detection model. To perform queue length inference, videos are processed with the OpenCV library and fed to the object detection model frame by frame; the detections are passed through Deep SORT to identify stopped vehicles at the stop line. The queue length for every traffic lane at every second is then calculated based on the outputs of the object detection model and the rules set in the queue length definition. When inference is done with the pre-trained 80-class YOLOv4 model, the output classes are restricted to the vehicle classes only (car, truck, bus, bike, and motorcycle). The calculated queue lengths are then exported to Excel.
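The per-frame processing loop might look like the following simplified sketch. The video file name is illustrative, and detect_and_measure is a stub standing in for the YOLOv4 detection, Deep SORT tracking, and stopped-vehicle logic described above, so the snippet runs standalone rather than reproducing the paper's actual pipeline.

```python
import cv2

def detect_and_measure(frame):
    """Stub for YOLOv4 detection + Deep SORT tracking + the stopped-vehicle
    queue rule from Section 3; returns a queue length in vehicles."""
    return 0

cap = cv2.VideoCapture("traffic_video.avi")   # illustrative file name
fps = int(cap.get(cv2.CAP_PROP_FPS)) or 25    # recorded videos run at ~25 fps
per_second = []      # one list of per-frame predictions for each second
frame_buffer = []

while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame_buffer.append(detect_and_measure(frame))
    if len(frame_buffer) == fps:              # one second of frames collected
        per_second.append(frame_buffer)       # later reduced by the Q1/Q2/Q3 rules
        frame_buffer = []
cap.release()
```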

Training of a New Object Detection Model
Fine-tuning allows modern object detection models to be trained for new object categories with a heavily reduced amount of training data. In this research, the number of images used to train the single "vehicle" class is 524. To perform fine-tuning, the 80-class COCO-pretrained YOLOv4 model was first selected as the base model. The fine-tuning process makes use of the features already extracted in the pre-trained model and optimizes the neuron weights in the new output layers to reflect the change in the number of classes to predict.
Changing the 80-class prediction to a 1-class prediction requires changing the output tensor size from N × N × 255 to N × N × 18, where 255 and 18 are the number of predictions made by YOLOv4 per grid cell, calculated with the formula 3 × (5 + m), where 3 is the number of bounding boxes predicted per grid cell at each of YOLOv4's three feature map scales, 5 is the number of parameters predicted for each bounding box (the dimensions and location of the box and the confidence score of the prediction), and m is the number of classes to be predicted by YOLOv4. With m = 80, this gives 3 × 85 = 255; with m = 1, it gives 3 × 6 = 18.
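The arithmetic can be checked with a two-line helper (the function name is illustrative); in the reference Darknet configuration, this same quantity appears as the filters=(classes + 5)*3 setting of the convolutional layer preceding each [yolo] layer.

```python
def yolo_head_channels(num_classes):
    """Output channels per YOLO head: 3 boxes per cell * (5 params + classes)."""
    return 3 * (5 + num_classes)

print(yolo_head_channels(80))  # 255 for the COCO-pretrained model
print(yolo_head_channels(1))   # 18 for the single "vehicle" class model
```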
The training process of the new YOLOv4-based single-class object detection model unfreezes the weights of all model layers and fine-tunes all the weights in the entire YOLOv4 model. After the early stopping conditions for training are met, the final model is obtained and used to perform queue length inference on traffic videos. The loss functions used for training were the same as in the original YOLOv4: the center location of a bounding box, the object confidence, and the class classification are predicted by logistic regression using cross-entropy losses, while the width and height of a bounding box are predicted by linear regression with a mean square error.
Training records were logged and visualized with custom Python code. The training loss (graph shown in Figure 9) dropped from 20.0 to 1.9 after 1000 iterations of weight fine-tuning and stabilized at 1400 iterations with slight fluctuations. The Mean Average Precision (mAP) starts at 94% at 1000 iterations, falls to 93% at 1400 iterations, and reaches its highest value of 95% at 1700 iterations. The slight increase in training loss near the end indicates a possibility of overfitting with the 9:1 train-test split; although the train and test sets were randomly sampled, some overfitting is still observed. However, the resulting object detection model still outperforms the original vehicle detector, as shown in Section 5. Since there was no major change in loss or accuracy after 1700 iterations, training was stopped early to avoid further overfitting. The final model used for queue length inference went through around 1900 iterations of fine-tuning on all weights of the model, with the mAP holding around 93% to 95% and the training loss averaging 1.47.
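The visualization itself is straightforward; below is a hedged sketch of how such a loss/mAP plot could be produced with matplotlib, using a few of the checkpoint values reported above as example data. The log structure and the intermediate loss values between checkpoints are assumptions, not the paper's actual log.

```python
import matplotlib.pyplot as plt

# (iteration, training loss, mAP %) checkpoints; the 1400- and 1700-iteration
# loss values are illustrative, as only the endpoints are reported in the text.
records = [(1000, 1.9, 94.0), (1400, 1.6, 93.0), (1700, 1.5, 95.0), (1900, 1.47, 94.0)]
iters, losses, maps = zip(*records)

fig, ax_loss = plt.subplots()
ax_loss.plot(iters, losses, color="tab:blue", label="training loss")
ax_loss.set_xlabel("iteration")
ax_loss.set_ylabel("training loss")
ax_map = ax_loss.twinx()                      # second y-axis for mAP
ax_map.plot(iters, maps, color="tab:orange", label="mAP (%)")
ax_map.set_ylabel("mAP (%)")
fig.tight_layout()
plt.savefig("training_curves.png")
```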

Results
The accuracies were compared for the selected videos both before and after applying fine-tuning. Figure 10 shows how the three vehicle count measurement methods discussed in Section 3 compare over the red-light intervals in each of the traffic videos, using the 80-class COCO-pretrained YOLO model. The poorest results are seen in the B34M red-light sessions. The reason could be that the pretrained YOLOv4 model sometimes fails to detect vehicles at the very back of a lane or vehicles that are stacked together. The B20M video has more frames where most vehicles are at the back, which makes vehicle recognition harder because of the smaller area and the blurry vehicle appearance that blends into the surroundings. Since the vehicles at the back are not always detected, the average or single-frame inference gives a shorter queue length compared to the maximum queue length in a second.

Tables 2 and 3 show the improvement in queue length accuracies (in terms of vehicle count and pixel distance, respectively) after applying fine-tuning to the YOLOv4 network. The results for each video have been averaged across all lanes and all three queue length measuring methods. The greatest improvement, 30%, is seen in a single traffic light scenario after applying the new model to the sampled videos. Most of the videos improve significantly after fine-tuning, except for the count accuracies in video B20M. This improvement results from the increased accuracy in recognizing vehicles in the videos: fine-tuning enables the YOLOv4 network to recognize vehicles that are small or whose colors easily blend into the road, as shown in Figure 11. In addition, the new model detects cars stacked in queues that were not picked up by the original network, as in videos B34M and B30S. The confidence scores assigned to detected vehicles are also observed to be much higher than before. Nevertheless, there is still room for improvement, as the vehicle count accuracies for the B20M video are much lower than for the other two videos. The lack of extremely stacked scenarios in the fine-tuning labels likely accounts for the model's weaker recognition in the B20M video; if video frames with the same stacked vehicle density were used in the fine-tuning labels, the recognition results should improve.
Figure 11. Difference in vehicle detection in video B34M at 52 s: (a) before fine-tuning (cars and buses marked in purple and green, respectively); (b) after fine-tuning (vehicles marked in yellow).

Conclusions and Future Work
In this research, a queue length estimation method based on AI and computer vision technologies using low-resolution traffic surveillance cameras is proposed. The queue length of vehicles is obtained in terms of the total length in meters from the vehicle at the front to the last vehicle in any particular lane. The overall queue length is estimated based on the approximation that the average vehicle is 5 m in length. All lanes are manually labeled on the given images. Identification of stopped vehicles at the stop line, which defines the start of a lane, is done through the Deep SORT implementation. Ground truths are established by summing all vehicles in each lane. The results show decent performance for vehicle recognition using low-resolution traffic camera feeds, which demonstrates the effectiveness of the proposed state-of-the-art CNN-based object detection architecture for queue length estimation. It is shown experimentally that CNN-based architectures are effective in improving vehicle detection accuracies and that fine-tuning is highly effective in improving accuracies even for low-resolution images.
The presented research findings for the estimation of queue lengths at traffic junctions can be further enhanced by including intrinsic and extrinsic camera parameters. This camera calibration information would help obtain physical queue length distances instead of pixel ones. Similarly, because the bounding box of a large vehicle (e.g., a truck) is larger, the queue length estimate changes as the average vehicle size varies; this could be addressed in the future by labeling vehicles with different classes to account for their sizes. Vehicle recognition accuracies can be further improved by changing the underlying neural network structures of the models. Similarly, transfer learning can be applied to other state-of-the-art object detection models, which may result in higher queue length accuracies, although inference speeds may suffer with those models.

Data Availability Statement: All the relevant data are publicly available and can be accessed at the following link: https://drive.google.com/drive/folders/1DI7-81g92__bdaJrSolHrCVUmBB86wKo?usp=sharing (accessed on 22 August 2021).

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A

Table A3. Vehicle detection accuracies on selected images (from the Burlington videos) using YOLOv3 (cars and trucks marked in purple and blue, respectively). Table columns: Video Sample, Original Image, Recognized Result Image, Ground Truth, Vehicles Detected, Accuracy.

Table A4. Vehicle detection accuracies on selected images (from the Burlington videos) using YOLOv4 (cars marked in red, green, blue, and purple; buses marked in light green).

Table A5. Vehicle detection accuracies on selected images (from the Burlington videos) using YOLOv5 (cars, buses, and trucks marked in orange, neon green, and light green, respectively).