Applying Ternion Stream DCNN for Real-Time Vehicle Re-Identification and Tracking across Multiple Non-Overlapping Cameras

The increase in security threats and the huge demand for smart transportation applications for vehicle identification and tracking with multiple non-overlapping cameras have gained a lot of attention. Moreover, extracting meaningful and semantic vehicle information has become an arduous task, with frameworks deployed on different domains to scan features independently. Furthermore, existing identification and tracking approaches have largely relied on only one or two vehicle characteristics. They have managed to achieve a high detection quality rate and accuracy using Inception ResNet and pre-trained models but have had limitations in handling moving vehicle classes and were not suitable for real-time tracking. Additionally, the complexity and diverse characteristics of vehicles made it impossible for the algorithms to efficiently distinguish and match vehicle tracklets across non-overlapping cameras. Therefore, to disambiguate these features, we propose a Ternion stream deep convolutional neural network (TSDCNN) over non-overlapping cameras that combines all key vehicle features such as shape, license plate number, and optical character recognition (OCR). We then jointly investigate the strategic analysis of visual vehicle information to find and identify vehicles in multiple non-overlapping views. As a result, the proposed algorithm improved the recognition quality rate and recorded a remarkable overall performance, outperforming the current online state-of-the-art paradigm by 0.28% and 1.70% on the vehicle rear view (VRV) and VeRi776 datasets, respectively.


Introduction
Traffic monitoring is an indispensable tool used for collecting statistics to enable better design and control of transport infrastructure [1][2][3]. As a result, many applications have emerged to improve traffic management, focusing only on vehicle counting in urban streets [4]. However, plain vehicle counting was proven not to be sufficient for locating and distinguishing between vehicle types or models [5]. Vehicle localization and identification [6] have therefore become an area of interest in the computer vision community for solving vehicle-related criminal activities such as theft in urban areas [6]. Additionally, numerous research projects were conducted to solve various environmental challenges in vehicle detection, re-identification, and tracking across multiple camera views [5,7]. However, most of these proposed algorithms operate offline and are not ideal for real-time tracking [8]. Moreover, it is difficult for human observers [9] to remember and efficiently distinguish between a wide variety of vehicle makes and models [10], and it is an arduous task for a human being to monitor dozens of screens for incoming and outgoing vehicle models [11]. Therefore, in an attempt to resolve this issue, refs. [12][13][14][15][16] proposed algorithms to distinguish vehicles based on shape, size, traveling speed, and distance from camera views. However, these algorithms disregarded the vehicles' visual information and had trouble handling similarities in the shapes, sizes, and colors of the vehicles, which resulted in poor detection.
In conceptualizing multiple streams for multi-task handling in the computer vision community, refs. [34,35] illustrated how a three-stream CNN with multi-task learning can be ensembled to learn and classify 3D objects. This inspired [36,37], who proposed a multi-level feature extraction DCNN for vehicle Re-ID across non-overlapping cameras. Their algorithm was robust in combining small image patches but had difficulty recognizing far, blurry, and out-of-range vehicle images. Moreover, it could not differentiate vehicles of the same make, model, and color [38] with different plates.
Inspired by these discoveries, we propose the Ternion stream DCNN to detect, track, and Re-ID vehicles. Our algorithm uses three DCNN streams to provide an accurate estimate of the direction and number of moving vehicles based on non-overlapping camera views. We use each stream independently to extract vehicle characteristics such as shapes and plates, with an integrated OCR stream for low character resolutions, and then fuse the attributes from the streams to form a complete vehicle surveillance recognition and tracking system. We further use distance descriptors and vectors to measure the vehicles' similarities. This paper is organized as follows: in Section 1 we give an introduction and background; Section 2 describes the Ternion stream DCNN framework; Section 3 lists the experiment materials and parameter settings; Section 4 presents the results; in Section 5, we discuss and analyze the results; and finally we draw our conclusions in Section 6.

Proposed Ternion Stream DCNNs for Real-Time Vehicle Tracking (VT)
The main task is to track and re-identify the target across these multiple cameras [19][20][21]. We, therefore, designed our algorithm to detect, track, and re-identify vehicles across several non-overlapping cameras based on the Ternion stream DCNN (TSDCNN) framework. We implemented the proposed algorithm using datasets that contain different shapes of vehicles [23] and different illumination conditions. The algorithm relies on three DCNN streams, which are independently used to extract vehicle characteristics such as shapes and plates. The detection of plates was buttressed [24,26] with the integration of the OCR stream, which has proven to cater to illegible characters on damaged plates and low resolutions. To the best of our knowledge, there is no similar proposed algorithm for real-time object tracking across multiple non-overlapping cameras. The TSDCNN architecture and working procedure are presented in Figure 1. The model is trained using datasets with videos of multiple vehicle types. The input to the first stream is the sequence of 96 × 96 pixel images cropped from the RGB frames, on which an application detector performs detection, Re-ID, and searches in two phases. In the first phase, the model extracts rear vehicle shapes and searches whether a similar vehicle from the video frame images has already appeared over the camera network; this is performed solely based on vehicle shape appearance. In the second phase, we extract the vehicle plates in a small region and feed them into the plate stream DCNN. However, the detection and reading of plates at low resolutions becomes more challenging with intra-class similarities, viewpoint changes, and inconsistent environmental conditions. We, therefore, introduced and integrated the OCR stream to minimize these problems.
Then, we agglutinate the streams to obtain more comprehensive vehicle information for distinguishing, recognizing, and associating the vehicle's tracklets across non-overlapping cameras. Additionally, we assume that V = {V_1 ... V_n} ⊂ R^p represents the collection of n vehicles to be recognized and classified. Furthermore, we merge the features' distance descriptor vectors and independently share the weights within each stream. We then apply the Euclidean method to calculate the similarity based on the distance between the license plates of the query image and the gallery images from the multiple cameras' video frames. Additionally, we combine the three streams' extracted features and compute the similarities between vehicle images. Finally, we add these to the convolutional fully connected layers and activate the softmax function for vehicle matching and classification.
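The query-versus-gallery Euclidean matching described above can be sketched as follows. This is a minimal NumPy illustration with hypothetical feature dimensions; the function name `euclidean_similarity` is our own and not taken from the paper's implementation:

```python
import numpy as np

def euclidean_similarity(query_feat, gallery_feats):
    """Rank gallery feature vectors by Euclidean distance to the query
    (smallest distance = most similar vehicle)."""
    dists = np.linalg.norm(gallery_feats - query_feat, axis=1)
    order = np.argsort(dists)
    return order, dists[order]

query = np.array([0.1, 0.9, 0.3])
gallery = np.array([[0.1, 0.9, 0.3],   # same vehicle re-appearing
                    [0.8, 0.2, 0.5],
                    [0.2, 0.8, 0.4]])
order, dists = euclidean_similarity(query, gallery)
```

In a full pipeline, the gallery rows would be the fused descriptors extracted from the multi-camera video frames, and the top-ranked entries would be the candidate re-identifications.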
By implementing this strategy, our algorithm staunchly recognizes and classifies each vehicle {V_i ∈ V} based on shapes, plates, and OCR. Furthermore, it locates the vehicle in every class {C_i ∈ C}. As a result, we rewrite the formulas as follows: where p denotes the vehicle features, V = {V_1 ... V_n} represents a subset of the vehicle sample, and C = {C_1 ... C_n} denotes the subset of the vehicle class sample.

Data Collection and Preparation Process
We used public datasets (VeRi 776 and vehicle rear view (VRV)), which are collections of video sequences captured with non-overlapping cameras. The videos V^vid = {V^vid_1 ... V^vid_N} are the input source and have been converted to frames fr = {fr_1 ... fr_N}, which are further subdivided into images Img = {Img_1 ... Img_N} (see Figure 2). The images extracted from the sequential video frames are treated interchangeably as simple input sources that are fed to the Ternion stream DCNN. Then the special vehicle features, such as shape, license plate, and low-resolution characters, are recognized simultaneously and extracted from the input. Additionally, morphological operations and segmentation techniques are used to remove background noise. To improve vehicle detection and readability for small areas, the OCR stream has been added as a third stream alongside the shape and plate streams. The Ternion stream DCNN is then used to independently extract the unique vehicle characteristics. To create an efficient framework, we shared the parameters among the stream networks and implemented the ROI to generate appropriate bounding frames. As a result, vehicle tracking centroids and the Euclidean distance method were used to calculate the distance and movement of each vehicle from frame to frame. Moreover, we calculated the similarities between frames and estimated the loss by feeding the last layers of the framework to the contrastive loss function as follows:

L(Y, X_1, X_2) = (1 − Y) · (1/2) · D_w² + Y · (1/2) · {max(0, m − D_w)}²,

where D_w denotes the Euclidean distance between the outputs of the Ternion stream DCNN framework and m is the margin value.
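The contrastive loss above can be sketched numerically as follows. This is a minimal Python version assuming the standard label convention (Y = 0 for a same-vehicle pair, Y = 1 for a different-vehicle pair); the paper does not state its convention explicitly:

```python
def contrastive_loss(d_w, y, margin=1.0):
    """Contrastive loss over the distance d_w between the stream outputs.

    y = 0: genuine (same-vehicle) pair -> penalize large distances.
    y = 1: impostor pair -> penalize distances inside the margin.
    """
    same = (1 - y) * 0.5 * d_w ** 2
    diff = y * 0.5 * max(0.0, margin - d_w) ** 2
    return same + diff
```

A genuine pair at zero distance incurs no loss, while an impostor pair inside the margin is pushed apart; impostor pairs beyond the margin contribute nothing.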
Thus, we further express D_w as:

D_w(X_1, X_2) = ‖G_w(X_1) − G_w(X_2)‖_2,

where G_w denotes the output of the framework, m represents the margin value, and the Y values indicate whether the inputs are from the same class.


Appearance Features' Learning and Handling
Currently, existing vehicle Re-ID methods typically extract only the global appearance characteristics. However, since vehicles of the same make and model have a similar global appearance, extracting only global features for vehicle re-identification makes it difficult to distinguish between them. Consequently, both global and local attributes are crucial for improving the feature representation, discrimination, and robustness of vehicle algorithms. We, therefore, propose using both global and local inputs to build more discriminatory representations of the vehicles. Additionally, we introduce the three-stream DCNN framework, where the built-in OCR serves as a third stream to process the low-resolution and illegible plate characteristics. The proposed algorithm takes the images from sequential frames as input and divides them into regions, which are fed to the framework to predict bounding boxes and ROI likelihood values. Each stream is applied simultaneously to extract vehicle features and to generate discriminators independently with separate weights. Additionally, we flattened the features from the ROI pooling layer into a vector and fed them to the final fully connected layers of the framework. Finally, the feature vectors of the streams were concatenated to form significant unique information for vehicle detection, tracking, Re-ID, and matching.
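The flatten-and-concatenate step can be sketched as below. The feature-map shapes are hypothetical placeholders, not the paper's actual layer dimensions:

```python
import numpy as np

def fuse_streams(shape_feat, plate_feat, ocr_feat):
    """Flatten each stream's ROI-pooled feature map and concatenate the
    three descriptors into one joint vector for matching."""
    return np.concatenate([np.ravel(shape_feat),
                           np.ravel(plate_feat),
                           np.ravel(ocr_feat)])

# Illustrative shapes: a 4x4 shape-stream map and 8-dim plate/OCR vectors.
fused = fuse_streams(np.zeros((4, 4)), np.zeros(8), np.zeros(8))
```

The fused vector is what would be passed on to the fully connected layers and the softmax classifier.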

License Plate-Based Vehicle Re-ID
Depending on the distance from the cameras, multiple plates must frequently be detected and recognized in chaotic conditions, and illumination, viewing angle, and occlusion make this challenging. Segmentation has been proven to facilitate feature isolation; to tackle these issues, several algorithms have emerged that rely on template matching. The template matching technique, however, was ineffective and did not offer a durable answer. To validate and match license plates, we instead suggest using the plate stream neural network, which is intended to generate robust discriminative feature vectors from pairings of license plates. Moreover, to make the plate symbols and characters easier to read, it is combined with the parallel OCR neural network stream. The two neural network streams are continuously and consistently trained to match the output pixel images. This improved our method, as it became possible to design features resistant to geometric distortions in the query image and to learn the best shift-invariant local feature detectors. To ensure that the distance between license plate pairs of the same vehicle is small and the distance between license plate pairs of different vehicles is large, we also calculated the Euclidean distance between the feature vectors.
This problem is presented and solved mathematically as follows: supposing X_1 and X_2 are an input pair of vehicle license plates and ς is a binary label of the pair, the Euclidean distance is calculated as follows:

D_w(X_1, X_2) = ‖ζ_w(X_1) − ζ_w(X_2)‖_2,

where w denotes the weights of the convolutional neural network, and ζ_w(X_1) and ζ_w(X_2) represent the features extracted from the X_1 and X_2 images of the vehicle license plates, respectively. Moreover, we defined the contrastive loss as follows:

L(w) = Σ_i L(w, (ς, X_1, X_2)^i),
L(w, (ς, X_1, X_2)^i) = (1 − ς) L_s(D_w^i) + ς L_d(D_w^i),

where (ς, X_1, X_2)^i is the ith tuple of training vehicle license plates, L_s is the partial loss function for the same license plate, and L_d is the partial loss function for different license plates.
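A plate-verification decision on top of this distance can be sketched as below; the threshold value is purely illustrative and not taken from the paper:

```python
import numpy as np

def same_plate(feat1, feat2, threshold=0.5):
    """Declare two plate feature vectors a match when their Euclidean
    distance falls below a decision threshold (threshold is a
    hypothetical value for illustration)."""
    return float(np.linalg.norm(feat1 - feat2)) < threshold
```

In practice the threshold would be tuned on a validation set so that same-vehicle pairs fall below it and different-vehicle pairs above it.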

Experiments Settings Implementation Details
We evaluated the algorithm on an Intel Core i7 machine with 64 GB DRAM and an NVIDIA Titan Xp GPU, running the Ubuntu 16.04 LTS operating system. Furthermore, we developed the entire framework in Python; the libraries used were mostly NumPy, Matplotlib, scikit-learn, and SciPy.
Training Settings: The first experiment was performed using the public vehicle rear view dataset (VRV), consisting of ten videos (i.e., 56,028 vehicles, including motorcycles and buses) with a resolution of 1920 × 1080 pixels, recorded by different cameras with a 20 min duration. We set the learning rate to γ = 0.0001 and ran 500 iterations. We trained the framework on 80% of the data (8 videos), and the rest (20% = 2 videos) was reserved for the test phase. In the second experiment, we used the VeRi 776 dataset, which consists of 776 vehicles with a total of 50,000 images captured by 20 different cameras. Each vehicle in the dataset is recorded by 2 to 18 cameras with different lighting, resolution, and occlusion. In both datasets, the videos are split and converted into frames, from which images cropped to 96 × 96 pixels are extracted and input to the network (see Figure 3 for the DCNN architecture structure). Testing Settings: We simulated the real-world problem in real-time by adding the parameters and multiplying the false negatives in the test sample set (20%). Then, we applied the appearance-based stream to both datasets independently and examined the license plate validation for a specific vehicle search. We boosted the plate stream with OCR to improve the readability of plate characters under lighting variations. Several convolutional neural network layers were shared and feature maps were generated, which were passed to the shape stream, the plate stream, and the OCR stream to generate more discriminating features. The extracted feature map was further divided into local vehicle areas, and each area was embedded in the pooling layer and the fully connected layers to generate descriptive feature vectors.
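The frame-to-network preprocessing (cropping a detected region and resizing it to the 96 × 96 input) can be sketched as below. This is a dependency-free NumPy sketch using nearest-neighbour resampling; the box format and function name are our own assumptions, not the paper's code:

```python
import numpy as np

def crop_to_network_input(frame, box, size=96):
    """Crop a detected vehicle region from an H x W x 3 frame and resize
    it to size x size pixels via nearest-neighbour index sampling.

    box = (top, left, height, width), a hypothetical detector output.
    """
    t, l, h, w = box
    patch = frame[t:t + h, l:l + w]
    rows = np.arange(size) * patch.shape[0] // size
    cols = np.arange(size) * patch.shape[1] // size
    return patch[rows][:, cols]

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)  # one full-HD frame
crop = crop_to_network_input(frame, (100, 200, 240, 320))
```

A production pipeline would typically use an interpolating resize (e.g. bilinear) instead of nearest-neighbour sampling.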
Finally, the fully connected layer of the combined stream features was passed to the attribute chaining layer and softmax for matching. Additionally, we evaluated and compared the algorithm performance based on precision (P), recall (R), F score, and accuracy. These metrics are expressed mathematically as follows:

Precision = TruePositives / (TruePositives + FalsePositives)
Recall = TruePositives / (TruePositives + FalseNegatives)
F score = 2 × P × R / (P + R)
Accuracy = (TruePositive + TrueNegative) / (TruePositive + TrueNegative + FalsePositive + FalseNegative)

where P and R denote precision and recall, respectively.
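These evaluation metrics are sketched below as a small helper computed from confusion-matrix counts (the counts in the example are made up for illustration):

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F score, and accuracy from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f_score, accuracy

p, r, f, a = classification_metrics(tp=90, fp=10, fn=10, tn=90)
```

Equivalent functions are available in scikit-learn (`precision_score`, `recall_score`, `f1_score`, `accuracy_score`), which the paper lists among its libraries.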

Results
In this section, we present the training and validation results of our proposed algorithm (TSDCNN). We start with the performance of vehicle detection and matching. Thus, Figure 4a-e illustrates the overall detection rate performance and effectiveness of our algorithm's detector and vehicle appearance associations for both datasets.

In other instances, our algorithm reacted to new vehicle entries that entered and exited scenes through the camera's left acute-angle views. The detector further checked whether the entry of the new vehicle fell within the acceptance range of the trajectories. The error between the actual observation, the appearance similarity, and the predicted observation was normalized using the Euclidean method. Furthermore, this confirmed the vehicle correlations and resulted in satisfactory precision, recall, and F score measurements during the training and validation processes. These results are presented in Tables 1 and 2 and visualized in Figures 5-8, and they are analyzed and discussed in the Results Analysis section.
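The trajectory acceptance check described above (associating new detections with existing tracks by centroid distance) can be sketched as follows. This is a simplified greedy version with a hypothetical `max_dist` gate, not the paper's exact tracker:

```python
import numpy as np

def match_centroids(prev, curr, max_dist=50.0):
    """Greedily associate vehicle centroids across consecutive frames by
    Euclidean distance; detections farther than max_dist from every
    existing track are treated as new vehicle entries."""
    matches = {}
    used = set()
    for i, c in enumerate(curr):
        dists = [np.linalg.norm(c - p) for p in prev]
        j = int(np.argmin(dists))
        if dists[j] <= max_dist and j not in used:
            matches[i] = j          # current detection i continues track j
            used.add(j)
    return matches

prev = np.array([[100.0, 100.0], [400.0, 300.0]])  # tracks from frame t-1
curr = np.array([[105.0, 102.0], [900.0, 50.0]])   # detections in frame t
m = match_centroids(prev, curr)
```

Here the first detection continues an existing track, while the second lies outside the acceptance range and would start a new track.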

Results Analysis
In this section, we analyze our TSDCNN algorithm's results obtained from the experimental data. We trained and evaluated our detector and classifier based on a couple of combined vehicle characteristics descriptors using the VRV and VeRi776 datasets for real-time multi-vehicle tracking. The vehicle detection quality and matching are shown in Figure 4, whereas in Figures 5-8, we illustrate the overall performance and effectiveness of our algorithm's detector for both the training and validation phase.
The algorithm has proven effective, with high performance in precision and recall, as shown in Tables 1 and 2. This demonstrates that it can be trusted to accurately detect, learn, match, and correctly classify the vehicle area of interest based on various features. However, the algorithm had challenges with non-representative data at the beginning of the training and testing phases but gradually converged with more training epochs. This is shown in Figures 5 and 7, where the learning plots begin with statistics values jumping up and down due to vehicle appearance attributes and variations in entry/exit angle views in both datasets. This led to the poor recognition quality rates illustrated in Figure 4c,d and contributed to the high number of false positive classifications and mismatches, as clearly shown in Figure 4a,b. This is highlighted in Figures 6 and 8, where the algorithm's training losses and gains over 500 epochs project the performance well on both the VRV and VeRi776 datasets. However, the algorithm showed better performance on the VeRi 776 scenes, which posed more difficult challenges, such as vehicle shape rotation with 225° multiple views, compared to the VRV dataset. As shown in Figure 4c, for the most part, the algorithm learned, identified matches, and effectively tracked all of the different vehicle shapes under various difficult conditions. This proves the robustness of our algorithm against strong lighting and oblique viewing angles. However, despite the improved performance, Figures 5 and 7 show that there were still comparable issues with the unrepresentative data, primarily during the training and testing phases for both datasets. Furthermore, the algorithm converged better at epoch 270, which showed that it had enough data during the training phase, although initially there seemed to be problems with insufficient data at the beginning of training.
Therefore, the bouncing [34] could be due to data fitting, as we can see that training the algorithm with more epochs produced better and more stable results for both datasets.


Performance Comparison with State-of-the-Art Methods
To demonstrate the overall accuracy performance and specificity of our algorithm on the VRV and VeRi776 datasets, we compared our precision results to state-of-the-art paradigms. The results are summarized in Tables 3 and 4, respectively. From Tables 3 and 4, it can be observed that our strategy outperformed the current online state-of-the-art paradigm by 0.28% and 1.70%, respectively. This proves that the proposed approach's training procedure is more convenient for real-time vehicle tracking than the many other methods presented in Tables 3 and 4. However, as illustrated in Figure 4c,d, it faced challenges with angle-view rotations and, as a result, suffered from overlapping detection boxes, misdetections, and vehicle re-identification errors.

Conclusions
In this paper, we presented a deep convolutional network based on a three-stream learning approach for real-time vehicle tracking and re-identification (Re-ID) problems. The study was conducted on public vehicle datasets with multiple views for vehicle detection, identification, and tracking, based on the three streams' combined descriptors: shape, plate, and OCR. This deep neural network model extracted both the global and local visual features, which more robustly represent the vehicles' characteristics. Specifically, the streams are effectively used to extract the shape and plate. Additionally, we enhanced the license plate stream verification technique with the integration of OCR for low-resolution character reading and then used the deep neural network to calculate the appearance similarity and trajectories with the Euclidean method. This improved the overall results and proved that the proposed technique produces better detection rates and data associations. However, our algorithm experienced a poor detection rate on fast-moving vehicles. Therefore, our future work will involve implementing the algorithm for tracking multiple fast-moving vehicles on a huge dataset with a 360° angle view.