1. Introduction
Multiple Object Tracking (MOT) is one of the most fundamental problems that have been addressed in computer vision and robotics. Tracking is an important building block in various tasks of computer vision such as surveillance [
1], autonomous driving and advanced driver assistance systems [
2], or industrial inspection [
3]. Even though it has attracted the interest of many researchers over several decades, the problem of multiple object tracking has not yet been solved. Many MOT methods follow a tracking-by-detection framework, in which an object detector identifies objects in each frame and an association method then links detections to tracks in order to maintain their identities across all frames of a given image sequence. MOT methods can be divided into online and offline approaches according to how they use the object detection information in the image sequence. Offline methods [
1,
4] handle the tracking problem as a global optimization problem and use all detections available from the whole image sequence when assigning unique track identities to these detections. Therefore, offline methods can only be applied when the whole image sequence is available. In contrast, online methods are more suitable for real-time applications since they rely only on the object detection information up to the current frame. These real-time solutions have also shown competitive tracking accuracy on international benchmarks [
5,
6].
The challenges that appear in multi-object tracking can be split into two main categories: sensor-related issues and data association problems. The sensor-related issues include:
- The number of objects within the field of view (FOV) of the sensor, which may be unknown and in different states.
- Objects enter and leave the sensor FOV; therefore, it is necessary to have good object management and object identity management.
- Since the object detector is not perfect, it may be susceptible to two kinds of errors, missed detections (due to environment conditions, object properties, or occlusions) and false detections or clutter (a detection that is not caused by an object). Both types of errors could lead to disastrous outcomes if they are not handled correctly.
The main idea of the data association problem is that there is no information regarding the origin of a detection or what real object caused it. Hence, we can split the challenges for treating the data association problem into two categories:
- The origin uncertainty: There is no knowledge about how the new measurements relate to previous sensor data, and
- Motion uncertainty: Objects can have multiple motion patterns, which may change in consecutive frames.
Poor handling of the data association problem may lead to bad tracking results. The issues mentioned above have been addressed by many researchers who have studied the tracking problem for different kinds of applications using different types of sensors such as single cameras [
7], stereo cameras [
8], LIDARs, RADARs [
9], or thermal cameras [
10]. Some solutions from the literature try to improve the performance of object tracking by fusing the information from multiple sensors [
9].
Even so, to ensure high-quality results and robustness against individual sensor failure, the tracking functionality must be reliable and the solution must not be centered around the functioning of a certain sensor.
Thermal cameras have attracted a lot of attention in the automotive field due to their ability to detect objects in bad weather conditions including rainy, snowy, or foggy weather. Other advantages of thermal cameras include their ability to function without a light source, the lack of saturation in the presence of the lights from oncoming vehicles, and the ability to detect people or animals from long ranges even at night, improving the reaction time of the driver. The main disadvantages of thermal images are that they do not contain as much information as the color or even the monochrome images, and they usually have a lower resolution, which makes the design of a data association function based on appearance even more difficult. There are two main directions in the literature for addressing this issue of data association: the feature engineering approaches [10,11,12,13,14,15,16,17,18,19,20,21] and the data-driven methods (using convolutional neural networks) [22,23,24,25,26,27,28,29,30,31]. The advantage of designing the data association function using data-driven methods is that after the convolutional neural network architecture is designed, through a learning process the best features are identified. The main issue with deep learning and with data-based models in general is that the object tracker may get latched onto the wrong object, which may be a false detection but looks similar to data from the training dataset, and never recover. Furthermore, if the data association model is not trained on parts of objects, the tracker can have a hard time tracking an object when it is partially occluded.
In contrast to data-driven methods, in feature engineering-based solutions the researchers manually design features and cost functions and use an optimization method [13] to assign the best measurement to each track. Feature engineering methods are faster than data-driven solutions; however, identifying the best features to use for each type of sensor is difficult.
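As a minimal illustration of the assignment step, the sketch below picks the track-to-measurement pairing with the lowest total cost. It is not the optimization method of [13]: an exhaustive search over permutations is swapped in because it is easy to verify and adequate for a handful of tracks. The function name `best_assignment` and the cost values are hypothetical.

```python
from itertools import permutations


def best_assignment(cost):
    """Return (total_cost, pairs) minimising the summed track-to-measurement cost.

    cost[t][m] is a hand-designed cost of assigning measurement m to track t.
    Exhaustive search: only practical for a small number of tracks.
    """
    n_tracks = len(cost)
    n_meas = len(cost[0]) if cost else 0
    if n_tracks > n_meas:
        raise ValueError("this sketch assumes at least as many measurements as tracks")
    best_total, best_pairs = float("inf"), []
    # Each permutation selects a distinct measurement for every track.
    for chosen in permutations(range(n_meas), n_tracks):
        total = sum(cost[t][m] for t, m in enumerate(chosen))
        if total < best_total:
            best_total, best_pairs = total, list(enumerate(chosen))
    return best_total, best_pairs


# Hypothetical costs for two tracks (rows) and three measurements (columns).
cost = [
    [0.2, 0.9, 0.7],
    [0.8, 0.1, 0.6],
]
total, pairs = best_assignment(cost)
print(total, pairs)
```

For realistic numbers of tracks, a polynomial-time solver for the linear assignment problem would replace the exhaustive loop; the interface (a cost matrix in, a list of pairs out) would stay the same.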
In this paper we present a data association and tracking solution for thermal images that exploits the benefits of both approaches. The proposed tracker was designed to track pedestrians in thermal images related to traffic scenarios. The contributions of this paper are the following.
- We designed a family of five Siamese convolutional neural networks that were combined to create a data-driven, appearance-based association score capable of working even in the case of partial occlusions. All five networks share a similar base architecture, whose design is also a contribution of this paper.
- We proposed a uniform local binary pattern descriptor obtained from edge orientations. This engineered feature is used to compute a similarity score between measurements and tracks, and this score is included in the data association score to provide adaptability to unknown scenarios.
- We created a dataset for training a CNN when designing an appearance-based data association function for tracking pedestrians in thermal images. The dataset is made publicly available.
- We merged the data-driven and feature engineered scores using a weighted combination and used the resulting score to perform data association and track objects.
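The weighted combination of the two scores can be sketched as follows. This is an illustration under stated assumptions, not the exact formula of the paper: both scores are assumed to be similarities in [0, 1], the two weights are assumed to sum to one, and the name `fused_score` and the default weight are hypothetical.

```python
def fused_score(data_driven, engineered, weight=0.5):
    """Convex combination of two similarity scores, both assumed to lie in [0, 1].

    weight is the share given to the data-driven score (assumed value, not from the paper).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return weight * data_driven + (1.0 - weight) * engineered


# Hypothetical scores for one track-measurement pair.
print(fused_score(0.8, 0.4, weight=0.75))
```

With a convex combination the fused score stays in the same range as its inputs, so a single association threshold can be applied regardless of the chosen weight.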
The rest of the paper is structured as follows. In
Section 2 we present the state of the art. In
Section 3 we describe the proposed contributions. In
Section 4 we illustrate the performance of the proposed solution, and in
Section 5 we conclude the paper.
4. Results
The proposed tracking framework was implemented in C++ and Python, and all test cases presented in this section were run on a computer with an Intel i7-4770K CPU at 3.5 GHz, 8 GB of RAM, and an NVIDIA GeForce GTX 1080 Ti GPU. The tracker had an average running time of 25 ms on the CPU and GPU (excluding the object detection part). The proposed data-driven score was implemented on the GPU, while the feature engineered score was implemented on the CPU.
For training the neural networks, the proposed dataset, presented in
Section 3.2.4, was used. The original dataset was augmented using the following operations: image flipping, salt-and-pepper noise, motion blur, Gaussian noise, image sharpening, and contrast normalization. The resulting dataset was split for training the proposed neural network architectures as follows: 70% training data, 10% cross-validation data, and 20% test data. Each model was trained for 40 epochs with a learning rate of 0.0005 using the RMSProp (root mean square propagation) optimizer. The results of the proposed models on the test set were the following: 98.34% for the model working on the entire image, 96.82% for the model working on the top left part of the image, 96.61% for the top right part, 95.92% for the bottom left part, and 96.01% for the bottom right part. The object detector employed in our solution was a YOLO [
36]-based detector, which was trained on the FLIR-ADAS [
37] dataset and fine-tuned on the CrossIR [
21] dataset obtained with a PathFindIR thermal camera. The CrossIR dataset contains images taken in various light conditions (day and night) and different weather conditions (sunny, rainy, foggy) and temperature conditions (cold and warm).
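The 70/10/20 split described above can be sketched as follows. This is a minimal illustration, assuming a simple random shuffle of sample identifiers; the function name `split_dataset`, the fixed seed, and the use of integers as stand-ins for image pairs are assumptions, not details from the paper.

```python
import random


def split_dataset(samples, train=0.7, val=0.1, seed=0):
    """Shuffle samples and split them into train / cross-validation / test lists.

    Whatever remains after the train and cross-validation shares becomes the test set.
    """
    items = list(samples)
    random.Random(seed).shuffle(items)  # fixed seed for a repeatable split
    n_train = int(round(train * len(items)))
    n_val = int(round(val * len(items)))
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


# Integers stand in for (augmented) training samples.
train_set, val_set, test_set = split_dataset(range(1000))
print(len(train_set), len(val_set), len(test_set))
```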
We compared the performance of the proposed tracker with other state-of-the-art solutions using the PTB-TIR benchmark [
38]. This benchmark contains multiple image sequences acquired using a thermal camera, each with manual annotations. One comparison metric used in this benchmark is the center location error (CLE), which is defined as the average Euclidean distance between the tracked object position and the ground truth position for that object. If the CLE is within a given threshold (20 pixels on the PTB-TIR benchmark), tracking is considered successful at that frame. Furthermore, the benchmark also offers results from multiple types of trackers on the given sequences such that the advantages and disadvantages of each method can be studied comparatively. In the evaluation of the proposed tracker on the PTB-TIR benchmark, we included only the sequences that were acquired from a vehicle-mounted camera, since the target application of our solution is related to intelligent vehicles. The evaluation result of the proposed solution with respect to the CLE metric on all the automotive sequences from the benchmark is displayed in the precision plot in
Figure 9. The numerical results and plots from both
Figure 9 and
Figure 10 were obtained using the PTB-TIR Evaluation Toolkit, which is presented in detail in [
38].
For better visibility, the values illustrated in
Figure 9 are also displayed in
Table 3.
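The CLE-based precision can be sketched as follows. This is a minimal illustration, not the PTB-TIR Evaluation Toolkit: it assumes object positions are given as (x, y) box centers in pixels and that "within the threshold" means less than or equal to it. The function names and the sample coordinates are hypothetical.

```python
import math


def center_location_error(pred_center, gt_center):
    """Euclidean distance, in pixels, between predicted and ground truth centers."""
    return math.hypot(pred_center[0] - gt_center[0], pred_center[1] - gt_center[1])


def precision_at(pred_centers, gt_centers, threshold=20.0):
    """Fraction of frames whose center location error is within the threshold."""
    errors = [center_location_error(p, g) for p, g in zip(pred_centers, gt_centers)]
    if not errors:
        return 0.0
    return sum(e <= threshold for e in errors) / len(errors)


# Hypothetical three-frame sequence: the last frame drifts 30 pixels off target.
pred = [(100, 50), (104, 53), (140, 50)]
gt = [(100, 50), (101, 49), (110, 50)]
print(precision_at(pred, gt))
```

Evaluating `precision_at` over a range of thresholds yields the points of a precision plot.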
Another score provided by the PTB-TIR benchmark is the overlap score, which measures the overlap ratio between the bounding box of the tracked object and the ground truth bounding box. Tracking is labelled successful at a frame if the overlap score is above a threshold. The success plot is used to rank the trackers with respect to their overlap score as the threshold varies from 0 to 1. In
Figure 10, the success plot is displayed.
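The overlap score and the points of a success plot can be sketched as follows. This is a minimal illustration, assuming that the overlap ratio is the intersection over union of two axis-aligned boxes given as (x, y, width, height); the function names, the number of thresholds, and the sample boxes are hypothetical.

```python
def overlap_score(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_curve(pred_boxes, gt_boxes, steps=21):
    """List of (threshold, success rate) pairs for thresholds spread over [0, 1]."""
    scores = [overlap_score(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    curve = []
    for i in range(steps):
        threshold = i / (steps - 1)
        # A frame counts as successful when its overlap is above the threshold.
        rate = sum(s > threshold for s in scores) / len(scores) if scores else 0.0
        curve.append((threshold, rate))
    return curve


# Hypothetical two-frame sequence: a perfect box, then a box shifted by half its width.
pred_boxes = [(0, 0, 10, 10), (0, 0, 10, 10)]
gt_boxes = [(0, 0, 10, 10), (5, 0, 10, 10)]
for threshold, rate in success_curve(pred_boxes, gt_boxes, steps=5):
    print(threshold, rate)
```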
In contrast to the top solutions from this benchmark, our method was designed keeping in mind the constraints of the automotive field. The proposed solution was able to track objects even in occluded scenarios, and it was able to track the detected objects even in environment situations that were not present in the training set. Moreover, the proposed approach was able to perform multiple-object tracking, not just single-object tracking.
Furthermore, the proposed solution is not very complicated to reproduce, does not require huge amounts of data for training, and can be easily augmented with other features.
In addition to the evaluation metrics presented above, we also evaluated the proposed solution using the MOTA (multi-object tracking accuracy) and MOTP (multi-object tracking precision) metrics. The equation for the MOTA is presented in Equation (12) and for MOTP in Equation (13).
The MOTA metric serves as a general error rate for trackers that takes into account all object configuration errors made by the tracker, such as false positives, misses, and mismatches, over all frames. The maximum achievable MOTA is 1, which indicates that the tracker made no errors. The second metric, MOTP, evaluates the precision of the bounding boxes. A distance metric is computed between all track hypotheses and ground truth bounding boxes and divided by the number of matched objects to obtain an average precision. These values are then summed over all frames of the testing sequence to compute the MOTP. The essential difference between the two metrics is that MOTP takes into account bounding box accuracy over time for tracked and matched objects, while MOTA summarizes tracking errors over time, including tracks that go unmatched. An IDSW (ID switch) occurs when a track is lost and re-initialized with a new ID or when the object identity is incorrectly swapped because of a wrong association between a track and a detection. In
Table 5 we illustrate the evaluation using the MOTA, MOTP, and IDSW of the proposed tracker in the context of multiple pedestrian tracking on the CrossIR dataset [
21].
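The two metrics can be sketched as follows. This is a minimal illustration that assumes the commonly used CLEAR MOT forms, in which MOTA is one minus the ratio of all errors to all ground truth objects and MOTP is the total match distance divided by the total number of matches; whether Equations (12) and (13) coincide with these forms term by term is an assumption. The per-frame dictionary layout and all counts are hypothetical.

```python
def mota(frames):
    """1 - (misses + false positives + ID switches) / ground truth objects, over all frames."""
    errors = sum(f["misses"] + f["false_positives"] + f["id_switches"] for f in frames)
    gt_total = sum(f["gt_objects"] for f in frames)
    if gt_total == 0:
        return 0.0
    return 1.0 - errors / gt_total


def motp(frames):
    """Total distance over matched pairs divided by the total number of matches."""
    total_distance = sum(sum(f["match_distances"]) for f in frames)
    total_matches = sum(len(f["match_distances"]) for f in frames)
    if total_matches == 0:
        return 0.0
    return total_distance / total_matches


# Hypothetical per-frame bookkeeping for a two-frame sequence.
frames = [
    {"gt_objects": 4, "misses": 1, "false_positives": 0, "id_switches": 0,
     "match_distances": [0.8, 0.7, 0.9]},
    {"gt_objects": 4, "misses": 0, "false_positives": 1, "id_switches": 1,
     "match_distances": [0.6, 0.8, 0.7, 0.9]},
]
print(mota(frames), motp(frames))
```

The entries of `match_distances` can hold either a distance or a bounding box overlap, depending on the convention of the benchmark; the averaging is the same in both cases.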
The proposed solution was able to accurately associate detections to tracks and perform multiple-pedestrian tracking in thermal images regardless of the weather conditions or whether the object became occluded. By combining the data-driven and feature engineered scores, we ensured that the tracker could adapt to unknown traffic situations, thus becoming more robust.
To illustrate how much the proposed tracker improves the detection process, we define several metrics. We say that an object is correctly identified if its position differs from the ground truth position by at most 10 pixels (on the x or y axis). Precision is defined as the number of correctly identified objects divided by the total number of detected objects for a frame. Recall is the number of correctly identified objects divided by the total number of ground truth objects for that frame. The accuracy of the tracker and detector is defined as the ratio of the number of correctly identified objects to the total number of objects in the ground truth. The detector and the tracker were evaluated on over 100 sequences containing multiple objects, acquired in real traffic scenarios under different weather and lighting conditions.
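The per-frame evaluation can be sketched as follows, using the conventional definitions (precision relative to the detected objects, recall relative to the ground truth objects). This is a minimal illustration under stated assumptions: positions are (x, y) pixel coordinates, an object counts as correctly identified when both coordinates are within 10 pixels of the ground truth, and each ground truth object is matched greedily to at most one detection. The function names and the coordinates are hypothetical.

```python
def is_correct(detection, truth, tol=10):
    """True when both coordinates are within tol pixels of the ground truth (assumed criterion)."""
    return abs(detection[0] - truth[0]) <= tol and abs(detection[1] - truth[1]) <= tol


def frame_metrics(detections, ground_truth, tol=10):
    """Return (precision, recall) for one frame using greedy one-to-one matching."""
    unused = list(detections)
    correct = 0
    for truth in ground_truth:
        for det in unused:
            if is_correct(det, truth, tol):
                unused.remove(det)  # each detection can explain only one object
                correct += 1
                break
    precision = correct / len(detections) if detections else 0.0
    recall = correct / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Hypothetical frame: three reported objects, four annotated pedestrians.
detections = [(100, 200), (305, 398), (500, 500)]
ground_truth = [(104, 195), (300, 400), (50, 60), (700, 700)]
print(frame_metrics(detections, ground_truth))
```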
The evaluation presented in
Table 5 was performed on the CrossIR dataset introduced in [
21]. We performed this evaluation to illustrate the performance of the proposed algorithm in the presence of multiple objects, in various weather conditions. It is well known that object detectors may fail to detect some objects, either because they are occluded or because of the limited accuracy of the detector. In this evaluation we aimed to show that object tracking improves the overall detection of pedestrians by maintaining an identified object even when the object detector is not able to accurately identify a pedestrian.
The comparative evaluations are presented in
Table 6. The proposed method was built upon the base solution presented in [
21]. In
Table 6, we present an ablation study showing the performance of the base solution and of each proposed contribution individually. We also show that fusing the proposed data-driven and feature engineered costs, added to the base solution, improves the tracking performance in all the metrics presented below.
As can be seen, the proposed solution improved the performance of the object detector, leading to better overall results. Furthermore, it is worth noting that the feature engineered score can also be applied to other object classes, such as vehicles; however, illustrating this was outside the scope of this paper.