Sort and Deep-SORT Based Multi-Object Tracking for Mobile Robotics: Evaluation with New Data Association Metrics

Abstract: Multi-Object Tracking (MOT) techniques have been under continuous research and are increasingly applied in a diverse range of tasks. One area in particular concerns their application in navigation tasks of assistive mobile robots, with the aim of increasing the mobility and autonomy of people suffering from reduced mobility or severe motor impairments due to muscular, neurological, or osteoarticular conditions. Therefore, in this work, having in view navigation tasks for assistive mobile robots, an evaluation study of two MOT by detection algorithms, SORT and Deep-SORT, is presented. To improve the data association of both methods, which is solved as a linear assignment problem with a generated cost matrix, a set of new object tracking data association cost matrices based on intersection over union, Euclidean distances, and bounding box metrics is proposed. For the evaluation of MOT by detection in a real-time pipeline, YOLOv3 is used to detect and classify the objects present in the images. In addition, to perform the proposed evaluation aiming at assistive platforms, the ISR Tracking dataset, which represents the object conditions under which real robotic platforms may navigate, is presented. Experimental evaluations were also carried out on the MOT17 dataset. Promising results were achieved by the proposed object tracking data association cost matrices, showing an improvement in the majority of the MOT evaluation metrics compared to the default data association cost matrix. In addition, promising frame rates were attained by the pipeline composed of the detector and the tracking module.


Introduction
Vision-based Multi-Object Tracking (MOT) methods analyze image sequences to establish object correspondences over the images [1,2]. Multiple MOT methods have been proposed over the years and have been widely used in applications such as surveillance [3], traffic monitoring [4], autonomous driving [5], and mobile robot navigation, including object collision avoidance [6] and target following [7]. However, MOT results may be affected by difficult problem configurations due to crowded environments or occluded objects, which limits performance in such scenarios. Moreover, given the large number of applications where MOT methods can be applied, MOT remains an important and challenging topic in the research community [1,2,8].
Throughout the years, MOT tasks have mainly been performed under the tracking by detection paradigm [9], where objects are detected by an object detector and fed to the object tracking method, which then handles the object association between previous frames and the present one. Most proposed methods [10][11][12] use a Kalman Filter (KF) as a motion module to predict the position of objects of interest in the current frame. On the other hand, with the emergence of Deep Neural Networks (DNNs) [13,14], new state-of-the-art methods have been proposed for vision-based object tasks such as object classification [15], recognition [16], and tracking [11,17,18]. Therefore, to improve the object association step of tracking algorithms, Convolutional Neural Networks (CNNs) have been applied to extract object appearance features, which are used to compute similarity values between two objects' feature maps extracted over two consecutive images. CNNs have also been used to locate objects to track across consecutive images [19,20].
MOT techniques can be employed to improve the motion planning behavior and safety on the navigation tasks of mobile robot platforms [6]. MOT techniques can also be an asset on assistive platforms for target-following tasks, where the platform follows a specific target (e.g., following a caregiver, reaching an object).
Due to several types of impairments, a significant number of people are unable to perform daily tasks. Hence, a particular type of assistive mobile robot, the robotic wheelchair platform, has been researched with the aim of increasing the autonomy and mobility of such users [21,22]. Brain-actuated wheelchairs [21,23,24] have also received particular focus in research, with several promising techniques for severely motor-disabled people who are unable to control a robotic platform through conventional interfaces, such as a joystick [21,25]. With the advances in Brain-Computer Interfaces (BCI) and shared control methods, new paradigms of brain-computer interaction that allow users to choose their navigation target have been proposed. These new paradigms can represent potential goals of interest for the user's navigation (e.g., objects) and can be empowered by considering the tracked objects from MOT methods. Once the user selects a navigation target, a MOT method is required to ensure that the robotic wheelchair platform navigates towards that specific target. However, to enable a mobile robot to pursue an object as its navigation target, a robust visual perception module, including an object tracking method, is required. Moreover, to ensure robust object tracking performance, detection and tracking should be performed frame-by-frame, which is time-consuming and can prevent MOT from running in real time [9].
In this work, considering navigation tasks in assistive platforms, an evaluation study of two multi-object tracking by detection algorithms, SORT [10] and Deep-SORT [11], using new data association metrics [26], is proposed. SORT and Deep-SORT were proposed with a focus on real-time object tracking tasks, both achieving state-of-the-art results at a high frame rate. The two methods share the same overall architecture, divided into three main modules, as shown in Figure 1: KF-based estimation, data association, and track management. To detect objects in the images, the YOLOv3 [16] network is used. Both methods use the KF algorithm to predict the position of the objects in the current frame; these predictions, together with the object detections provided by YOLOv3, are the inputs of the data association module, which solves a linear assignment problem over an association cost matrix. The SORT method associates objects by matching measurements with predicted tracks using the overlap of their bounding boxes. To improve this bounding box association step, Deep-SORT additionally uses a CNN to extract appearance features from the object bounding box images. For a detailed evaluation of the object tracking methods, a set of different data association cost matrices based on bounding box intersection over union, Euclidean distances, and bounding box ratios is proposed. To evaluate both tracking methods with the proposed cost matrices in an assistive robotics context, the ISR Tracking dataset is proposed. The dataset represents the object conditions from an assistive mobile robot's point of view and contains 329 object sequences of 9 different object classes. To complement the validation of SORT and Deep-SORT with the proposed cost matrices, an evaluation was also performed on the MOT17 [27] dataset.
The main contributions of this work can be summarized as follows:
• Eight new object tracking data association cost matrix formulations based on intersection over union, Euclidean distances, and bounding box ratios are proposed.
• The ISR Tracking dataset, presenting a mission performed by a mobile robot in a lab setting, represents the object conditions under which robotic platforms may navigate. It is a rearrangement of the ISR RGB-D dataset [28] with object tracking labels for multi-object tracking tasks.
• An evaluation, having in view navigation tasks for assistive mobile robot platforms, of two multi-object tracking by detection algorithms, SORT and Deep-SORT, is also presented. The proposed new data association cost matrices were integrated and evaluated on both tracking methods.

Object Tracking
Object tracking techniques have become a fundamental task in real-time video-based applications that require establishing object correspondences between frames [8]. In the literature, proposed tracking techniques fall into two main categories [29]: Single-Object Tracking (SOT) and MOT. In SOT approaches, the appearance of the single target is known a priori, while in MOT techniques, the aim is to estimate the trajectories of multiple objects of one or more categories without any prior knowledge about their appearance or location. For MOT, an object detection step is required across frames [1]. According to [1], applying multiple SOT models to perform MOT tasks generally leads to poor performance, often caused by similarly looking intra-class objects.
Recent advances in the MOT literature have focused on two different approaches: tracking by detection and joint tracking and detection. Tracking by detection [10][11][12][30], as presented in Figure 1, makes use of object detection algorithms to detect and classify objects before performing the object association. This approach simplifies the tracking task into an object association task over consecutive frames. Methods receive an array of measurements and output bounding boxes with their respective tracking IDs. On the other hand, joint tracking and detection methods [9,17,19,20] are able to detect and track objects in a single model. Generally, this approach uses the visual appearance features of the object to track and locate it in the frames of interest. Joint tracking and detection techniques have become widely popular due to the emergence of deep learning-based Siamese Networks [18,31].
Despite the promising results achieved by the joint tracking and detection approaches, for navigation tasks in assistive mobile robot platforms, an object detector method can already be available to provide knowledge of the surrounding environment for motion planning or localization methods. Hence, for the purpose of this work, tracking by detection methods are more suitable.

Tracking by Detection
With the emergence of deep learning-based object detectors, tracking by detection has become the most popular approach in the MOT research community [2]. This approach takes the benefit of object location knowledge to generate an association model that would be able to associate objects over time. One of the first MOT methods found in the literature is Multiple Hypothesis Tracking [32], which calculates hypotheses over measurements to estimate if an object should be associated to a track, be considered as a new track, or if it is a mis-measurement. It uses the KF algorithm to estimate the object's states and a probabilistic distribution over hypotheses to associate measurements to tracks.
Recent works also employ the KF algorithm, as a motion model, to improve the association of objects over time [10][11][12][33]. Bewley et al. [10] proposed SORT, which is composed of a KF to estimate object states and the Hungarian [34] algorithm to associate the KF predictions with new object detections. A year later, Wojke et al. proposed an improvement of SORT, Deep-SORT [11], by including a novel cascading association step that uses CNN-based object appearance features. The data association algorithm combines the similarity of the object appearance features with the Mahalanobis distance between object states and, at a later stage for unmatched states, uses SORT's data association. Despite the usage of a CNN, the Deep-SORT method achieved a promising frame rate on the object tracking benchmarks. A method similar to Deep-SORT was proposed by Chen et al., MOTDT [12]. MOTDT uses a fully CNN-based scoring function for an optimal selection of candidates. Euclidean distances between extracted object appearance features are also used to improve the association step. Recently, He et al. [33] proposed the GMT-CT algorithm, which incorporates graph partitioning with deep feature learning. The graph is constructed from the extracted object appearance features and is used in the association step to model the relationship between measurements and tracks with higher accuracy.
With the growth of deep learning-based Siamese networks in the object tracking community, a new paradigm has been proposed [1]. Lee et al. [35] introduced FPNS-MOT, which integrates a Siamese architecture with a feature pyramid network [36]. It computes a similarity vector between features from two different inputs and then updates tracks using an iterative selection of the maximum-scored pair of tracks and measurements. FPNS-MOT outperformed the aforementioned methods on the MOT challenge benchmarks [27] with an inference time of 10 Hz. Jin et al. [37] enhanced the performance of the Deep-SORT [11] object feature extractor with a Siamese architecture. In addition, they introduced optical flow [38] in the motion module, improving the object association accuracy.
In summary, Table 1 presents the main characteristics of the aforementioned tracking by detection MOT methods.

Tracking Applied in Mobile Robots
Object tracking techniques have been widely applied for navigation tasks in indoor mobile robot platforms, such as object collision avoidance [6], target following [7], and autonomous navigation [5]. Target detection and tracking have also been applied in robotic wheelchair platforms [39,40], which have been proposed to increase the mobility of people with motor impairments. Xiao et al. [39] proposed a visual-target detection and tracking method to detect and track people in the surroundings of an intelligent wheelchair. The visual tracking was implemented as a binary classification between the object and the background, and a semi-supervised online boosting approach was applied to solve the object drift problem. On the other hand, Lecrosnier et al. [40] proposed an advanced driver assistance system for a robotic wheelchair composed of the YOLOv3 [16] object detection algorithm and a 3D object tracking approach based on SORT [10] to detect and track doors and door handles.

Methodology
In this section, a brief review of the SORT and Deep-SORT methods is presented. The proposed cost matrix formulations, which are part of the data association's linear assignment problem, inside the Cost Matrix Matching module (see Data Association-Cost Matrix Matching in Figures 2 and 3), are also presented.

SORT
SORT [10] iteratively computes the state of the objects being tracked through a KF. The method uses the Hungarian algorithm [34] to accurately associate detected objects (by an object detector) with objects that are being tracked. A detailed overview of the SORT algorithm is represented in Figure 2.

The SORT Data Association module, which is of particular interest in this work, is responsible for matching the KF's predicted bounding boxes with the measured bounding boxes on the image, given by the object detector. This module receives, as input, N detected bounding boxes and M predicted bounding boxes (acquired from their respective KFs), respectively D_i, i ∈ {1, ..., N} and P_j, j ∈ {1, ..., M}. The module formulates a linear assignment problem by computing a cost matrix between each detected bounding box and all predicted bounding boxes, with the Intersection over Union (IoU) as metric,

IoU(D, P)_{ij} = IoU(D_i, P_j),

where the IoU between a detected bounding box and a predicted bounding box is given by

IoU(D_i, P_j) = |D_i ∩ P_j| / |D_i ∪ P_j|,

i.e., the area of the intersection of the two boxes over the area of their union. After computing the cost matrix, the Hungarian algorithm [34] is used to associate the bounding boxes. The obtained associations are represented as an N × 2 array, mapping the N measurements to their associated tracks. Associations are also filtered by considering a minimum IoU threshold, discarding associations with an IoU lower than the threshold.
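The association step above can be sketched in a few lines of Python. This is an illustrative implementation (function names and the `iou_min` default are our own), using SciPy's `linear_sum_assignment` as the Hungarian solver and axis-aligned boxes in (x1, y1, x2, y2) format:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(boxA, boxB):
    # Boxes as (x1, y1, x2, y2): intersection area over union area.
    xA, yA = max(boxA[0], boxB[0]), max(boxA[1], boxB[1])
    xB, yB = min(boxA[2], boxB[2]), min(boxA[3], boxB[3])
    inter = max(0.0, xB - xA) * max(0.0, yB - yA)
    areaA = (boxA[2] - boxA[0]) * (boxA[3] - boxA[1])
    areaB = (boxB[2] - boxB[0]) * (boxB[3] - boxB[1])
    return inter / (areaA + areaB - inter)

def associate(detections, predictions, iou_min=0.3):
    # N x M cost matrix between detections and KF-predicted boxes.
    cost = np.array([[iou(d, p) for p in predictions] for d in detections])
    # The Hungarian solver minimizes, so negate the IoU matrix to maximize it.
    rows, cols = linear_sum_assignment(-cost)
    # Discard associations below the minimum IoU threshold.
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] >= iou_min]
```

A detection that overlaps no prediction above `iou_min` simply receives no match and would later spawn a new track in the Track Management module.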
The KF Estimation module uses a linear constant-velocity model to represent each object's motion. When a detected object is associated with a tracked object (track), its bounding box is used to update the track's state. If no object is associated with the track, the track's state is only predicted. The Track Management module is responsible for the creation and deletion of tracks. New tracks are created when detections do not overlap with any track, or overlap below a minimum IoU threshold. The bounding box of the detection is used to initialize the KF state. Since the only data available are the object's bounding boxes, the object's velocity in the KF is set to zero and its covariance is set high to signal the uncertainty of the state. If a new track does not receive associations, or if an existing track stops receiving them, the track is deleted, to avoid maintaining tracks for false positives or for objects that left the scene, respectively.
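A minimal constant-velocity KF for one track can be sketched as below. The state layout [u, v, s, r, u̇, v̇, ṡ] (center, area s, aspect ratio r, and their velocities) follows SORT's description; the noise magnitudes here are illustrative assumptions, not the values used by the authors:

```python
import numpy as np

class TrackKF:
    """Constant-velocity Kalman filter sketch for one SORT-style track."""

    def __init__(self, z):
        dim = 7
        self.x = np.zeros(dim)
        self.x[:4] = z                                   # init position from detection
        self.P = np.eye(dim)
        self.P[4:, 4:] *= 1000.0                         # high uncertainty on velocities
        self.F = np.eye(dim)
        for i in range(3):
            self.F[i, i + 4] = 1.0                       # u, v, s evolve with velocity
        self.H = np.eye(4, dim)                          # only [u, v, s, r] is observed
        self.Q = np.eye(dim) * 1e-2                      # process noise (illustrative)
        self.R = np.eye(4)                               # measurement noise (illustrative)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:4]

    def update(self, z):
        y = z - self.H @ self.x                          # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)         # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(7) - K @ self.H) @ self.P
```

With the velocity covariance initialized high, the first few updates pull the state strongly towards the measurements, exactly the behavior described above for freshly created tracks.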

Deep-SORT
Deep-SORT [11] is an improvement of the SORT algorithm that integrates object appearance information to enhance associations. The data association integrates an additional appearance metric, based on a pre-trained CNN, allowing the re-identification of tracks after long periods of occlusion. The KF Estimation and Track Management modules are similar to the corresponding SORT modules. An overview of the method is presented in Figure 3.
As in SORT, the association of detected bounding boxes to tracks is solved by the Hungarian algorithm, using a two-part matching cascade. In the first part, the Deep-SORT method uses motion and appearance metrics to associate valid tracks. The second part uses the same data association strategy as SORT to associate unmatched and tentative (recently created) tracks with unmatched detections. Motion information is incorporated through the (squared) Mahalanobis distance between predicted states and detections. In addition to the Mahalanobis metric, a second metric based on the smallest cosine distance measures the distance between each track's and each measurement's appearance features. The appearance features are computed by a pre-trained CNN model. The CNN in the Deep-SORT method was trained on a large-scale person re-identification dataset [41] using deep cosine metric learning [42]. A pre-trained model is provided by the authors in their repository (https://github.com/nwojke/deep_SORT (accessed on 15 October 2021)).
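The appearance part of the cascade can be illustrated with the following sketch: each track keeps a gallery of past CNN embeddings, and the cost to a detection is the smallest cosine distance over that gallery. The gallery handling and function name are simplifications of our own, not Deep-SORT's exact implementation:

```python
import numpy as np

def cosine_distance_matrix(track_galleries, det_feats):
    """For each track (a gallery of past embeddings) and each detection
    embedding, return the smallest cosine distance over the gallery."""
    def l2_normalize(a):
        return a / np.linalg.norm(a, axis=1, keepdims=True)

    dets = l2_normalize(np.asarray(det_feats, dtype=float))
    d = np.empty((len(track_galleries), len(dets)))
    for i, gallery in enumerate(track_galleries):
        g = l2_normalize(np.asarray(gallery, dtype=float))
        # cosine distance = 1 - cosine similarity; keep the minimum over the gallery
        d[i] = (1.0 - g @ dets.T).min(axis=0)
    return d
```

In Deep-SORT this matrix is gated with the Mahalanobis distance and thresholded before the Hungarian assignment, so only matches plausible under both motion and appearance survive.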

Data Association-Cost Matrix Matching
In this work, eight cost matrix formulations (see Cost Matrix Matching in Figures 2 and 3) are proposed. As aforementioned, the data association in SORT, and also in the second stage of Deep-SORT, is formulated as a linear assignment problem represented by a cost matrix. Hence, the different approaches to formulate the cost matrices for the linear assignment problem with different bounding box metrics are presented.
Intersection over union quantitatively represents the overlap between objects' bounding boxes, which indirectly captures other types of information, such as the Euclidean distance between two bounding boxes and their ratios. In MOT problems, such information can be useful to improve the data association since, between two consecutive frames, an object is expected to have similar bounding box dimensions and a small displacement. Therefore, object tracking data association cost matrix formulations based on intersection over union, Euclidean distances, and bounding box ratios are proposed. Let us consider a bounding box represented by the image coordinates of its center (u_BB, v_BB) and its width and height (w_BB, h_BB), the detection set D (with N bounding boxes), and the prediction set P (with M bounding boxes). The following cost matrix formulations are proposed:

1. Euclidean distance based cost matrix (D_E(D, P)): represents the distance between bounding box central points, normalized by half of the image dimension. To formulate the problem as a maximization problem, to be solved using the Hungarian algorithm, each entry is the difference between 1 and the normalized Euclidean distance, as follows:

D_E(D_i, P_j) = 1 − sqrt( (u_{D_i} − u_{P_j})^2 + (v_{D_i} − v_{P_j})^2 ) / ( (1/2) sqrt(h^2 + w^2) ),

where (h, w) are the height and width of the input image, D_i is a bounding box from the detection set, and P_j is a bounding box from the prediction set.

2. Bounding box ratio based cost matrix (R(D, P)): implemented as a ratio between the products of each width and height:

R(D_i, P_j) = min( (w_{D_i} h_{D_i}) / (w_{P_j} h_{P_j}), (w_{P_j} h_{P_j}) / (w_{D_i} h_{D_i}) ).

For boxes with similar shapes, the area ratio is close to 1, in contrast to values close to 0 or much greater than 1 otherwise. For that reason, the minimum between the bounding box ratio and its inverse is taken, so that the value stays within the [0, 1] range.

3. SORT's IoU cost matrix combined with the Euclidean distance cost matrix:

E_{IoU_D}(D, P) = IoU(D, P) ∘ D_E(D, P),

where ∘ represents the Hadamard product (element-wise product) between two matrices.

4. SORT's IoU cost matrix combined with the box ratio based cost matrix:

R_IoU(D, P) = IoU(D, P) ∘ R(D, P).

5. Euclidean distance cost matrix combined with the box ratio based cost matrix:

E_{D_R}(D, P) = D_E(D, P) ∘ R(D, P).

6. SORT's IoU cost matrix combined with the Euclidean distance cost matrix and the box ratio based cost matrix:

M(D, P) = IoU(D, P) ∘ D_E(D, P) ∘ R(D, P).

7. Element-wise average of every cost matrix (A(D, P)):

A(D, P) = ( IoU(D, P) + D_E(D, P) + R(D, P) ) / 3.

8. Element-wise weighted mean of every cost matrix value (W_M(D, P)):

W_M(D, P) = λ_IoU IoU(D, P) + λ_{D_E} D_E(D, P) + λ_R R(D, P),

where λ_IoU, λ_{D_E}, and λ_R are constant weights that control the influence of each cost matrix.

To improve the tracking performance in multi-class environments, the cost matrices can be updated based on the match between the predicted and detected object classes (class gate): entries whose predicted and detected object classes differ are set to zero, so that those associations are discarded.
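A vectorized NumPy sketch of these formulations could look as follows; boxes are rows of (u, v, width, height), the half-diagonal normalization of D_E follows our reading of the definition above, and the default weights are illustrative:

```python
import numpy as np

def euclidean_cost(D, P, h, w):
    """D_E: 1 minus the center distance normalized by half the image diagonal.
    D, P: arrays of boxes as rows of (u, v, width, height)."""
    diff = D[:, None, :2] - P[None, :, :2]           # pairwise center offsets
    dist = np.linalg.norm(diff, axis=2)
    return 1.0 - dist / (0.5 * np.hypot(h, w))

def ratio_cost(D, P):
    """R: minimum of the area ratio and its inverse, always in [0, 1]."""
    areaD = (D[:, 2] * D[:, 3])[:, None]
    areaP = (P[:, 2] * P[:, 3])[None, :]
    r = areaD / areaP
    return np.minimum(r, 1.0 / r)

def combined_costs(iou_mat, de_mat, r_mat, weights=(0.7, 0.2, 0.1)):
    """Hadamard combinations, the element-wise average A, and the
    weighted mean W_M (weights here are illustrative defaults)."""
    lam_iou, lam_de, lam_r = weights
    return {
        "E_IoU_D": iou_mat * de_mat,
        "R_IoU": iou_mat * r_mat,
        "M": iou_mat * de_mat * r_mat,
        "A": (iou_mat + de_mat + r_mat) / 3.0,
        "W_M": lam_iou * iou_mat + lam_de * de_mat + lam_r * r_mat,
    }

def class_gate(cost, det_classes, pred_classes):
    """Zero-out entries whose detected and predicted classes differ."""
    gate = det_classes[:, None] == pred_classes[None, :]
    return cost * gate
```

Since every matrix lives in [0, 1] with 1 meaning "best match", any of them (or their combinations) can be dropped directly into the maximization-form assignment step described earlier.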

ISR Tracking Dataset
The ISR RGB-D Dataset [28] is a non-object centric RGB-D dataset, recorded at the Institute of Systems and Robotics (ISR-UC) facilities using a camera sensor onboard the ISR-InterBot [43] mobile platform. The dataset presents a mission performed by the platform in a real scenario setting, representing object conditions under which mobile robot platforms may navigate. The ISR RGB-D dataset contains a total of 10,000 RGB-D raw images captured at 30 FPS with a resolution of 640 × 480. Moreover, ten object classes (unknown, person, laptop, tvmonitor, chair, toilet, sink, desk, door-open, and door-closed) were annotated at every fourth frame, reaching a total of 7832 object-centric images.
As aforementioned, the main goal of this work is to study and compare the KF-based SORT and Deep-SORT object tracking methods to be applied in real-time mobile robot applications. To pursue that goal, a dataset representing the object conditions from a mobile robot platform's point of view during its navigation tasks is required. Due to the lack of publicly available datasets meeting such requirements, the labels of the ISR RGB-D Dataset (https://github.com/rmca16/ISR_RGB-D_Dataset (accessed on 15 October 2021)) were rearranged to be used as a multi-object tracking dataset, the ISR Tracking Dataset. First, the labels for the remaining images were annotated for the described ten object classes. Then, a unique tracking ID was associated with the same object throughout the images, except for the "unknown" object class, which was not considered for tracking tasks. If an object disappeared or was occluded for more than 15 frames, it was considered a new object, and a new tracking ID was assigned. Each image has an associated ".txt" file that contains all object labels for that image, and each object label is organized as follows: <object class>, <tracking ID>, <bounding box center x>, <bounding box center y>, <bounding box width>, and <bounding box height>. The ISR Tracking dataset has, in total, 32,635 object bounding boxes and 329 object sequences.
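A hypothetical parser for one such label file is sketched below; the field order follows the description above, while the comma delimiter, whitespace handling, and numeric types are assumptions of this sketch:

```python
def parse_label_file(path):
    """Parse one ISR Tracking '.txt' label file into a list of dicts.
    Expected field order per line: class, tracking ID, center x, center y,
    box width, box height (delimiter assumed to be a comma)."""
    objects = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            cls, tid, cx, cy, bw, bh = line.strip().split(",")[:6]
            objects.append({
                "class": cls.strip(),
                "track_id": int(tid),
                "center": (float(cx), float(cy)),
                "size": (float(bw), float(bh)),
            })
    return objects
```

Grouping the parsed objects by `track_id` across consecutive files directly yields the per-object sequences used as tracking ground truth.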

Experiments
The proposed study was evaluated on the MOT17 [27] dataset and also on the proposed ISR Tracking dataset. Moreover, to evaluate the proposed approaches on the used KF-based algorithms, the following standard evaluation metrics [1] were used: Multi-Object Tracking Accuracy (MOTA), Multi-Object Tracking Precision (MOTP), True Positives (TP), False Positives (FP), False Negatives (FN), Identity Switches (IDs), Fragmentations (FM), Mostly Tracked (MT), and Mostly Lost (ML) sequences.

Datasets
(1) MOT17 Dataset: A multi-person tracking benchmark dataset divided into 14 sequences with highly crowded scenarios, different viewpoints, weather conditions, camera motions, and indoor/outdoor environments. The dataset contains a public training/test split, where the training sequences have ground-truth files and detection files provided by three state-of-the-art object detection methods, while the test sequences only have the detection files. Hence, due to the scope of the performed experiments, and also due to the submission constraints for obtaining results on the test sequences, only the training sequences were used in this study. Since the multi-object tracking methods evaluated in this work do not require a training process, the training sequences were used for evaluation.
(2) ISR Tracking Dataset: It is composed of 10,000 RGB-D raw images acquired by an Intel RealSense D435 sensor onboard a mobile robot platform [43], representing the object conditions under which robotic platforms may navigate. Nine object classes were annotated for multi-object tracking tasks, achieving a total of 32,635 object bounding boxes and 329 object sequences. For evaluation, the ISR Tracking dataset was reorganized into two sub-datasets: ISR500 and ISR200. In the ISR500, the dataset was divided into sequences of 500 frames, which gives a total of 20 image sequences. On the other hand, the ISR200 contains 50 image sequences, which are the result of partitioning the dataset into sequences of 200 images. On both sub-datasets, the train/test image sequence split was performed by interleaving the sequences, i.e., the first sequence was used to train, the second sequence was used to test, the third sequence was used to train, and so on.
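The interleaved split described above can be sketched as (function name and index-based representation are our own):

```python
def interleaved_split(num_frames, seq_len):
    """Partition frame indices into fixed-length sequences and alternate
    them between train and test, as described for ISR500/ISR200."""
    seqs = [list(range(s, min(s + seq_len, num_frames)))
            for s in range(0, num_frames, seq_len)]
    train = seqs[0::2]   # 1st, 3rd, 5th, ... sequences
    test = seqs[1::2]    # 2nd, 4th, 6th, ... sequences
    return train, test
```

For the 10,000-frame dataset, `seq_len=500` yields the 20 ISR500 sequences (10 train, 10 test) and `seq_len=200` the 50 ISR200 sequences, matching the counts given above.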

Implementation Details
All modules were implemented in the Python 3.8.5 programming language. Deep learning networks were implemented using the PyTorch framework (version 1.8.0). The YOLOv3 network was trained using an image size of 416 × 416, a fixed learning rate of 10^-4 over 50 iterations, a mini-batch of 6 images, and the ADAM optimizer. In addition, the YOLOv3 weights were initialized with the COCO pre-trained model. For the evaluations of the SORT method [10], the number of frames to hold a track without associations before deleting it was set to T_Lost = 1, the minimum number of object detections to start a new track was set to hit_min = 3, and the minimum threshold value for bounding box association was set to th_cost = 0.3. For Deep-SORT, the following constant values were used: λ = 0 (a hyperparameter that controls the influence of each metric on the association cost), T_Lost = 30, and an association gating threshold of dist_max = 0.2. Moreover, all experiments were performed using an Nvidia RTX 2060 Super GPU, 32 GB of RAM, and an AMD Ryzen 5 3600 CPU.

Results
The evaluation of the proposed work was divided as follows: evaluation of SORT and Deep-SORT on both the MOT17 and ISR Tracking datasets using all the available frames (ideal conditions); evaluation of SORT and Deep-SORT on the ISR Tracking dataset while skipping frames, representing real conditions where it is not possible to process at the default 30 FPS; and evaluation of the whole pipeline, YOLOv3 + object tracking method, also assessing the influence that YOLO's detection performance has on the object tracking method.

SORT and Deep-SORT Evaluation
The proposed W_M data association cost matrix formulation requires the selection of three constant values (weights) to control the influence of each data association cost matrix. Table 2 shows the evaluation performed on the MOT17 dataset with different weight value combinations of the W_M cost matrix. Based on the achieved results, the highest MOTA value was attained using λ_IoU = 7/10, λ_{D_E} = 2/10, and λ_R = 1/10. This weight configuration has the minimum number of FP and IDs, despite the higher number of FN. Furthermore, it has the minimum number of FM by a large margin. Therefore, throughout the following evaluations, these values were used for the W_M cost matrix. The bold value highlights the best value in each column (in this case, each MOT evaluation metric). Table 3 shows the results achieved on the MOT17 dataset. Regarding SORT's results, the highest MOTA was obtained using the default IoU cost matrix, with similar results achieved by the E_{IoU_D}, R_IoU, M, and W_M cost matrices. However, on the remaining evaluation metrics, the default IoU cost matrix was outperformed by the proposed cost matrices. The M cost matrix had the lowest number of FP, IDs, and FM, which represents the most accurate tracking sequences generated by SORT. The A cost matrix had the highest number of TP and the lowest number of FN, which is proportional to the percentage of MT sequences. For this work, which has in view mobile robot navigation tasks, these metrics can impact performance, as they can ensure that an object is successfully tracked until it leaves the scene. Regarding the Deep-SORT results, the best MOTA was achieved by the proposed W_M cost matrix with 45.67%. The W_M cost matrix reached the best results for the TP and MT evaluation metrics. The default IoU cost matrix achieved the best MOTP, FP, IDs, and FM results, which are very similar to those attained by the proposed W_M cost matrix.
Overall, promising results were achieved by the proposed cost matrices, which were able to outperform the default IoU cost matrix. Moreover, Deep-SORT with the W_M cost matrix was able to obtain the highest MOTA and MT. The attained results show similar overall performances between SORT and Deep-SORT. However, as expected, SORT is much faster than Deep-SORT.
An evaluation of SORT and Deep-SORT in which the data association threshold is modified was also performed on the MOT17 dataset, with the results presented in Figure 4. As expected, as the threshold value increased, the MOTA score decreased for the majority of the cost matrices. No single threshold value was found to be suitable for all evaluated cost matrices. Hence, the best results were obtained using a threshold value of 0.3, which was thereafter used for all the evaluations. Table 4 presents the results attained on the ISR Tracking dataset. Due to the multiple object classes available in the ISR Tracking dataset, the SORT algorithm was evaluated with and without the class gate metric, which discards associations of objects with different object classes. Regarding SORT's results, similar to those reported on the MOT17 dataset, the proposed data association cost matrices outperformed the default IoU cost matrix. Moreover, the results of all evaluated data association cost matrices were slightly improved by using the class gate formulation, which reached the highest MOTA with 91.02%. The A cost matrix using the class gate formulation achieved the best results on TP, FN, IDs, and MT with 29,785, 2799, 51, and 69.3%, respectively. The A data association cost matrix presents a significant improvement on the MT evaluation metric compared with the IoU cost matrix (61.7% to 69.3%), which can impact the performance of a mobile robot platform during navigation tasks. Regarding Deep-SORT's results, once again, the proposed data association cost matrices outperformed the default IoU cost matrix. Moreover, the A cost matrix achieved the highest MOTA and MT values, while the E_{IoU_D} matrix achieved the best TP, FN, IDs, and ML results. A significant improvement of the MT evaluation metric was attained with the A cost matrix.
Overall, on both the SORT and Deep-SORT algorithms, the proposed data association cost matrices outperformed the default IoU cost matrix. The A cost matrix achieved the highest values on the MOTA and MT evaluation metrics, showing that it may be the most suitable data association cost matrix to use. Regarding those evaluation metrics, Deep-SORT outperformed the SORT algorithm, with a highlight on the MT evaluation metric (69.3% to 78.7%). Moreover, promising results were reached by both methods on the ISR Tracking dataset. For the following evaluations, based on the reported results, only the following data association cost matrices using the class gate metric were used: IoU, A, and W_M on the SORT algorithm, and IoU, E_{IoU_D}, and A on the Deep-SORT algorithm.

SORT and Deep-SORT on Skipped Frames
In real scenarios, due to hardware constraints, it is not always possible (or necessary) to run the algorithms at 30 FPS, a standard image acquisition rate for cameras. Hence, to evaluate the tracking performance under such conditions, experiments were performed by skipping 1, 2, and 3 images, emulating image acquisition at 15, 10, and 7.5 FPS, respectively. Table 5 shows the SORT and Deep-SORT results attained on the ISR Tracking dataset using non-consecutive frames. As expected, as the image gap increased, the object tracking performance decreased. This happens because the objects undergo larger displacements between frames, which makes predicting and associating them more difficult. Nevertheless, promising results were achieved by the proposed A data association metric on both tracking methods, outperforming the default IoU data association metric, especially on the MOTA, IDs, and MT evaluation metrics. The best overall performance was reached by the SORT method with the proposed A data association cost matrix, with an accuracy of 86.43% and 58.2% of mostly tracked object sequences. The SORT method using the IoU data association metric attained the best MOTP and FP results, while the Deep-SORT method with the proposed A data association metric achieved the best TP and FN results. Note that, under these conditions, a significant improvement was achieved by the A cost matrix compared to the IoU association metric, showing its capacity to hold the object track.
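The frame-skipping protocol above can be sketched as follows; `subsample` and `effective_fps` are illustrative helper names, assuming a 30 FPS source sequence as in the paper.

```python
def subsample(frames, skip):
    """Keep every (skip + 1)-th frame of a sequence, dropping `skip`
    frames after each kept one (skip = 1, 2, 3 in the experiments)."""
    return frames[::skip + 1]

def effective_fps(base_fps, skip):
    """Acquisition rate obtained when `skip` frames are dropped per kept
    frame; for a 30 FPS source this yields 15, 10, and 7.5 FPS."""
    return base_fps / (skip + 1)
```

The tracker is then run only on the sub-sampled sequence, so the inter-frame object displacement grows with `skip`, which is what degrades the association step.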

Detection-Based MOT Pipeline
To evaluate the performance of the SORT and Deep-SORT object tracking methods in real scenarios, an evaluation was performed using the YOLOv3 object detector to feed the tracking methods. Moreover, to evaluate the influence that the object detector's performance may have on the object tracking performance, four YOLOv3 models with different performances were used. The four YOLOv3 models were trained on the same data (ISR RGB-D Dataset) and under the same conditions, varying only the number of training epochs. The models, Y_M1, ..., Y_M4, have mean average precisions of 38%, 60%, 80%, and 90%, respectively. Table 6 presents the detection-based MOT pipeline results achieved on the ISR200 sub-dataset. As expected, YOLOv3's performance had a significant role in the overall pipeline: as the YOLOv3 performance increased, the object tracking performance also increased. With a poorly performing YOLOv3 model, the number of FN was so high, especially for the Deep-SORT method, that a negative accuracy (MOTA) was obtained. Regardless of YOLOv3's performance, under these conditions, SORT outperformed the Deep-SORT method. Moreover, the three data association cost matrices used with the SORT method reached similar results, with the default IoU cost matrix achieving the best MOTA and FP results, while the A data association metric obtained the best MT values. Note that using an object detector may introduce additional errors into the object tracking pipeline, such as incorrect detections, shifted detections, missed detections, and wrong object classifications. This can be observed in the obtained TP, FP, FN, and IDs values, which directly influence the remaining evaluation metrics. As shown in Table 6, the object tracking performance increases as YOLOv3's performance increases, owing to a large decrease in the FP and IDs values brought about by the improved object detection.
Regarding the frame rate results, as expected, SORT was faster than Deep-SORT, since SORT does not have to extract visual features through a CNN. Table 7 presents the detection-based MOT pipeline results achieved on the ISR500 sub-dataset. Once again, the performance of YOLOv3 is crucial for a promising object tracking performance. Overall, similar to the results attained on the ISR200 sub-dataset, the SORT method obtained the best results. However, the Deep-SORT method reached the best FN, MT, and ML values, showing that Deep-SORT may be more suitable for tracking longer object sequences. This stems from Deep-SORT's greater capability to re-identify lost object sequences compared with SORT, which struggles to predict the position of an object once its track starts to be missed. As observed in the previous evaluations, the A cost matrix, with both SORT and Deep-SORT, achieved the best MT result, meaning that the object sequence is tracked in at least 80% of its life span, which is very important for successfully performing mobile robot navigation tasks.
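The detection-based pipeline evaluated above, including the frame-rate measurement, can be sketched as below. `detect` and `track_update` are placeholders for the YOLOv3 inference and the SORT/Deep-SORT update step; their names and the detection tuple format are assumptions for illustration, not the authors' code.

```python
import time

def run_pipeline(frames, detect, track_update):
    """Detection-based MOT pipeline: each frame goes through the object
    detector, and the resulting detections feed the tracker's update step.
    Returns the per-frame track outputs and the average pipeline FPS."""
    outputs = []
    start = time.perf_counter()
    for frame in frames:
        detections = detect(frame)        # e.g., [(box, class_id, score), ...]
        outputs.append(track_update(detections))
    elapsed = time.perf_counter() - start
    return outputs, len(frames) / max(elapsed, 1e-9)
```

Because the detector runs on every frame, its errors (missed, shifted, or misclassified detections) propagate directly into the tracker's TP, FP, FN, and IDs counts, which is consistent with the dependency on YOLOv3 performance observed in Tables 6 and 7.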

Conclusions
In this paper, having in view navigation tasks for assistive mobile robot platforms, an evaluation study of two MOT-by-detection algorithms, SORT and Deep-SORT, was presented. Moreover, eight new tracking data association metrics based on intersection over union, Euclidean distances, and bounding box ratios were proposed. To evaluate both tracking methods with the proposed data association metrics, the ISR Tracking dataset, which represents the object conditions from an assistive mobile robot's point of view, was also proposed. The presented pipeline uses the YOLOv3 network to detect and classify the objects in RGB images, feeding the tracking algorithm. Promising results were attained by the majority of the proposed tracking data association metrics on both SORT and Deep-SORT. Overall, based on the performed experiments, the SORT method achieved higher accuracy and precision, while the Deep-SORT method obtained the best FN, IDs, and MT values. Moreover, the proposed A data association metric achieved the best performance with both evaluated object tracking methods, showing a significant improvement on the MT evaluation metric, which can be crucial for successful navigation tasks on robotic platforms. The results also showed, as expected, that the overall object tracking performance depends strongly on the object detector's performance. SORT is faster than Deep-SORT, reaching 50 FPS in the overall pipeline (YOLOv3 + SORT). Therefore, considering navigation tasks on assistive platforms, and also considering the issues associated with an object detector algorithm, the SORT method using the A data association metric obtained more robust results and, as such, can be the more suitable approach.
As future work, we intend to integrate the presented pipeline into the RobChair [21] platform for assistive navigation tasks.

Conflicts of Interest:
The authors declare no conflict of interest.