Directional Statistics-Based Deep Metric Learning for Pedestrian Tracking and Re-Identification

Abstract: Multiple Object Tracking (MOT) is the problem of following the trajectories of multiple objects in a sequence, generally a video. Pedestrians are among the most interesting subjects to track and recognize for purposes such as surveillance and safety. In recent years, Unmanned Aerial Vehicles (UAV's) have been viewed as a viable option for monitoring public areas, as they provide a low-cost method of data collection while covering large and difficult-to-reach areas. In this paper, we present an online pedestrian tracking and re-identification framework based on learning a compact directional statistic distribution (von Mises-Fisher distribution) for each person ID using a deep convolutional neural network. The distribution characteristics are trained to be invariant to clothing appearance and to transformations including rotation, translation, and background changes. Learning


Introduction
In artificial intelligence, Multiple Object Tracking (MOT) refers to the task of locating objects in a scene and maintaining their trajectories throughout the entire video. It is an essential task for a broad range of computer vision applications such as surveillance and autonomous driving. Pedestrians are among the most interesting subjects to track in public areas for purposes such as safety and security. Consequently, pedestrian tracking has recently received a lot of research attention [1,2].
In general, MOT follows the tracking-by-detection paradigm. State-of-the-art methods in this paradigm fall into two main categories: online tracking [3][4][5][6][7] and batch tracking [8][9][10][11]. Online tracking associates object instances between the previous frame and the current frame of a video, whereas batch tracking uses all video frames to form a unique trajectory per ID.
The tracking association task requires a similarity measure between object instances. This similarity is based on two types of features: motion properties and visual features. The visual features are extracted from the appearance of the object instance inside the bounding box using a feature extractor. Thus, the feature extractor plays an important role in the tracking performance of the model. Recently proposed MOT approaches [4,7,9] rely heavily on training a neural-network-based extractor. These approaches make use of advances in Deep Metric Learning (DML) [12][13][14][15] to learn powerful visual representations. These neural network extractors embed an input image onto a sphere with unit norm. The choice of a directional space is motivated by the assumption that the direction of the data provides more information than its magnitude; such data is referred to as directional data. However, in most MOT algorithms, patch representation learning is largely based on contrastive supervised learning. These approaches rely on time-consuming mini-batch formats such as triplets or N-pairs, which are incompatible with most MOT learning algorithms.
Machine learning-based person re-identification applications aim to retrieve pedestrians by their identity for many security and safety tasks. Most of these applications learn a mapping function that embeds images into a compact Euclidean space (mostly the unit sphere) [16,17]. These embeddings should have the property that two images of the same person map to adjacent feature points, while images of two distinct persons map to feature points that are far apart. Ideally, these mappings should be robust to real-world situational changes such as position, orientation, and occlusion in the same scene. They should also not rely on a person's clothes, since pedestrians wear different clothes at different times.
Most machine learning tracking and re-identification applications are heavily affected by the image acquisition system, such as static cameras, and by the cost of collecting data. Unmanned aerial vehicles (UAV's), which enable a low-cost way of data collection while covering broad and difficult-to-reach regions, have recently been recognized as a feasible alternative for monitoring public spaces [18][19][20][21]. The advancements in UAV's have benefited MOT, particularly pedestrian tracking and re-identification, since they offer a viable solution to numerous challenges such as occlusion, moving cameras, and difficult-to-reach locations. Compared to static cameras, UAV's are flexible enough to adapt their location and direction in 3D space.
Re-identification has been applied in MOT for a long time. The contribution of this paper can be highlighted by considering the limitations of state-of-the-art re-identification MOT solutions. For example, although existing methods (e.g., [22]) provide powerful re-identification MOT solutions, they are offline and cannot meet the online processing requirements of drone applications. In contrast, the proposed method is an online solution that is well suited for drones.
Motivated by these facts, in this paper we adapt the approach proposed by [15] for image classification and retrieval to the problem of pedestrian tracking and re-identification from aerial devices. This approach is a simple and efficient method for learning a von Mises-Fisher (vMF) distribution for each ID in the directional space. For spherical data, this distribution can be regarded as the analogue of a Gaussian distribution. Learning a vMF distribution for each ID helps simultaneously in measuring the similarity between object instances and identifying the person ID.
The following are the primary contributions of this paper:
• Introducing directional statistical distributions to pedestrian tracking and re-identification.
• A new object similarity measure based on pedestrian re-identification.
• An end-to-end framework for multiple object tracking and re-identification of pedestrians from aerial devices.
The remainder of this paper is laid out as follows. Section 2 gives background and an overview of related previous work, including a basic introduction to directional statistics and a review of the vMF learning algorithm proposed by [15]. Our proposed online pedestrian tracking and re-identification framework is presented in Section 3. In Section 4, we present the dataset case study. We then present our experimental results. Finally, we summarize our conclusions and outline potential future work.

Pedestrian Tracking
In machine learning, pedestrian tracking refers to the task of detecting and tracking humans from video clips. It plays an important role in many safety systems such as in video surveillance and autonomous vehicle driving [1,2]. However, pedestrian tracking from a fixed camera suffers from several issues such as occlusion, access to area, and covering large areas. To overcome the aforementioned issues, unmanned aerial vehicles (UAV's) such as drones provide a cheap and effective way for collecting data. The UAV's can record the scene from different angles and follow the flow of objects. In recent years, tracking from UAV's has gained a lot of attention from researchers [18][19][20][21].

Tracking
The task of finding objects and retaining their identities over all video frames is known as Multiple Object Tracking (MOT). It is a vital problem for a broad range of computer vision applications such as surveillance and autonomous driving. In recent years, MOT has been dominated by the tracking-by-detection paradigm, in which detections are obtained first and instances of the same object are then linked together. An object instance from frame t − 1 is associated with at most one object instance from frame t. The state of the art of this paradigm can be divided into two main sub-categories:

Online Tracking Mode
Online tracking mode [3][4][5][6][7] is based on associating the detections between the previous frame and the current frame. Usually, these methods are applied in real-time tracking scenarios such as surveillance. Generally, this association problem is formulated as a bipartite graph problem [23]. The two disjoint sets of vertices are the object instances in frame t − 1 and in frame t. The edges between the nodes represent the cost (similarity) between their corresponding instances.
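As a minimal sketch of this bipartite formulation, the snippet below finds the minimum-cost one-to-one assignment between the instances of frame t − 1 and those of frame t by exhaustive search. The function name `associate`, the example costs, and the brute-force strategy are illustrative assumptions; practical trackers would typically use a polynomial-time solver such as the Hungarian algorithm.

```python
from itertools import permutations


def associate(cost):
    """Return (pairs, total_cost) for a minimum-cost one-to-one matching.

    cost[i][j] is the cost of linking instance i of frame t-1 to
    instance j of frame t. Exhaustive search, so only suitable for a
    handful of objects (an illustrative choice, not the paper's solver).
    """
    n_prev, n_curr = len(cost), len(cost[0])
    if n_prev > n_curr:
        # Transpose so that the rows are always the smaller set.
        transposed = [[cost[i][j] for i in range(n_prev)] for j in range(n_curr)]
        pairs, total = associate(transposed)
        return sorted((i, j) for j, i in pairs), total
    best_pairs, best_total = [], float("inf")
    for cols in permutations(range(n_curr), n_prev):
        total = sum(cost[i][j] for i, j in enumerate(cols))
        if total < best_total:
            best_pairs, best_total = list(enumerate(cols)), total
    return best_pairs, best_total


# Hypothetical costs (e.g., 1 - cosine similarity) for 2 previous and 3 current instances.
example_cost = [
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.9],
]
print(associate(example_cost))
```

Instances of the larger set that receive no partner (here, the third detection of frame t) would be treated as new or lost objects by the surrounding tracker.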

Batch Tracking Mode
Methods that are based on batch tracking mode [8][9][10][11] use the detections of all video frames to build a unique track (trajectory) per ID. This is accomplished by associating past, present, and future detections. The association is generally formulated as a graph optimization problem such as maximum flow [24] or minimum cliques [25].

Deep Metric Learning
Many machine learning tasks, such as MOT, require learning a measure of similarity between data objects. The goal of metric learning is to learn a mapping function that quantifies the similarity between data points. The metric learning objective is to minimize the distance between data points from the same category and maximize the distance between data points from different categories.
In recent years, deep learning has gained enormous success in a variety of machine learning tasks such as image classification, image embedding, and multiple object tracking. Deep learning brought revolutionary advances due to its representational power in extracting high abstract non-linear features. This fact led to a new research area known as Deep Metric Learning (DML) [12][13][14][15].
MOT has benefitted from the success of DML, by training a neural network feature extractor to learn a similarity measure between object instance patches.

Pedestrian Re-Identification
Pedestrian re-identification research has a broad range of aspects spanning from feature-based [26] to metric-based [27] and from hand-crafted features to deeply learned features [28,29]. In this report, we review three of the recent and relevant sub-research areas related to pedestrian re-identification problems.
Open-world person re-identification is defined as one-to-one set matching. Given two sets of pedestrians, the first called the probe and the second called the gallery, every person appears in both sets, and the task is to match between them [30]. In this problem, the pedestrian set must be known. However, in some open scenarios it is not guaranteed that a probe will always have a corresponding match in the gallery, which needs to be catered for.
Generalized-view re-identification approaches are mainly based on discriminative learning from different views acquired by two different fixed cameras [31,32]. However, in practice this setting is costly, since it requires data to be collected, annotated, and matched from two different cameras.
Recently, pedestrian re-identification from drones gained a lot of attention and new benchmark datasets are published [20,33]. Drones provide a new tool for data acquisition, especially for video surveillance and analysis. With this new tool, problems such as pedestrian detection, tracking, and re-identification can be taken to new challenges as it helps overcome some of the static camera issues.

Directional Statistics in Machine Learning
Directional data is defined as points x in a Euclidean space with norm $\|x\|_2 = 1$, where $\|\cdot\|_2$ is the Euclidean ($\ell_2$) norm. In other words, directional data corresponds to points on the surface of the unit sphere. Statistics that deal with directional data are called directional statistics.
The topic of directional statistics has gained a lot of attention due to demands from fields such as machine learning, the availability of big data sets that necessitate adaptive statistical methodologies, and technical improvements. Directional statistics methods have led to tremendous success in many computer vision tasks, such as image classification and retrieval [15], pose estimation [34], and face verification [35]. They have also been introduced to other machine learning fields such as text mining [36].

Von Mises-Fisher Distribution
The von Mises-Fisher (vMF) distribution is a probability distribution for directional data. It can be seen as the counterpart of the Gaussian distribution on the sphere, since the two have very similar properties. In a directional data space $S^{p-1}$, its probability density function is defined as:

$$f_p(x; \mu, \kappa) = Z_p(\kappa)\,\exp(\kappa \mu^{\top} x), \qquad Z_p(\kappa) = \frac{\kappa^{p/2-1}}{(2\pi)^{p/2}\, I_{p/2-1}(\kappa)}$$

where µ is the mean direction of the distribution, κ ≥ 0 is a concentration parameter that controls the spread in the way the standard deviation does for a Gaussian distribution (the larger κ, the more concentrated the distribution is around µ), p is the space dimension, $Z_p(\kappa)$ is a normalization term, and $I_v$ is the modified Bessel function of the first kind with order v.
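To make the density concrete, the following sketch evaluates it using only the Python standard library. The helper names `bessel_i` and `vmf_density` are illustrative, and the truncated power series used for the Bessel function is an assumption that is adequate only for low dimensions and moderate κ; a numerical library would be preferable in practice.

```python
import math


def bessel_i(order, x, terms=60):
    """Modified Bessel function of the first kind, I_order(x), via its power series.

    Truncated series: intended for low orders and moderate x only.
    """
    return sum(
        (x / 2.0) ** (2 * m + order) / (math.factorial(m) * math.gamma(m + order + 1))
        for m in range(terms)
    )


def vmf_density(x, mu, kappa):
    """vMF density on the unit sphere S^{p-1} for unit vectors x and mu, with kappa > 0."""
    p = len(mu)
    norm = kappa ** (p / 2 - 1) / ((2 * math.pi) ** (p / 2) * bessel_i(p / 2 - 1, kappa))
    dot = sum(a * b for a, b in zip(x, mu))
    return norm * math.exp(kappa * dot)


mu = [0.0, 0.0, 1.0]
# The density is highest at the mean direction and lowest at its antipode.
print(vmf_density(mu, mu, 2.0), vmf_density([0.0, 0.0, -1.0], mu, 2.0))
```

For p = 3 the normalization term reduces to κ / (4π sinh κ), which gives a convenient check of the implementation.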
Given N samples from a vMF distribution, we can estimate its parameters as follows: In Equation (3),
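As a hedged illustration of such an estimation, the sketch below uses the standard estimators for the vMF distribution: the normalized sample sum for the mean direction, and the closed-form approximation of Banerjee et al. for the concentration. These are assumptions made for illustration; they are not necessarily the exact expressions of Equation (3), which is not reproduced in this text, and the function name `estimate_vmf` is hypothetical.

```python
import math


def estimate_vmf(samples):
    """Estimate (mu, kappa) from a list of unit vectors.

    mu is the normalized sample sum (maximum-likelihood mean direction).
    kappa uses the approximation R(p - R^2) / (1 - R^2), where R is the
    mean resultant length (assumed here, see the lead-in). The estimate
    is undefined when all samples coincide (R = 1).
    """
    n, p = len(samples), len(samples[0])
    total = [sum(s[d] for s in samples) for d in range(p)]
    length = math.sqrt(sum(v * v for v in total))
    mu = [v / length for v in total]
    r_bar = length / n  # mean resultant length, in [0, 1]
    kappa = r_bar * (p - r_bar ** 2) / (1 - r_bar ** 2)
    return mu, kappa


print(estimate_vmf([[1.0, 0.0], [0.0, 1.0]]))
```

Tightly clustered samples give a mean resultant length close to 1 and therefore a large κ, while widely scattered samples give a small κ.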

Learning Von-Mises Fisher Distribution
The learning problem is defined as follows. Given C identities, the goal is to learn a vMF distribution for every ID, parameterized by $\{\kappa_c, \mu_c\}_{c=1}^{C}$. Given a point x in the mapping space, the normalized probability of x belonging to a chosen class c is defined as

$$P(c \mid x, \{\kappa_i, \mu_i\}_{i=1}^{C}) = \frac{Z_p(\kappa_c)\exp(\kappa_c \mu_c^{\top} x)}{\sum_{i=1}^{C} Z_p(\kappa_i)\exp(\kappa_i \mu_i^{\top} x)} \quad (4)$$

where $Z_p(\kappa)$ denotes the normalization term of the vMF density. Equation (4) can be used to increase the likelihood that the sample belongs to the correct class while decreasing the likelihood that it belongs to other classes. Given a mini-batch with N samples and C identities, we can maximize the following objective function:

$$P(Y \mid X, \Theta, \{\kappa_i, \mu_i\}_{i=1}^{C}) = \prod_{n=1}^{N} P(y_n \mid x_n, \{\kappa_i, \mu_i\}_{i=1}^{C}) \quad (6)$$

where X and Y represent the data points in the mini-batch and their ID labels, and Θ contains the deep model parameters. For simplification, we assume κ to be constant for all IDs, so that the normalization terms cancel. Taking the negative log-likelihood, Equation (6) can then be simplified to:

$$\mathcal{L} = -\sum_{n=1}^{N} \log \frac{\exp(\kappa \mu_{y_n}^{\top} x_n)}{\sum_{c=1}^{C} \exp(\kappa \mu_c^{\top} x_n)} \quad (7)$$

Since it is difficult to simultaneously optimize the neural network parameters Θ and the vMF mean directions $\{\mu_c\}$, the authors of [15] proposed a learning algorithm (Algorithm 1) that is based on alternating optimization. In this algorithm, the mean directions are kept fixed while the neural network parameters are trained for several iterations; the mean directions are then updated using the training data set. The mean direction update is based on an estimation that uses all training data points. The algorithm converges when the mean directions and the loss stagnate. Given a class i with N training data points, let $x_n$ denote the mapping of the nth sample under the current mapping function, where n = 1, ..., N. The mean direction of class i can be updated as follows:

$$\mu_i = \frac{\sum_{n=1}^{N} x_n}{\left\| \sum_{n=1}^{N} x_n \right\|_2} \quad (8)$$

Algorithm 1 vMF learning algorithm
1. Estimate the mean directions using (8) and all the training data.
2. Train the CNN for several iterations and update Θ.
3. Repeat steps 1 and 2 until the mean directions and the loss stagnate.
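The two alternating ingredients of this algorithm can be sketched in plain Python as follows: the negative log-likelihood with a shared κ, and the mean-direction update of (8). The network training step is deliberately omitted, the embeddings are plain lists standing in for network outputs, and the function names are illustrative assumptions rather than the implementation of [15].

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    length = math.sqrt(sum(a * a for a in v))
    return [a / length for a in v]


def vmf_loss(embeddings, labels, means, kappa):
    """Negative log-likelihood over a mini-batch with a shared concentration kappa.

    embeddings: unit vectors produced by the network (here plain lists).
    labels: class index of each embedding.
    means: dict mapping class index -> unit mean direction.
    """
    loss = 0.0
    for x, y in zip(embeddings, labels):
        logits = {c: kappa * sum(a * b for a, b in zip(mu, x)) for c, mu in means.items()}
        top = max(logits.values())  # subtract the max for numerical stability
        log_norm = top + math.log(sum(math.exp(v - top) for v in logits.values()))
        loss -= logits[y] - log_norm
    return loss


def update_means(embeddings, labels):
    """Mean-direction update of (8): normalized per-class sum of embeddings."""
    sums = {}
    for x, y in zip(embeddings, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, a in enumerate(x):
            acc[d] += a
    return {c: normalize(s) for c, s in sums.items()}


batch = [normalize([1.0, 0.1]), normalize([0.9, -0.1]), normalize([0.0, 1.0])]
ids = [0, 0, 1]
means = update_means(batch, ids)
print(vmf_loss(batch, ids, means, kappa=15.0))
```

With κ held constant, the loss has the form of a softmax cross-entropy over cosine similarities scaled by κ, which is why it can be minimized with standard mini-batch training between two mean-direction updates.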
During the inference phase, we can predict the ID of a given object by measuring cosine similarity with the learned mean directions. The object will be assigned with the ID label of its nearest mean vector.
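A minimal sketch of this inference rule is given below. It assumes that both the embedding and the mean directions have unit norm, so that the dot product equals the cosine similarity; the function name `predict_id` and the example values are hypothetical.

```python
def predict_id(embedding, means):
    """Return (id, cosine) of the mean direction closest to a unit-norm embedding.

    means: dict mapping ID -> unit mean direction.
    """
    scores = {c: sum(a * b for a, b in zip(embedding, mu)) for c, mu in means.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


learned_means = {"id_a": [1.0, 0.0], "id_b": [0.0, 1.0]}  # hypothetical mean directions
print(predict_id([0.6, 0.8], learned_means))
```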

Dataset Description
In recent years, pedestrian tracking and re-identification have gained a lot of research interest due to their importance and wide applicability, for example in surveillance systems and traffic control. However, the acquisition system is usually limited to stationary cameras, which is a main issue for tracking and especially for pedestrian re-identification. Lately, Unmanned Aerial Vehicles (UAV's) have been viewed as a viable option for monitoring public areas, as they provide a low-cost method of data collection while covering large and difficult-to-reach areas.
The P-DESTRE [33] dataset is a publicly available and fully annotated dataset for detecting, tracking, re-identifying, and searching pedestrians from aerial devices. The data was gathered by a group of researchers from the University of Beira Interior (Portugal) and the JSS Science and Technology University (India). They recorded crowded scenes at both institutions using "DJI Phantom 4" drones. These drones, as shown in Figure 1, are piloted by humans and collect data from a volunteer audience walking below, at flight altitudes ranging from 5.5 to 6.7 meters. In total, 75 videos were collected at 30 frames per second (fps). These videos contain 318,745 annotated instances of 269 different IDs. These statistics are summarized in Table 1. The primary characteristic that distinguishes P-DESTRE from comparable datasets is the pedestrian search challenge, in which data is collected over long periods of time (e.g., days or weeks) with constant ID labels across observations. Re-identification techniques in this problem cannot rely on clothing appearance-based features, which is the key property that distinguishes search from the (less difficult) re-identification problem, in which the consecutive observations of each ID are assumed to have been taken within short intervals of time, so that clothing appearance features can be reliably used.
We believe that this dataset is an excellent case study for training and evaluating our frameworks for pedestrian tracking and re-identification from aerial devices. This dataset was used for all of the experiments and we plan to look at more aerial datasets in the future.

Multiple Object Tracking and Re-Identification Framework
Our proposed approach is based on the generic MOT paradigm of detection followed by tracking. We utilized YOLO V4 [37] as a detector. The proposed tracking method is then applied to track and recognize pedestrians. Figure 2 illustrates the structure of this framework in the inference phase. It comprises four major steps:

1. We used YOLO V4 for real-time object detection. The YOLO V4 detector receives input videos frame by frame. As an output, the detection bounding boxes in each frame are obtained.
2. Patches are cropped from each frame using the bounding boxes that have been detected.
3. These patches are resized to the unified shape (H, W, C), where H denotes height, W denotes width, and C denotes channel count. A feature vector is then generated for each detected object using the trained feature extractor.
4. Using the feature vectors from frames t and t − 1, the object association and re-identification algorithm matches and recognizes the objects.
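The four steps above can be sketched as a single per-frame function. Every callable passed in (`detect`, `crop_and_resize`, `embed`, `associate`) is a hypothetical placeholder for the corresponding component, not an actual YOLO V4 or feature extractor API; the sketch only shows how the data flows between the steps.

```python
def track_frame(frame, detect, crop_and_resize, embed, associate, previous):
    """One online step: detect, crop, embed, then associate with frame t-1.

    All callables are hypothetical placeholders:
      detect(frame) -> list of bounding boxes
      crop_and_resize(frame, box) -> patch of shape (H, W, C)
      embed(patch) -> feature vector
      associate(features, previous) -> list of IDs, one per detection
    `previous` holds the (feature, id) pairs kept from frame t-1.
    """
    boxes = detect(frame)                                  # step 1: detection
    patches = [crop_and_resize(frame, b) for b in boxes]   # steps 2 and 3: crop, resize
    features = [embed(p) for p in patches]                 # step 3: feature vectors
    ids = associate(features, previous)                    # step 4: association
    current = list(zip(features, ids))  # the only state kept for the next frame
    return list(zip(boxes, ids)), current
```

Calling this function once per frame and feeding the returned state back in as `previous` reproduces the online behavior, in which only the objects of the previous frame are retained.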

Figure 2. Multiple object tracking online framework. First, patches from frame t are extracted and resized to the same shape. Second, these patches are mapped through the trained feature extractor network to obtain the visual descriptors. Third, a pairwise similarity measure is computed between the object descriptors of frame t and those of frame t − 1. Finally, an association algorithm is applied to match object instances based on their similarity.
Two neural network architectures, YOLO V4 and the feature extractor, must be trained for this framework. Using the P-DESTRE dataset, we trained these two architectures independently, using the 10-fold learning/validation/test splits provided on the dataset web page. In each fold, the data is randomly divided into 60% for learning, 20% for validation, and 20% for testing.

• YOLO V4: For each fold, we fine-tuned the pre-trained YOLO V4 architecture on the train set to detect only pedestrians. The detector performance is monitored using the error on the validation set.
• Feature extractor: The feature extractor architecture is composed of a base model and a header. We used Wide ResNet-50 (WRN) [38] as the base model, and a header composed of two Fully connected Pooling Layers (FPL) of sizes [4096, 128], respectively. For each fold, train, validation, and test patch datasets are created from the train, validation, and test sets, respectively. The model is trained using the vMF learning algorithm (Algorithm 1). Figure 3 illustrates the training mechanism of the feature extractor.
During training, the detector and the vMF feature extractor are trained separately using the detection ground truth, that is, the annotated bounding boxes. During inference, the detector provides the bounding boxes, and the learned vMF mean directions are then used to track and re-identify each person.

Object Association and Re-Identification
Learning a vMF distribution for each ID class has the ultimate objective of predicting the ID of an object instance during inference. These learned distributions can also be used to compare the similarity of any two object instances. As a result, the learned distributions can be used to predict the ID and assess similarity at the same time. Once the feature extractor has been trained offline using the proposed framework, we integrate it into the online multiple object tracking and re-identification system. In this system, the current detections at frame t are cropped and scaled to fit the input shape of the feature extractor. Then, these patches are mapped to the embedding space. Since the system is online, it only retains the objects of the previous frame t − 1 and their IDs. The next stage in the system is the pairwise similarity measure between the object instances at frame t and the object instances at frame t − 1. The object detection association is the final phase.

Similarity Measure
Since the detection patches are mapped to a Euclidean space with unit norm $\|x\|_2 = 1$, the similarity can be measured using either cosine similarity or Euclidean distance. We opted to utilize only one of them, since on the unit sphere the two metrics are monotonically related (a larger cosine similarity always corresponds to a smaller Euclidean distance) and therefore provide exactly the same results.
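The relation between the two metrics can be made explicit with a short derivation. For two embeddings a and b with unit norm, expanding the squared Euclidean distance gives:

```latex
\|a - b\|_2^2
  = \|a\|_2^2 + \|b\|_2^2 - 2\,a^{\top} b
  = 2 - 2\cos(a, b)
```

The squared distance is therefore a strictly decreasing function of the cosine similarity, so ranking candidate matches by smallest distance or by largest cosine similarity yields the same order.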

Object Detection Association
The final step in the online tracking and re-identification system is the object association and the re-identification of the IDs. For object association, the current object detections in frame t need to be matched with those in frame t − 1. However, a pair of matched objects needs to belong to the same identity. In other words, we cannot match two object instances that are assigned to two different identities. Therefore, consistency between the data association and the identity prediction needs to be established.
Prior to the data association step, data preparation has to be completed. Let $D_t$ denote the set of object detections in frame t, where $d_i^t$ denotes the ith detected object, and let $\mu_c$ denote the learned mean direction for ID c. The data association can be done in two different ways. First, we can rely only on ID prediction by assigning the class whose mean direction is nearest to the object representation:

$$\hat{c}_i = \arg\max_{c} \cos(d_i^t, \mu_c) \quad (9)$$

where cos is the cosine similarity function and $d_i^t$ is represented by its feature vector. This technique is simple and straightforward, since no data association algorithm is needed; it can be seen as a recognition problem. Moreover, it is pertinent to mention that all subjects are in-distribution, that is, they are present in the training dataset. Second, to associate targets from the previous frame $D_{t-1}$ with detections produced in the current frame $D_t$, we utilize the product of an appearance criterion and a position criterion to indicate the amount of similarity of the target-detection pair formed by the ith target and the jth detection. This product is defined in Equation (10), in which the position criterion is normalized by $\mathrm{Max}_p - \mathrm{Min}_p$, where $\|p_i^t - p_j^{t-1}\|_2$ is the Euclidean distance between the two object centroids in the pixel (image) space, and $\mathrm{Max}_p$ and $\mathrm{Min}_p$ are the maximum and the minimum of these distances, respectively, between the two sets $D_t$ and $D_{t-1}$. Each of these criteria is normalized to the [0, 1] range, so that the product is a normalized score. This similarity is motivated by the fact that the appearance similarity between two object instances is not enough on its own; it has to be reinforced by a position confidence score. We suppose that two detections belonging to the same ID have a similarity measure that is higher than a predetermined threshold T. This threshold is a user-defined hyper-parameter. Finally, we assign IDs to the unmatched subset of objects in frame t, denoted $D_u^t$, using Equation (9).
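The second association strategy can be sketched as follows. Since the exact form of Equation (10) is not recoverable from this text, the sketch makes two explicit assumptions: the cosine similarity is rescaled to [0, 1] as (cos + 1) / 2, and the position score is one minus the min-max normalized centroid distance. The greedy matching and the function names are likewise illustrative choices, not the paper's algorithm.

```python
import math


def pair_scores(curr, prev):
    """Appearance-times-position scores between detections of frames t and t-1.

    curr, prev: non-empty lists of (unit_feature, (cx, cy)) tuples.
    Assumptions (not taken from the paper): cosine similarity is rescaled
    to [0, 1] as (cos + 1) / 2, and the position score is one minus the
    min-max normalized centroid distance.
    """
    dists = [[math.dist(pc, pp) for _, pp in prev] for _, pc in curr]
    flat = [d for row in dists for d in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero when all distances are equal
    scores = []
    for (fc, _), row in zip(curr, dists):
        line = []
        for (fp, _), d in zip(prev, row):
            appearance = (sum(a * b for a, b in zip(fc, fp)) + 1.0) / 2.0
            position = 1.0 - (d - lo) / span
            line.append(appearance * position)
        scores.append(line)
    return scores


def match(scores, threshold):
    """Greedy one-to-one matching of the pairs whose score exceeds the threshold.

    Returns (matches, unmatched): matches maps i (frame t) -> j (frame t-1),
    and unmatched lists the detections of frame t left for ID prediction.
    """
    candidates = sorted(
        ((s, i, j) for i, row in enumerate(scores) for j, s in enumerate(row) if s > threshold),
        reverse=True,
    )
    matches, used = {}, set()
    for _, i, j in candidates:
        if i not in matches and j not in used:
            matches[i] = j
            used.add(j)
    unmatched = [i for i in range(len(scores)) if i not in matches]
    return matches, unmatched


curr = [([1.0, 0.0], (0, 0)), ([0.0, 1.0], (10, 10))]
prev = [([1.0, 0.0], (1, 0)), ([0.0, 1.0], (10, 11))]
print(match(pair_scores(curr, prev), threshold=0.2))
```

Detections returned in `unmatched` would then receive their ID from the nearest mean direction, as in the first strategy.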

Experiments
In this section, we report the performance results of the proposed framework. In order to provide comparable results, we follow the experimental protocol provided on the P-DESTRE web page http://p-destre.di.ubi.pt/experiments.html (accessed on 20 August 2022). The experiments were divided into three categories: (1) pedestrian detection; (2) tracking; (3) re-identification. The state-of-the-art results are reported directly from the original paper.

Pedestrian Detection
For the YOLO V4 detector, we set the Non-Max-Suppression (NMS) intersection-over-union threshold between predictions to 0.5 and the score threshold for prefiltering to 0.05. The maximum number of positive detections (predicted pedestrians per image) is set to 100. During the inference phase, the score threshold is set to 0.3. To train the detector, 300 epochs with a learning rate of 0.0001 were sufficient. In order to align with the results reported in the original paper, we used the same development kit http://host.robots.ox.ac.uk/pascal/VOC/voc2012/#devkit (accessed on 20 August 2022) to evaluate the YOLO V4 detection performance on the P-DESTRE dataset. We followed the provided 10-fold cross validation scheme http://p-destre.di.ubi.pt/pedestrian_detection_splits.zip (accessed on 20 August 2022), in which the data of each fold is randomly divided into 60% for learning, 20% for validation, and 20% for testing.
The results of all detection methods are summarized in Table 2. The approaches were assessed using the Average Precision (AP) measure at an Intersection over Union (IoU) of 0.5 (AP@IoU = 0.5). As can be observed, YOLO V4 considerably outperformed the other detection methods, which motivated its use in the overall pipeline. It is also built for online real-time detection, making it simple to combine with online tracking and re-identification systems as a detector. Overall, we noticed that YOLO V4 struggled in crowded situations with occlusions, where only a small portion of a pedestrian was visible. The other approaches in the P-DESTRE paper were also challenged by this problem.

Table 2. The Average Precision (AP) results obtained by 4 detection methods in the P-DESTRE dataset [33]. Results for RetinaNet, R-FCN, and SSD are taken from [33].
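To illustrate how these thresholds interact, the following sketch implements a generic greedy non-maximum suppression. It is not YOLO V4's internal implementation; the function names and the (x1, y1, x2, y2) box format are assumptions, and the default values simply mirror the settings quoted above.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5, score_threshold=0.05, max_detections=100):
    """Greedy non-maximum suppression; returns the indices of the kept boxes."""
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_threshold),  # prefiltering
        key=lambda i: scores[i],
        reverse=True,
    )
    kept = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
        if len(kept) == max_detections:
            break
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.01]
print(nms(boxes, scores))
```

In the example, the second box is suppressed because it overlaps the first with an IoU above 0.5, and the fourth is removed by the score prefilter.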

Pedestrian Tracking
We trained the feature extractor using the 10-fold cross validation scheme of the P-DESTRE set. In this scheme, each split is randomly divided into 60% for learning, 20% for validation, and 20% for testing, i.e., 45 videos are used for learning, 15 for validation, and 15 for testing. Only the subset of identities that are available in the training videos is used for learning the vMF distributions. From each frame, the bounding boxes are cropped and scaled to patches of dimension (48, 64, 3). As preprocessing, the input patch is normalized to the [0, 1] range by dividing by 255. We set the concentration parameter κ, a hyper-parameter of the learning algorithm, to 15; this value produced the best results experimentally. The feature extractor was trained for 50 epochs with a batch size of 64. The dimension of the embedding space is 128. The similarity threshold is set to 0.2.
As detailed in [42], we assessed the performance of our proposed tracking system using three metrics: Multiple Object Tracking Accuracy (MOTA), Multiple Object Tracking Precision (MOTP), and F1 score. Table 3 summarizes the performance results. Our proposed tracking method slightly outperformed the other tracking methods in terms of MOTA, and provides results comparable to TracktorCV in terms of MOTP. In terms of F1 score, our method is significantly better.
From a qualitative standpoint, our approach struggles with severe occlusions. Normally, it is difficult to reestablish the person ID in online tracking after an occlusion happens, but our method is able to restore the person ID in several cases thanks to the re-identification based on the vMF mean directions (Equation (9)).

Table 3. Comparison between 3 tracking algorithms from the state of the art and our tracking algorithm in the P-DESTRE dataset [33]. Results for TracktorCV, V-IOU, and IOU are taken from [33].

We designed an experiment to display the tracking results in inference mode on a chosen video, as a proof of concept for our detection plus tracking approach. We picked one of the 5 folds at random, trained the algorithm, and then applied it in inference mode to a particular video from the test set. The results of the detection and tracking at three distinct timestamps are shown in Figure 4. The tracking algorithm was successful in keeping track of the majority of the pedestrians in the scene. Our investigation revealed that the majority of switched identities occur as a result of missed detections at particular frames, which lead to two cases: (1) a new ID is assigned to the pedestrian when it is detected again; and (2) two pedestrians switch their identities.
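For reference, two of these metrics can be computed from accumulated counts as sketched below. The MOTA expression is the standard CLEAR MOT definition; whether the reported F1 score is computed from detection-level or identity-level counts is not specified in this text, so the generic precision/recall form is an assumption, as are the function names and the example counts.

```python
def mota(false_negatives, false_positives, id_switches, num_ground_truth):
    """CLEAR MOT accuracy: 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_ground_truth


def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
print(mota(false_negatives=10, false_positives=5, id_switches=5, num_ground_truth=100))
print(f1_score(true_positives=80, false_positives=20, false_negatives=20))
```

Note that MOTA can become negative when the number of errors exceeds the number of ground-truth objects.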

Long-Term Pedestrian Re-Identification
Long-term pedestrian re-identification is one of the main reasons for the development of tracking and re-identification based on vMF distribution. In a long-term real world scenario pedestrian clothing appears to differ between lapses of time spanning over days and weeks. Therefore, a reliable identifier must depend on persistent features such as face or body rather than clothing.
In this experiment, the data was randomly divided into 5 folds. Each fold is split into 50% learning, 10% gallery, and 40% query data. Detailed information about this split is provided at http://p-destre.di.ubi.pt/pedestrian_search_splits.zip (accessed on 20 August 2022).
Our approach to re-identifying pedestrians can be assessed in two ways. First, we utilize Equation (9) to obtain the nearest mean direction among the IDs. Second, the top-N recall performance is used. To be consistent, we report the same metrics as the state-of-the-art results. We also present the re-identification results obtained with the measure introduced in Equation (9). The results are summarized in Table 4. Our method significantly outperformed the other methods across the different metrics. This demonstrates that a feature extractor based on vMF is able to learn reliable features that help in recognizing the person, rather than the person's clothing appearance.

Table 4. Comparison between the re-identification performance attained by the state-of-the-art methods and ours based on vMF on the P-DESTRE dataset [33]. Results for ArcFace + COSAM are taken from [33].
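A top-N evaluation of this kind can be sketched as follows, ranking the learned mean directions by cosine similarity for each query. This is a simplified illustration: the benchmark's exact protocol (for example, ranking gallery images rather than mean directions) may differ, and the function name and example values are hypothetical.

```python
def top_n_hit_rate(query_embeddings, query_ids, means, n):
    """Fraction of queries whose true ID is among the n most similar mean directions.

    Embeddings and mean directions are assumed to have unit norm, so the
    dot product equals the cosine similarity.
    """
    hits = 0
    for x, true_id in zip(query_embeddings, query_ids):
        ranked = sorted(
            means,
            key=lambda c: sum(a * b for a, b in zip(x, means[c])),
            reverse=True,
        )
        hits += true_id in ranked[:n]
    return hits / len(query_ids)


means = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [-1.0, 0.0]}  # hypothetical mean directions
queries = [[0.8, 0.6], [0.6, 0.8]]
print(top_n_hit_rate(queries, [0, 0], means, n=1))
```

With n = 1 this reduces to the accuracy of the nearest-mean-direction rule of Equation (9).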

Analysis
As a proof of concept that the proposed framework can learn a vMF distribution for each pedestrian identity, we designed an experiment to visualize the learned embeddings. The findings are illustrated in Figure 5. We randomly selected 7 pedestrian identities, each of which is associated with a color in the figure, and applied the vMF learning algorithm to their patch dataset. For the purpose of visualization, we set the embedding dimension to 3. We also set the concentration κ = 15. The results demonstrate that the algorithm was successful in learning a vMF distribution for each identity: patches belonging to the same identity are grouped together into a single cluster and set apart from embeddings belonging to other identities. The experiments also showed that a high concentration does help in moving data points with the same identity in the direction of their mean, but it makes the model learn vMF distributions that are close to each other. This means that with high concentration values the learning is dominated by the intra-class variations (variations in the appearance of the same pedestrian across different scenes) rather than by the inter-class variations (variations between different pedestrian identities).

Figure 5. The learned embedding of 7 randomly selected pedestrian identities, where each color is associated with one single identity. For the purpose of visualization, we set the embedding dimension to 3. We also set the concentration κ = 15.

Conclusions
An online (real-time) pedestrian tracking and re-identification approach based on images acquired from aerial devices (drones) is proposed and evaluated. The foundation of the proposed framework is a neural network encoder that is trained to learn a vMF distribution for each ID in directional space. The learned distributions aid in tracking and recognizing pedestrians at the same time. The P-DESTRE dataset case study was used to assess the tracking and re-identification performance of our proposed system using standard metrics. We demonstrated that we can outperform other approaches in re-identifying pedestrians with more efficient feature retention. We plan to expand this approach and apply it to additional types of datasets in the future. Another aspect that could be looked into is online learning. Our proposed approach has been tested for other applications and can be adopted for tracking different objects such as vehicles, military equipment, and parcels.
It is worth noting that these approaches are computationally intensive for training and comparison purposes. However, with computational advances, deploying these models for tracking and re-identification is becoming feasible.