A Hierarchical Association Framework for Multi-Object Tracking in Airborne Videos

Multi-object tracking (MOT) in airborne videos is a challenging problem due to the uncertain motion of the airborne vehicle, vibrations of the mounted camera, unreliable detections, variations in the size, appearance and motion of the moving objects, as well as occlusions caused by interactions between the moving objects and with other static objects in the scene. To deal with these problems, this work proposes a four-stage Hierarchical Association framework for multiple object Tracking in Airborne video (HATA). The proposed framework combines data association-based tracking (DAT) methods and target tracking based on a Compressive Tracking approach to robustly track objects in complex airborne surveillance scenes. In each association stage, different sets of tracklets and detections are associated to efficiently handle local tracklet generation, local trajectory construction, global drifting tracklet correction and global fragmented tracklet linking. Experiments on challenging airborne video datasets show significant tracking improvements compared to existing state-of-the-art methods.


Introduction
The goal of multi-object tracking (MOT) in airborne videos is to estimate the state of multiple objects and preserve their identities under appearance and motion variations over time [1][2][3][4]. This is a challenging problem because of the uncertain motion of the airborne vehicle, the vibration of the non-stationary camera and the partial occlusions of the objects [5]. Much attention has been paid to data association-based tracking (DAT) methods [6], along with the improvement of object detection methods, which provide reliable detections even in complex scenarios [7,8]. To produce the final trajectories for each tracked object, most DAT approaches rely on the detection accuracy [9] and on an affinity model [10,11] integrating multiple visual cues (i.e., appearance and motion) to find the linking probabilities between detection responses and tracklets in subsequent frames [7,8].
Existing object detectors can be roughly categorized into off-line and on-line methods. Off-line detectors use a pre-defined strategy to learn the patterns representing the object's appearance by using various kinds of features. They are widely used in MOT because they are less sensitive to image noise [12][13][14]. In the aerial surveillance domain, the several types of targets and their fine-grained size and appearance differences (due to their own movement as well as the motion of the UAV) make such methods difficult to train to a reasonable detection performance. For these reasons, on-line detectors using motion compensation-based models [10,11,[15][16][17][18] are more popular in airborne video analysis. Objects whose motion and appearance cues differ from the background can be automatically detected without any prior information. Moreover, the low computational complexity of such algorithms makes them suitable for embedded platforms on board unmanned aerial vehicles (UAVs).
Generally speaking, the performance of existing compensation-based detectors is a trade-off between the detection rate and the false alarm rate, because an accurate estimate of the camera's motion model cannot be computed in reasonable time. Most compensation-based algorithms assume a simple camera model, such as the affine or projective camera model [19]. To reduce false detections, Yin et al. [20] adopted a detection method based on forward-backward motion history images (MHI) to localize moving objects. The required forward motion history makes this method unsuitable for real-time applications. To analyze long-term object motion patterns, Yu et al. [7] used a Tensor Voting computational framework to detect and segment moving objects. This method might be impractical in many real-world applications because it requires the full image sequence for the global analysis step. Considering the errors which could arise from motion compensation, Kim et al. [21] proposed a spatio-temporal distributed Gaussian model, while a dual-mode single Gaussian model (SGM) has been adopted by Yi et al. [22]. These approaches reduce many false detections and achieve real-time performance with a low computational complexity, but they produce missed detections and still show unsatisfactory performance in complex scenes. In [19], the authors combined the spatio-temporal properties of moving objects and the SGM background model to reduce both missed and false detections.
Occlusions are the main problem for both off-line and on-line detectors [8,[23][24][25]. To overcome such problems, some recently proposed tracking algorithms recover the trajectories of all targets via a two-stage association framework [14,24,26]. In the first stage, a set of reliable short tracklets is locally generated by linking the detections to tracklets. In the second stage, to build longer tracklets and deal with frequent occlusions, a global optimal solution is obtained by solving a maximum a posteriori (MAP) problem using various optimization algorithms. Such two-stage DAT approaches can be applied to time-critical applications since they sequentially build trajectories based on a frame-by-frame association. However, DAT cannot be directly adopted in airborne videos, as both the local and global association stages require efficient object detection with accurate object location and size [24,25].
In this paper, we take into account most of the limitations of the previous methods and propose an efficient Hierarchical Association framework for multiple object Tracking in Airborne platforms (HATA). We adopt the SGM [22] as on-line object detector and, motivated by the works of Bae et al. [14] and Ju et al. [27], we formulate the MOT problem as a hierarchical DAT based on tracklet confidence. The proposed hierarchical association framework considers a four-stage approach for data association: a local tracklet generation stage, followed by a local trajectory construction step, then a global drifting tracklet correction stage, and finally a global fragmented tracklet linking step. To this end, the tracklets and the detections are divided into several groups depending on the tracklet confidence and association results. Furthermore, for each tracklet we maintain a Kalman Filter tracker and an appearance-based tracker, built upon Compressive Tracking [28,29], to deal with (i) changes in the target's appearance, (ii) occlusions, and (iii) motion-less tracklets. Moreover, the appearance-based tracker is used to update the tracklet states when managing unreliable associations.
In our algorithm, we define two types of occlusions: Occlusion-I, when two tracked objects overlap, and Occlusion-II, when an object is occluded by a static obstacle within the environment (e.g., trees). Occlusion-I is handled via the detection-tracklet association. Occlusion-II cases are more challenging because of the lack of hard temporal (frame-to-frame) constraints. For such cases, we apply an object re-identification approach based on tracklet-to-tracklet matching, using a set of appearance and motion features extracted around the target objects [30]. The proposed MOT framework robustly tracks multiple objects in complex scenes and can be fully implemented for real-time applications.

Related Works
In this section, we give an overview of state-of-the-art methods for MOT in airborne surveillance, the main data association-based tracking (DAT) approaches on which we base our work, and finally basic object re-identification methods.
MOT in airborne videos: A number of methods for detecting and tracking objects from airborne platforms have been developed in the last decades [2][3][4][8,31,32]. Early approaches adopt optical flow [33] or feature points [5,7] to detect and estimate the trajectories of the moving objects. Yu and Medioni [7] estimated the motion flow in each frame based on a cross-correlation method, and then used a Tensor Voting approach to analyze the optical flow and segment moving objects. The motion history image (MHI) method [20] is used to generate the initial segmentations, and the tracklets are generated by using the appearance similarity and flow dynamics between the segmented regions. The mean-shift algorithm is applied to predict the location in the motion field, and the end (entry and exit) information of a flow is imposed as an environmental constraint when associating tracklets. However, their tracking framework needs a relatively long sequence to detect motion patterns, which causes tracking delays and is hence not practical for real-time tracking. In [34], Kanade-Lucas-Tomasi (KLT) features and a temporal differencing method have been used to separate moving vehicles from the background. Local features are clustered to establish different motion layers for vehicle tracking. This method is robust to partial occlusion, but it fails to locate vehicles when the background is highly cluttered. To solve this problem, the authors proposed in [35] a tracking framework based on the particle filter. An estimate of the vehicle's ego-motion is incorporated into the particle filter framework to guide particles towards the target position.
Prokaj et al. [10] presented a method for vehicle tracking in an aerial surveillance context. First, moving object detection is done using background subtraction, where the background is modeled as the mode of a (stabilized) sliding window of frames [10]. Then, they formulated the data association problem as inference in a set of Bayesian networks using motion and appearance consistency. Such an approach avoids the exhaustive evaluation of data association hypotheses, provides a confidence estimate of the solution, and handles split-merge observations. In [36], a collaborative framework consisting of a two-level tracking process has been adopted to track objects as groups. The higher-level process builds a relevance network and divides objects into different groups, where the relevance is calculated based on the information obtained from the lower-level processes. In [16], Prokaj et al. handled the missed detections by generating virtual detections. Any time a detection in frame t does not have an object to link to in frame t + 1, a virtual detection is generated by predicting the location and appearance of the target in the next frame. This procedure is also recursive, so that when a newly added virtual detection does not have nearby detections in the next frame, the process is repeated. In [18], Prokaj et al.
presented a multiple target tracking approach that does not exclusively rely on background subtraction and is better able to track targets through stops. It accomplishes this by effectively running two trackers in parallel: one based on detections from background subtraction, providing target initialization and reacquisition, and one based on a target state regressor, providing frame-to-frame tracking. The detection-based tracker provides accurate initialization by inferring tracklets over a short time period (5 frames). The initialization period is then used to learn a non-parametric regressor based on target appearance templates, which can directly infer the true target state from a given target state sample in every frame. When the regressor-based tracker fails (loses a target), it falls back to the detection-based tracker for re-initialization. However, the regressor's output is meaningless when the target is not visible.
Two-stage DAT: Xing et al. [24] proposed to combine local linking and global association in a two-stage data association-based tracking (DAT) framework. They produce locally optimized tracklets by associating observations with tracklets, and global tracklets by associating fragmented tracklets, using a greedy method for the local association and a predefined appearance model. Similarly, Bae et al. [26] proposed a Bayesian data association approach in which a tracklet existence probability is used during the local stage to assign the detections to tracks. Such an approach handles partial occlusions. The tracklet-to-tracklet global association stage uses an adjusted tracklet management system to link fragmented tracklets under long-term occlusions. In a more recent work, Bae et al. [14] formulated the multi-object tracking problem as a two-stage DAT based on tracklet confidence. The tracklets with high confidence are sequentially grown with the provided detections, while the fragmented tracklets, with low confidence, are linked to the other tracklets and detections without any iterative and expensive association. However, long-term occlusions have not been considered by the authors. To improve upon the approach of [14], Ju et al. [27] recently proposed a four-stage hierarchical association framework based on an on-line matching strategy and tracklet confidence. The tracklets and detections are divided into several groups depending on several cues obtained from the matching results and a proposed tracklet confidence. In each matching stage, different sets of tracklets and detections are associated to handle frequent and prolonged occlusions, abrupt motion changes of objects, and unreliable detections. In our framework, we follow the four-stage idea of [27], but use an on-line detection approach and involve multiple appearance-based trackers.
Re-identification: Object re-identification (Re-ID) has been an active research topic in the past few years. It has been intensively studied for stationary inter-camera target association [37] in long-term object tracking. A typical Re-ID algorithm is based on appearance modeling and matching [38,39]. The appearance modeling often uses low-level features such as color, texture, gradient, or a combination of them, to build discriminative appearance descriptors [37,38]. Many successful Re-ID algorithms have been proposed for specific target Re-ID systems [37][38][39][40], such as pedestrians and vehicles. Liu et al. [37] exploit a spatio-temporal body-action model trained by Fisher vector learning to cope with the large appearance variations of a pedestrian. Zapletal et al. [38] proposed an approach based on a linear regression model using color histograms and histograms of oriented gradients for vehicle re-identification in a multiple-camera scenario. Liu et al. [39] proposed a fusion model of low-level features and high-level semantic attributes for vehicle Re-ID. In our framework, we follow the idea of object matching using appearance and motion cues for object re-identification after a long-term occlusion.

Framework Overview
We follow the notation defined in [14]. An object i appearing in frame t is denoted as present using a binary function: φ^i_t = 1 if present, φ^i_t = 0 otherwise. When φ^i_t = 1, the state of object i is represented as x^i_t = (p^i_t, w^i_t, h^i_t, v^i_t), where p^i_t = (p^i_t(x), p^i_t(y)), w^i_t, h^i_t and v^i_t = (v^i_t(x), v^i_t(y)) are the object's center location, the width and height of its bounding box, and its velocity, respectively. We then define the tracklet T^i_{1:t} of object i as the set of its states up to frame t, and denote by D_{1:t} = {D_1, ..., D_t} the set of all observations up to frame t. Following the approach of [14], the objective of MOT is to find the optimal set of tracklets T*_{1:t} by maximizing the posterior probability for a given D_{1:t}:

T*_{1:t} = arg max_{T_{1:t}} p(T_{1:t} | D_{1:t}).

Using a tracklet confidence, Ω(T^i_t) ∈ [0, 1], estimated as the affinity between a tracklet and its associated detections, Bae and Yoon [14] formulated the above problem as

T*_{1:t} = arg max_{T_{1:t}} [p(T^{(h)}_{1:t} | D_{1:t})]_{RA} [p(T^{(l)}_{1:t} | T^{(h)}_{1:t}, D_{1:t})]_{UA},

where T^{(h)}_{1:t} and T^{(l)}_{1:t} represent a set of tracklets with high confidence (i.e., Ω(T^i) > th_Ω, with th_Ω = 0.5) and a set of tracklets with low confidence, respectively. In the above equation, the tracking problem is solved in two phases. In the first phase, tracklets with high confidence are locally associated with the provided detections (the part denoted by RA), while tracklets with low confidence, which are more likely to be fragmented, are globally associated with other tracklets and detections in a second, global phase (UA).

Figure 1. The framework of the proposed algorithm. The symbols in the gray bounding boxes are the input to each processing stage and the symbols in the white bounding boxes are the output.
In our framework, we follow the same ideas, though we use the four-stage hierarchical association concept proposed in [27] to find the optimal assignments for the local tracklet-to-detection and global tracklet-to-tracklet associations. However, we extend the approach of [27] by considering an appearance-based tracker associated with each tracked object, to better characterize motion-less or occluded objects, along with a detection refinement process to manage inaccurate detections. The flowchart of the proposed method is shown in Figure 1.
At each stage, the tracklet-to-detection or tracklet-to-tracklet assignment is solved by using the Hungarian algorithm [41]. For each frame, we first apply a motion compensation-based object detector to detect objects of interest (see Section 3.3). After the local tracklet-to-detection association of Stage-1, a tracklet state analysis, involving an appearance-based tracker (Section 3.5) and a Kalman Filter tracker (Section 3.6), is used to characterize motion-less or occluded objects (Section 4.1.2), and a detection refinement process is used to manage inaccurate detections which have not been associated to tracklets (Section 4.1.3). After a first global tracklet-to-detection association in Stage-2, the un-matched detections are used to generate new tracklets in Stage-3. Some of these new tracklets are used to re-link the lost tracklets during the global tracklet-to-tracklet association of Stage-4. Stage-4 also handles tracklet termination. All the symbols used in Figure 1 are introduced in the following.

Hierarchical groups of detections and tracklets
We follow the ideas of Ju et al. [27] and define hierarchical groups of tracklets and detections as follows. In each frame t, an object detector (see Section 3.3) detects objects of interest and produces the set D_t of detections, the elements of which are associated to tracklets during the first two association stages. During the association process, the set D_t is decomposed into four sets: D^{M1}_t and D^{U1}_t, the matched and un-matched detections of Stage-1, respectively, and D^{M2}_t and D^{U2}_t, the matched and un-matched detections of Stage-2, respectively.
During the hierarchical association process, the set of tracklets in the t-th frame, T_t, is decomposed into three disjoint subsets, T_t = T^A_t ∪ T^C_t ∪ T^I_t, with T^A_t the active tracklet set, T^C_t the candidate tracklet set, and T^I_t the inactive tracklet set.
• The active tracklet set, T^A_t, includes the tracklets corresponding to the currently existing objects and is composed of three disjoint subsets, T^A_t = T^{An(h)}_t ∪ T^{A(h)}_t ∪ T^{A(l)}_t, with T^{An(h)}_t the new active tracklet (recently generated tracklet) set with high confidence, T^{A(h)}_t the reliable active tracklet set with high confidence, and T^{A(l)}_t the un-reliable active tracklet set with low confidence. They are formally defined as:

T^{An(h)}_t = {T^i_t | L(T^i_t) ≤ th_L, Ω(T^i_t) > th_Ω}, T^{A(h)}_t = {T^i_t | L(T^i_t) > th_L, Ω(T^i_t) > th_Ω}, T^{A(l)}_t = {T^i_t | Ω(T^i_t) ≤ th_Ω},

where th_L is a threshold on the tracklet length L(•) for distinguishing new tracklets from old ones, and th_Ω is a threshold on the tracklet confidence Ω(•) for characterizing whether the tracklet is reliable or un-reliable (e.g., likely to drift or be lost).
• The candidate tracklet set T^C_t includes the tracklets waiting for enough matched detections in the third stage before being added as new active tracklets.
• The inactive tracklet set T^I_t includes two disjoint subsets, T^I_t = T^{Io}_t ∪ T^{Ie}_t, where T^{Io}_t and T^{Ie}_t represent the lost tracklet set and the terminated tracklet set, respectively. T^{Io}_t includes tracklets corresponding to temporarily lost objects, due to long-term occlusions, while the terminated tracklet set T^{Ie}_t includes the disappeared objects. Each subset is defined as:

T^{Io}_t = {T^i_t | Ω(T^i_t) < th_I, t − t^i_e < th_e}, T^{Ie}_t = {T^i_t | t − t^i_e ≥ th_e},

where th_I is a threshold for distinguishing active and non-active tracklets, t^i_e is the last frame of the active tracklet, and th_e is a threshold to terminate the tracklet.
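The grouping rules above can be sketched as a small classification routine. The field names, the order of the checks, and the threshold values below are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass

# Illustrative thresholds; the paper tunes th_L, th_Omega, th_I and th_e experimentally.
TH_L, TH_OMEGA, TH_I, TH_E = 5, 0.5, 0.1, 30

@dataclass
class Tracklet:
    length: int        # L(T^i_t): number of associated states
    confidence: float  # Omega(T^i_t) in [0, 1]
    last_active: int   # t^i_e: last frame with an associated detection

def classify(trk: Tracklet, t: int) -> str:
    """Assign a tracklet to one of the hierarchical subsets."""
    if t - trk.last_active >= TH_E:
        return "inactive/terminated"      # T^{Ie}_t: disappeared object
    if trk.confidence < TH_I:
        return "inactive/lost"            # T^{Io}_t: temporarily lost object
    if trk.confidence > TH_OMEGA:
        # High confidence: new if still short, reliable otherwise.
        return "active/new" if trk.length <= TH_L else "active/reliable"
    return "active/unreliable"            # T^{A(l)}_t: likely drifting or fragmented

print(classify(Tracklet(length=3, confidence=0.8, last_active=100), t=100))
```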
Figure 2 illustrates the tracklet status changes in time according to the tracklet confidence. The overall process is summarized hereafter. In Stage-1, we determine the best associations between the previous set of active tracklets T^A_{t−1} and the detection set D_t at frame t. Then, the states of the matched tracklets are updated based on the associated detections and the appearance-based predictions. For the un-matched tracklets, a tracklet analysis (Section 4.1.2), using the appearance-based predictions, is performed to update their states. According to the tracklet analysis, some tracklets are updated using the appearance-based prediction and others are updated using a motion-based prediction. Then, the tracklet confidence values are estimated using the associated detections. Based on the confidence value, a tracklet is assigned to the subset T^{A(h)}_t or T^{A(l)}_t. In Stage-2, the association between the unreliable tracklets, T^{A(l)}_t, and the un-matched detections, D^{U1}_t, is performed to handle drifting targets caused by frequent occlusions. The states of the tracklets which have been matched with detections are updated using the associated detections, and these tracklets are assigned to T^{A(h)}_t. The tracklets un-matched to detections are moved to the lost tracklet set, T^{Io}_t, when their confidence Ω(T^i_t) is lower than a given threshold th_I (i.e., Ω(T^i_t) < th_I). Then, in Stage-3, the association between the candidate tracklets, T^C_{t−1}, and the remaining un-matched detections, D^{U2}_t, is performed to update the set of candidate tracklets, T^C_t, or to generate new active tracklets in T^{An(h)}_t.
Finally, in Stage-4, the association between the lost tracklets, T^{Io}_t, in the inactive tracklet set and the new tracklets is performed to merge fragmented tracklets of the same object after long-term occlusions. The inactive tracklets which are not associated to new tracklets within t − t^i_e ≥ th_e frames are terminated and included in the set T^{Ie}_t after the fourth stage. The four stages are detailed in Section 4.

On-line detection
In our framework, we use the method described in [19,22] as on-line detector. The detector models the background through a dual-mode SGM and compensates the motion of the camera by mixing neighboring models. Modeling through a dual-mode SGM prevents the background model from being contaminated by foreground pixels, while still allowing the model to adapt to changes in the background. After the detection step, a post-processing step, consisting of dilation and erosion, is performed to merge scattered detections. Finally, a bounding box is estimated around every detected blob. The detector achieves real-time performance with a low computational complexity, though it still produces missed and false detections.
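The dilate-then-erode merging step can be sketched as a morphological closing followed by connected-component labeling. The 3×3 structuring element and single pass are assumptions, since the paper does not specify kernel sizes:

```python
import numpy as np
from scipy import ndimage

def postprocess(mask: np.ndarray):
    """Merge scattered foreground pixels via morphological closing
    (dilation then erosion) and return one (x, y, w, h) bounding box
    per remaining blob."""
    s = np.ones((3, 3), bool)  # assumed structuring element
    closed = ndimage.binary_erosion(ndimage.binary_dilation(mask, s), s)
    labels, _ = ndimage.label(closed)
    boxes = []
    for sl in ndimage.find_objects(labels):
        y0, x0 = sl[0].start, sl[1].start
        boxes.append((x0, y0, sl[1].stop - x0, sl[0].stop - y0))
    return boxes

# Two nearby fragments of the same object are merged into one box.
m = np.zeros((10, 10), bool)
m[2:4, 2:4] = True   # fragment 1
m[2:4, 5:7] = True   # fragment 2, separated by a one-pixel gap
print(postprocess(m))  # a single merged bounding box
```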
The detection results are illustrated in Figure 3. Most of the missed and false detections are caused by occlusions or motion-less objects. Figure 3(a) shows a reliable detection bounding box, which perfectly encloses the object. However, in the case of slowly moving objects, the bounding box may cover only part of the object (Figure 3(b)). The detector can also provide two (or more) bounding boxes for a single object (Figure 3(c)). In the following, the above cases are denoted as Motion-I type detections. Furthermore, motion-less objects cannot be detected with the used algorithm, and we denote such cases as Motion-II type detections, as shown in Figure 3(d).
In our algorithm, we define two occlusion cases: Occlusion-I and Occlusion-II. Occlusion-I includes all occlusions caused by other tracked objects. We define the object in front as the occluder and the object behind as the occluded. In general, a good detection bounding box can be obtained for the occluder object. However, when two (or more) objects are very close, only one detection is obtained (Figure 3(e)) and the size of the bounding box matches only one of the two objects (Figure 3(f)). The Occlusion-II case includes the occlusions which are caused by static objects (obstacles) within the environment (e.g., trees and buildings). This case is more challenging because of the lack of hard temporal (frame-to-frame) constraints and the unreliable object representation given by the detected bounding boxes, which do not match the object size, as shown in Figure 3(g). The Occlusion-II case also includes objects that are fully occluded by the environment (Figure 3(h)).
To deal with the unreliable detections described above, we implemented a detection refinement process (see Section 4.1.3) in which the states of the current tracklets are used to analyze and refine unreliable detections for further tracklet-to-detection associations.

Tracklet confidence
The tracklet confidence, Ω(T^i_t), expresses how well the constructed tracklet matches the real trajectory of the target. In our framework it is defined as:

Ω(T^i_t) = Ω_Λ(T^i_t) · Ω_o(T^i_t),

where Ω_Λ(T^i_t) and Ω_o(T^i_t) are the affinity and observation confidence terms, respectively. Depending on the association stage J ∈ {1, ..., 4}, the affinity confidence term Ω_Λ(T^i_t) is calculated using an affinity model Λ_J(T^i_t, d^j_k) involving the appearance, shape and motion of the objects. The affinity models used are defined in Section 4. The observation confidence term Ω_o(T^i_t) is computed using the tracklet length L(T^i_t) and the number of missing detections L_M = (t^i_e − t^i_s + 1 − L(T^i_t)), where t^i_s and t^i_e are the first and last frames of the tracklet. Here, w_d is a control parameter depending on the performance of the detector, discussed in the experimental Section 5.2.1, and w^i_p is a control parameter depending on the performance of the i-th tracklet prediction, defined in Eq. (24) of Section 4.1.2. The observation confidence Ω_o(T^i_t) decreases rapidly if the detection responses of the tracklet T^i_t are missing over L_M frames (heavily occluded tracklet). A tracklet is considered a reliable tracklet, T^{i(h)}_t ∈ T^{A(h)}_t, if it has a high confidence, i.e., Ω(T^i_t) > th_Ω (th_Ω is set to 0.5 in our experiments); otherwise it is considered a fragmented tracklet with low confidence, T^{i(l)}_t ∈ T^{A(l)}_t.
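The behaviour described above — an affinity term scaled by an observation term that decays with the number of missed detections — can be sketched as follows. The exact functional form and the way w_d and w_p enter are assumptions, not the paper's equation:

```python
import math

def tracklet_confidence(affinities, length, missed, w_d=0.2, w_p=1.0):
    """Sketch of Omega(T) = Omega_Lambda * Omega_o.

    affinities: affinity scores of the detections associated to the tracklet
    length:     tracklet length L(T)
    missed:     number of missing detections L_M
    """
    omega_aff = sum(affinities) / max(len(affinities), 1)     # mean association affinity
    omega_obs = math.exp(-w_d * missed / max(w_p * length, 1e-6))  # decays with L_M
    return omega_aff * omega_obs

conf = tracklet_confidence([0.9, 0.8, 0.85], length=3, missed=0)
print(conf > 0.5)  # a well-supported tracklet is reliable (th_Omega = 0.5)
```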

Appearance based prediction
Object appearance modeling is important in our framework for both the tracklet state analysis and the detection refinement processes. To maintain a reliable appearance model of the tracklets, we make use of the discriminative appearance model of the compressive tracking (CT) algorithm of [28,29]. With each object i we associate a fast compressive tracking (FCT) tracker, as proposed in [29].
We summarize hereafter the main components of the CT algorithm, namely (1) the Naïve Bayes classifier update and (2) the target detection; the reader is referred to [28,29] for the algorithmic details.
1. Naïve Bayes classifier update: CT samples some positive samples near the current target location, and negative samples far away from the object center. To represent a sample z ∈ R^{w×h}, CT uses a set of rectangle features and extracts features of low dimensionality using a very sparse random measurement matrix R ∈ R^{n×m}. Let x ∈ R^m be the high-dimensional image feature, formed by concatenating the target image convolved with rectangle filters (represented as column vectors), and a = Rx ∈ R^n the lower-dimensional compressive feature, with n ≪ m. Each element a_i of the low-dimensional feature a is a linear combination of spatially distributed rectangle features at different scales. A simple Bayesian model is used to construct a classifier based on the positive (y = 1) and negative (y = 0) sample features. The compressive sensing algorithm assumes that all the lower-dimensional samples of the target are independent of each other, so the classifier response is

H(a) = Σ_{i=1}^{n} log( p(a_i | y = 1) / p(a_i | y = 0) ).     (14)

The parameters of the Naïve Bayes classifier are incrementally updated according to the four parameters of the classifier's Gaussian conditional distributions (µ_1, σ_1, µ_0, σ_0) and an update rate λ > 0.
2. Target detection: the candidate region corresponding to the maximum H(a) is regarded as the tracking target location. See [28] for the detailed implementation.
The overall performance, in terms of speed and tracking accuracy, of the CT algorithm has been significantly improved by the fast compressive tracking (FCT) presented in [29]. While CT samples a fixed rectangular region in single-pixel steps, FCT improves upon this by introducing a coarse-to-fine search strategy to reduce the computational complexity of the detection procedure.
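A minimal sketch of the classifier at the core of CT/FCT: Gaussian class-conditional models over compressive features, incrementally updated with rate λ, and a response H(a) given by the sum of per-feature log-likelihood ratios. The feature dimension, initial variances and the synthetic samples are illustrative:

```python
import numpy as np

class NaiveBayesCT:
    """Gaussian Naive Bayes over compressive features, as in CT [28]."""
    def __init__(self, n, lam=0.85):
        self.mu1 = np.zeros(n); self.s1 = np.ones(n)   # positive class (y = 1)
        self.mu0 = np.zeros(n); self.s0 = np.ones(n)   # negative class (y = 0)
        self.lam = lam                                  # update rate lambda

    def update(self, pos, neg):
        """Incremental update of (mu, sigma) from new positive/negative samples."""
        for mu, s, x in ((self.mu1, self.s1, pos), (self.mu0, self.s0, neg)):
            m, sd = x.mean(axis=0), x.std(axis=0) + 1e-6
            # Variance update uses the old mean, then the mean is blended.
            s[:] = np.sqrt(self.lam * s**2 + (1 - self.lam) * sd**2
                           + self.lam * (1 - self.lam) * (mu - m)**2)
            mu[:] = self.lam * mu + (1 - self.lam) * m

    def response(self, a):
        """H(a): sum of per-feature Gaussian log-likelihood ratios."""
        def logp(x, mu, s):
            return -0.5 * np.log(2 * np.pi * s**2) - (x - mu)**2 / (2 * s**2)
        return float(np.sum(logp(a, self.mu1, self.s1) - logp(a, self.mu0, self.s0)))

rng = np.random.default_rng(0)
clf = NaiveBayesCT(n=8)
clf.update(pos=rng.normal(2.0, 0.5, (20, 8)), neg=rng.normal(-2.0, 0.5, (20, 8)))
# A sample near the positive mean scores higher than one near the negative mean.
print(clf.response(np.full(8, 2.0)) > clf.response(np.full(8, -2.0)))
```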
In our implementation, for each new active tracklet T^{i(h)}_t ∈ T^{An(h)}_t, the latest object state x^i_t = (p^i_t, w^i_t, h^i_t, v^i_t) is used to initialize an FCT-based tracker, which keeps the four parameters of its appearance model (µ_1, σ_1, µ_0, σ_0). At each new frame t, the coarse-to-fine sampling strategy [29] is used to crop a set of candidate samples around the previous location of the target. The sample which obtains the maximal classifier response (Eq. (14)) is selected as the current appearance-based prediction of the target's location, lc^i_t. The FCT tracker outputs a target state denoted as c^i_t = (lc^i_t, wc^i_t, hc^i_t), with wc^i_t and hc^i_t the width and height of the corresponding bounding box, respectively. In our implementation of the FCT algorithm, we use a dynamic learning rate defined as λ = Ω(T^i_t); for the lost tracklets T^i_t ∈ T^{Io}_t, we set λ = 0 to stop the update. For a terminated tracklet T^i_t ∈ T^{Ie}_t, we delete the appearance model.

Motion based prediction
The motion model describes the dynamic movement of the tracked objects, which can be used to predict the potential position of the objects in future frames, especially under occlusion. In most cases, it is assumed that a given object moves smoothly in the world and, hence, that its apparent image motion is also smooth [9]. A linear motion model based on the Kalman Filter (KF) is the most used model in MOT [24,42,43]. Given the motion model of a moving object, the KF provides an optimal estimate of its position at each time step.
In our framework, we use a KF to predict the position and velocity of a target object. For each tracked object we maintain a Kalman Filter state xk^i_t = (pk^i_t, vk^i_t). We use the propagation equation of the KF to predict the object's state when it is not associated to any detection, and the update equation of the KF to update the state of the object when it is associated to a detection. In the latter case, the observation vector is the center location of the associated detected blob, given by its coordinates p_d = (p(x), p(y)). For the state (p(x), p(y), v(x), v(y)) and a constant-velocity model, the state transition matrix is defined as

F = [1 0 Δt 0; 0 1 0 Δt; 0 0 1 0; 0 0 0 1],

with Δt the time between two frames, and the observation matrix is defined as

H = [1 0 0 0; 0 1 0 0].
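A minimal constant-velocity KF matching the matrices above (Δt = 1 frame; the process and measurement noise values Q and R are assumptions):

```python
import numpy as np

dt = 1.0  # one frame; the paper does not state the time step explicitly

# Constant-velocity model over the state (p_x, p_y, v_x, v_y).
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], float)
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], float)   # observe the blob centre only

def predict(x, P, Q):
    """Propagate the state when no detection is associated."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z, R):
    """Refine the state with the centre z of an associated detection."""
    S = H @ P @ H.T + R                    # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P

# A target at (10, 10) moving one pixel right per frame.
x, P = np.array([10., 10., 1., 0.]), np.eye(4)
Q, R = 0.01 * np.eye(4), 0.1 * np.eye(2)
x, P = predict(x, P, Q)          # predicted centre: exactly (11, 10)
x, P = update(x, P, np.array([11.2, 10.1]), R)  # pulled towards the detection
```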

Four-stage hierarchical association framework
In this section we describe the different stages of the proposed framework for sequentially and robustly tracking multiple objects.

Stage-1: Local progressive trajectory construction
The first association stage solves the assignment problem between the active tracklets, T^A_{t−1}, and the current detections, D_t, to progressively build the object trajectories. The input pairs for this stage are {(T^i_{t−1}, d^j_t) | T^i_{t−1} ∈ T^A_{t−1}, d^j_t ∈ D_t}, and each association is evaluated using the following affinity model:

Λ_1(T^i_{t−1}, d^j_t) = Λ^1_a(T^i_{t−1}, d^j_t) · Λ^1_s(T^i_{t−1}, d^j_t) · Λ^1_m(T^i_{t−1}, d^j_t),     (15)

where Λ^1_a, Λ^1_s and Λ^1_m are the appearance affinity, the shape affinity and the motion affinity, respectively. They are defined in the following section.

First association via the affinity score
To rapidly evaluate the appearance affinity in real-time applications, a template matching-based approach is used. Each active tracklet maintains its latest template and a historical template set consisting of N^a_H templates (N^a_H = 10 in our experiments). The templates of the detections and tracklets are obtained as a 24-bin Red-Green-Intensity histogram extracted from the image patch within the bounding box. All patches are resized to 64 × 64 pixels to be invariant to object scaling. Let χ_{d^j} be the template of a detection d^j_t, χ^L_{T^i} the latest template of the tracklet T^i_{t−1}, and χ^H_{T^i} its historical template set. Denoting by ρ(•, •) the Bhattacharyya similarity between two templates, we define the appearance affinity, Λ^1_a in Eq. (15), of a tracklet T^i_{t−1} and a detection d^j_t as a weighted combination of the similarity to the latest template and the similarity to the historical templates:

Λ^1_a(T^i_{t−1}, d^j_t) = ω_a ρ(χ_{d^j}, χ^L_{T^i}) + (1 − ω_a) max_{χ_k ∈ χ^H_{T^i}} ρ(χ_{d^j}, χ_k),

where ω_a = Ω(T^i_{t−1}). The shape affinity, Λ^1_s in Eq. (15), between the tracklet and the detection is defined as:

Λ^1_s(T^i_{t−1}, d^j_t) = exp( −( |h^i − h^j| / (h^i + h^j) + |w^i − w^j| / (w^i + w^j) ) ),

where (w^i, h^i) and (w^j, h^j) are the widths and heights of the bounding boxes of the tail of tracklet T^i_{t−1} and of the detection d^j_t, respectively. The motion affinity, Λ^1_m in Eq. (15), is evaluated between the tail of the history of the tracklet and the detection d^j_t based on a linear motion assumption [14]:

Λ^1_m(T^i_{t−1}, d^j_t) = N(p^j_d; p̃^i, Σ), with p̃^i = p^i_tail + v^i_F Δt,

where p^i_tail and p^j_d represent the position of the target T^i_{t−1} and of the detection d^j_t, respectively, v^i_F is the forward velocity of T^i_{t−1}, estimated via the associated Kalman filter (KF) using the latest N^F_v (N^F_v = 4 in our experiments) states of tracklet T^i_{t−1}, and N(•) is a Gaussian distribution function.
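The three affinity terms can be sketched as follows. The Bhattacharyya coefficient stands in for the template similarity ρ, and the motion-affinity spread sigma is an assumption:

```python
import numpy as np

def bhattacharyya_sim(h1, h2):
    """Bhattacharyya coefficient between two normalised histograms
    (1 = identical); used here as the template similarity rho."""
    return float(np.sum(np.sqrt(h1 * h2)))

def shape_affinity(w1, h1, w2, h2):
    """Penalise relative differences in bounding-box width and height."""
    return float(np.exp(-(abs(h1 - h2) / (h1 + h2) + abs(w1 - w2) / (w1 + w2))))

def motion_affinity(p_tail, v_fwd, dt, p_det, sigma=10.0):
    """Gaussian affinity between the linearly predicted position
    p_tail + v_fwd * dt and the detection centre."""
    pred = p_tail + v_fwd * dt
    d2 = np.sum((pred - p_det) ** 2)
    return float(np.exp(-d2 / (2 * sigma**2)))

# Identical 24-bin templates give similarity 1; a nearby detection of
# similar shape and consistent motion scores a high overall affinity.
h = np.full(24, 1 / 24)
aff = (bhattacharyya_sim(h, h)
       * shape_affinity(40, 60, 42, 58)
       * motion_affinity(np.array([100., 50.]), np.array([2., 0.]), 1.0,
                         np.array([102., 50.])))
print(aff)
```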
Then, an association score matrix S^1 is used to express the affinity scores between detections and tracklets, and the Hungarian algorithm [41] is used to determine the tracklet-detection pairs with the lowest total cost in S^1. A detection d^j_t is associated with T^i_{t-1} when the association cost s_ij is less than a pre-defined threshold θ [14].
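The assignment-with-gating step can be sketched as below. The paper uses the Hungarian algorithm [41]; for the tiny matrices of a sketch, exhaustive search over permutations finds the same optimum, so this illustrative Python version avoids external solver dependencies. The threshold value matches the paper's θ = 0.4.

```python
from itertools import permutations

def associate(cost, theta=0.4):
    # cost[i][j]: association cost between tracklet i and detection j.
    # Assumes no more tracklets than detections (sketch only); a real
    # implementation would use the Hungarian algorithm on padded matrices.
    n_t, n_d = len(cost), len(cost[0])
    best, best_cols = float("inf"), None
    for cols in permutations(range(n_d), n_t):
        total = sum(cost[i][j] for i, j in enumerate(cols))
        if total < best:
            best, best_cols = total, cols
    # Gate: keep only pairs whose individual cost is below threshold theta.
    return [(i, j) for i, j in enumerate(best_cols) if cost[i][j] < theta]
```

Pairs surviving the gate become matched tracklet-detection pairs; the rest feed the later stages as un-matched tracklets and detections.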

Tracklet analysis and update based on prediction
Once a tracklet is associated with a detection, the state (position, velocity and size) of the object is updated with the associated detection. However, as the detection's bounding box does not always fully represent the object (see Figure 3(b), 3(c), 3(g)), the location, width and height of the state vector x^i_t of the tracklet T^i_t are estimated by fusing the FCT tracking result c^i_t and the detection d^j_t:

x^i_t = w_f · B(d^j_t) + (1 - w_f) · B(c^i_t),   (20)

where B(·) is the bounding box of d^j_t or c^i_t, the fusion weight w_f is computed from the overlap ratio B(d^j_t) ∩ B(c^i_t) / B(d^j_t) ∪ B(c^i_t), and ∩ and ∪ are the intersection and union operators between bounding boxes, respectively. The velocity v^i_t of the state vector x^i_t is updated using the KF output. In our framework, the detector acts as an unbiased observation model while the FCT tracker refines the results in an adaptive way. This fusion strategy can efficiently handle inaccurate detections, as shown in Figures 4(a)-4(c), especially for objects of Motion-I type.
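A minimal sketch of this detection/tracker fusion, assuming boxes in (x, y, w, h) form and a convex combination with weight w_f (setting w_f = 1 recovers the detection-only update used in Stage-2):

```python
def iou(a, b):
    # Intersection-over-union of two boxes given as (x, y, w, h).
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    ix = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    iy = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def fuse_state(det_box, fct_box, w_f):
    # Convex combination of the detector box and the FCT prediction;
    # w_f = 1 falls back to the detection alone.
    return tuple(w_f * d + (1.0 - w_f) * c for d, c in zip(det_box, fct_box))
```

In the full framework, w_f would be derived from `iou(det_box, fct_box)` so that well-overlapping predictions pull the state toward the tracker output.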
For the un-matched objects (tracklets not associated with detections), the FCT-based prediction, c^i_t, is used to analyze their occlusion state using the constraint ζ(c^i_t, T^i_t) > th_o, where ζ_a(c^i_t, T^i_t) is the appearance similarity between the FCT-tracker prediction c^i_t and the template history of object i (tracklet T^i), t being the latest time object i was updated with an associated detection. Here ζ_a(c^i_t, T^i_t) is used to distinguish the motion-less objects from those occluded by obstacles, and the overlap term ζ_p(·, ·) is adopted to suppress object drift when the FCT-based prediction overlaps with a matched detection (tracklet).
In our experiments, we assume that an object is a motion-less object of Motion-II type when ζ(c^i_t, T^i_t) > th_o (th_o = 0.5); otherwise, it is an occluded object (ζ(c^i_t, T^i_t) ≤ th_o). As shown in Figure 4(d), a motion-less object retains reliable appearance cues, while both the appearance and motion cues are unreliable for the occluded objects in Figures 4(e)-4(h).
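The decision rule and the resulting state-update dispatch can be sketched as below; an illustrative Python fragment with the paper's th_o = 0.5, where `zeta` stands for the appearance similarity ζ(c^i_t, T^i_t).

```python
def classify_unmatched(zeta, th_o=0.5):
    # zeta > th_o: appearance still matches the template history, so the
    # object is motion-less (Motion-II); otherwise it is treated as occluded.
    return "motion-less" if zeta > th_o else "occluded"

def update_unmatched(state_fct, state_kf, zeta, th_o=0.5):
    # Motion-less objects trust the FCT prediction; occluded objects are
    # propagated with the Kalman-filter prediction instead.
    if classify_unmatched(zeta, th_o) == "motion-less":
        return state_fct
    return state_kf
```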
After the tracklet state analysis, the FCT-based prediction c^i_t is used to update the state of a motion-less object (Motion-II), while the states of the occluded objects (both Occlusion-I and Occlusion-II) are updated using the KF prediction. Indeed, to reduce the drifting effect for occluded objects, we assume the targets do not change their motion abruptly and use the KF to predict their next position.

Peer-reviewed version available at Remote Sens. 2018, 10, 1347; doi:10.3390/rs10091347

After the state update, the tracklet confidence (Eq. (11)) of the matched tracklets is updated using the affinity of Eq. (15) and the weight w^i_p defined below. Consequently, according to the confidence level Ω(T^i_t) ≥ th_Ω, they are added to the set T^{A(h)}_t or T^{A(l)}_t.
In estimating the confidence level, w^i_p = ζ_a(c^i_t, T^i_t) is used to slowly reduce the tracklet confidence of motion-less objects according to the appearance similarity, while w^i_p = 0.4 is used to reduce the tracklet confidence of occluded objects so that they become unreliable tracklets in T^{A(l)}_t, which are input to Stage-2 for occlusion analysis.

Detection refinement
Figure 3 illustrates some inaccurate detections caused by two (or more) spatially close objects, which may increase identity switches and false alarms. Therefore, we propose a detection refinement process to solve these problems. For each un-matched detection d^j_t ∈ D^{U1}_t after Stage-1, we delete it from D^{U1}_t when its bounding box overlaps with more than 2 un-matched objects updated by the FCT appearance-based prediction. Thus, the inaccurate detections in Figure 3(b), (c), (e), (f), (g) are deleted if they are not associated with any tracklet. After this detection refinement step, all remaining un-matched detections in D^{U1}_t are used in Stage-2, along with the un-reliable tracklets in T^{A(l)}_t.
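The refinement rule above can be sketched as follows; an illustrative Python fragment assuming boxes in (x, y, w, h) form, where `fct_boxes` are the FCT-predicted boxes of the un-matched objects.

```python
def iou(a, b):
    # Intersection-over-union of two boxes given as (x, y, w, h).
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def refine_detections(unmatched_dets, fct_boxes, max_overlaps=2):
    # Drop a detection whose box overlaps the FCT-predicted boxes of more
    # than `max_overlaps` un-matched objects: such a box likely merges
    # several spatially close targets into one detection.
    kept = []
    for d in unmatched_dets:
        n = sum(1 for t in fct_boxes if iou(d, t) > 0.0)
        if n <= max_overlaps:
            kept.append(d)
    return kept
```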

Stage-2: Handling drifting tracklets
In complex situations of airborne videos, where objects are occluded while the mounted camera changes its motion, conventional on-line tracking methods based on a simplified motion model (e.g., the adopted KF-based constant velocity model) are prone to drifting problems [25,44]. If an object continues drifting, it is difficult to re-assign it to detections or re-appearing objects (Occlusion-I and Occlusion-II). In the proposed framework, the second association stage solves the re-assignment problem between the un-reliable tracklets T^{A(l)}_t and the un-matched detections D^{U1}_t not associated during the first stage. An un-reliable tracklet in T^{A(l)}_t is converted into a reliable tracklet in T^{A(h)}_t if it can be re-associated with a detection; otherwise, it maintains the same state or is converted to an inactive tracklet in T^{Io(l)}_t after the state update. Two aspects are considered in this stage: (1) if the object is occluded by an occluder, it might re-appear around the occluder, so an un-matched detection near the occluder is highly likely to be re-associated with the re-appearing object after occlusion; (2) if the object has been occluded by environmental obstacles, it might re-appear at any position in the image. We assume that the occluded object re-appears within a limited region around the occluder: the longer it has been missing, the larger the search region should be.

Second association via the affinity score
For the current frame t, the input pairs of this association stage are {(T^i_t, d^j_t)}. The affinity of the second association combines the appearance and motion terms, gated by the maximum allowed distance for a detection to be associated with T^i_t; this maximum distance depends on the tracklet confidence Ω(T^i_t) and on L_M, the number of frames in which the i-th object has been missing due to occlusion or un-reliable detection (defined in Eq. (11)).
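The growing search region can be sketched as below. This is an illustrative Python fragment only: the exact dependence on the tracklet confidence in the paper's affinity is not reproduced, and `base` and `growth` are assumed values.

```python
def within_search_region(p_det, p_last, frames_missing, base=30.0, growth=5.0):
    # The allowed re-association distance grows linearly with the number of
    # frames L_M the object has been missing, implementing "the longer it has
    # been missing, the larger the search region".
    radius = base + growth * frames_missing
    dx, dy = p_det[0] - p_last[0], p_det[1] - p_last[1]
    return (dx * dx + dy * dy) ** 0.5 <= radius
```

A detection outside this region is never considered for re-association, which keeps drifting tracklets from grabbing far-away detections.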

Tracklet correction
The second association allows us to re-assign drifting tracklets to the detections of re-appearing objects within a limited time. An association score matrix S^2, defined as in Eq. (19), is used to express the affinity scores between detections and tracklets, and the Hungarian algorithm [41] is used to determine the tracklet-detection pairs with the lowest total cost in S^2. After association, the states and confidence values of the associated tracklets are updated with the associated detections using Eq. (20) and Eq. (11), respectively. Here, to update the state of a re-appearing tracklet, we only use the matched detection and set w_f = 1 in Eq. (20). Finally, the trajectory within the drifting interval is corrected via a linear interpolation between the previous location of the tracklet and the updated one.
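The interpolation over the drifting interval can be sketched as follows; an illustrative Python fragment that fills `n_missing` frames with evenly spaced positions between the last reliable location and the re-associated one (endpoints excluded).

```python
def interpolate_gap(p_start, p_end, n_missing):
    # Linear interpolation of the trajectory over the drifting interval:
    # returns one position per missing frame between p_start and p_end.
    step = [(e - s) / (n_missing + 1) for s, e in zip(p_start, p_end)]
    return [tuple(s + st * (k + 1) for s, st in zip(p_start, step))
            for k in range(n_missing)]
```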

Stage-3: New active tracklet generation
The third association stage solves the assignment problem between the candidate tracklets T^C_{t-1} and the remaining un-matched detections to generate new active tracklets.

Stage-4: Linking fragmented tracklets

In challenging situations, where objects are constantly occluded by other objects or obstacles for a long time, tracklet fragmentation is likely to occur and the same object can be divided into two or more tracklets, as illustrated in Figure 5. Motivated by works on object re-identification [38,39], which build long-term object trajectories based on appearance modeling and matching, the fourth association stage of the proposed framework solves the assignment problem between the lost tracklets T^{Io(l)}_t and the new tracklets T^{An(h)}_t to link these fragmented tracklets, re-identify the lost objects and thereby build longer trajectories. Since targets in airborne videos often have very similar appearance, false tracklet linking might occur if only appearance modeling is used. Thus, both the appearance and motion terms are considered in the fourth stage.

Fourth association via the affinity score
The input pairs of the fourth association in the current frame t are the set {(T^i_t, T^j_t)}. The affinity of the fourth association is defined as:

Λ^4(T^i_t, T^j_t) = Λ^4_a(T^i_t, T^j_t) · Λ^4_m(T^i_t, T^j_t),   (26)

where Λ^4_a(T^i_t, T^j_t) and Λ^4_m(T^i_t, T^j_t) are the appearance and motion affinity scores, respectively. The appearance affinity Λ^4_a(T^i_t, T^j_t) is computed over the template sets of the two tracklets:

Λ^4_a(T^i_t, T^j_t) = (1 / (N^i_H N^j_H)) Σ_l Σ_m ζ(χ^{T_i}_l, χ^{T_j}_m),

where N^i_H and N^j_H are the numbers of templates of the tracklets T^i_t and T^j_t, respectively, χ^{T_i}_l is the l-th template of tracklet T^i_t, and χ^{T_j}_m is the m-th template of tracklet T^j_t. The motion affinity Λ^4_m(T^i_t, T^j_t) is evaluated between the tail of the history of the tracklet T^i_t and the head of the tracklet T^j_t with the time gap Θ_t [14], based on a linear motion assumption:

Λ^4_m(T^i_t, T^j_t) = N(p^j_head; p^i_tail + v^i_F Θ_t, m_F) · N(p^i_tail; p^j_head - v^j_B Θ_t, m_B),

where p^i_tail and p^j_head represent the positions of T^i_t and T^j_t, v^i_F is the forward velocity of T^i_t and v^j_B is the backward velocity of T^j_t, estimated using the KF with the latest and first N^v_B states of the tracklets T^i_t and T^j_t, respectively, and N(·) is a Gaussian distribution function.
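The two linking affinities can be sketched as below; an illustrative Python fragment where templates are normalized histograms, similarity is the Bhattacharyya coefficient, and the Gaussian scale `sigma` is an assumed value.

```python
import numpy as np

def link_appearance_affinity(templates_i, templates_j):
    # Mean pairwise Bhattacharyya coefficient over the two template sets.
    sims = [float(np.sum(np.sqrt(a * b)))
            for a in templates_i for b in templates_j]
    return float(np.mean(sims))

def link_motion_affinity(p_tail_i, v_fwd_i, p_head_j, v_bwd_j, dt, sigma=75.0):
    # The forward prediction of tracklet i must land near the head of
    # tracklet j, and the backward prediction of j near the tail of i.
    fwd = p_tail_i + v_fwd_i * dt
    bwd = p_head_j - v_bwd_j * dt
    def g(d2):
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return float(g(np.sum((p_head_j - fwd) ** 2))
                 * g(np.sum((p_tail_i - bwd) ** 2)))
```

A fragmented pair whose motions are consistent (the forward and backward predictions meet) scores near 1.0 and becomes a candidate for linking.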

Object re-identification via tracklet Linking
The association score matrix S^4, defined as in Eq. (19), is used to express the affinity scores between tracklets in the fourth stage. The Hungarian algorithm [41] is used to determine the (i, j) tracklet pairs with the lowest total cost in S^4. The tracklet T^j_t is associated with T^i_t when the association cost s_ij is less than a pre-defined threshold θ [14]. If a lost tracklet T^i_t and a new tracklet T^j_t are associated, they are considered the same object: they are merged, their trajectories are linked with a linear interpolation, and the new tracklet T^j_t is assigned the ID of the lost one T^i_t. Thus, the lost objects are re-identified using the tracklet linking process described above.

Datasets
We evaluated our approach on two datasets, the VIVID dataset [45] and the SAIIP dataset. Figure 6 illustrates a few images from these datasets. The first dataset includes 5 visible data sequences and 3 thermal IR data sequences. The second dataset includes 4 sequences captured using a UAV belonging to the Northwestern Polytechnical University. Table 1 lists the different sequences with their main challenging situations, including Illumination Variation (IV), Scale Variation (SV), Occlusion (OCC), Background Occlusion (BOC), Motion Variation (MV), Image Blurring (IB) and Shadow Interference (SI).
In the EgTest01 sequence, the vehicles loop around a runway and then drive straight; some vehicles have very similar appearance. In the EgTest02 sequence, two sets of three vehicles pass by each other on a runway. Scale changes occur because the airborne camera circles the scene. Data association for the EgTest02 sequence is more difficult than for the EgTest01 sequence due to severe occlusions. This also happens in the EgTest03 sequence, where two sets of three vehicles pass by each other on a runway. In the EgTest04 sequence, a line of vehicles travels down a red dirt road. In the EgTest05 sequence, a vehicle moves along a dirt road in a wooded area; occlusions and illumination variations occur when the vehicle passes in and out of tree shadows.
The PkTest01, PkTest02 and PkTest03 sequences are thermal IR data. In the PkTest01 sequence, the vehicles are frequently occluded by trees. In the PkTest02 sequence, the vehicles stop at an intersection and then continue; the main issues are occlusions, shadows and camera auto-gain. The PkTest03 sequence contains a line of vehicles in a stop-and-go scenario. As in the previous sequence, occlusions, shadows and camera auto-gain are prevalent. Moreover, the vehicles are small, and the camera viewpoint is nearly nadir.
All the sequences of the SAIIP dataset (SpTest01, SpTest02, SpTest03 and SpTest04) were captured over a provincial road. There are fewer occlusions because the camera pointed at the road, and most vehicles move at high speed while keeping a safe distance from each other. However, several targets have very similar appearance, and some of them stop at the crossroad. There are also some long-bodied trucks which might be detected as two separate objects.

Parameters Setting
The proposed MOT framework has been implemented on a PC with an Intel Core 2.40 GHz CPU and 32 GB RAM. In the following, we describe the parameter settings of each module of the framework.

Parameters of the detector
We first compare three recently used motion compensation-based detectors, and then analyze the parameter settings of the selected detector. The three compensation-based detectors are the basic compensation-based detector (BCD) [20], the MHI detector [20] and the SGM detector [22]. All source codes were provided by the authors. For a fair comparison, the same parameter settings used by the authors in their original publications are kept. Both the BCD and MHI detectors assume a pre-defined threshold, T_θ = 20, to determine the detections in each image. The SGM detector relies on a grid of size T_θ × T_θ with T_θ = 10 [22] for determining the detections.
For the quantitative evaluation of the detectors' performance, we use the detection ratio (DTR) and the false-alarm ratio (FAR). As shown in Figure 7, the MHI-based approach can efficiently reduce the FAR compared to the BCD- and SGM-based approaches. However, the required forward motion history is not suitable for practical applications. In our implementation, we select the SGM-based detector, which has a DTR and FAR comparable to the MHI-based approach while running in real time.
Using motion compensation-based approaches, the detection performance depends on the velocity of the tracked objects and the complexity of the background. As such, a single fixed determining threshold T_θ is not suitable for all test sequences. Table 2 lists the DTR and FAR ratios, along with the computational cost in terms of frames per second (FPS), of the SGM-based detector with different determining thresholds on the VIVID and SAIIP datasets, respectively. As can be noticed, on the VIVID dataset both the DTR and FAR ratios decrease as the determining threshold increases. The results on the SAIIP dataset are very similar, with less computation when the determining threshold is increased. The computational cost on the SAIIP dataset is higher than on the VIVID dataset due to the larger image size.
For the experiments reported in the following sections, we set w_d = 0.5 in Eq. (11) and T_θ = 10 for the 5 visible data sequences and T_θ = 5 for the 3 thermal IR data sequences of the VIVID dataset. For the SAIIP dataset we set T_θ = 15 and w_d = 0.7. Note that w_d is set to a large value when the detector provides high accuracy [14].

Parameters of hierarchical framework
All parameters of the tracking framework have been set empirically, and remained unchanged for all datasets.
• For the affinity models of Eq. (15) and Eq. (26), the covariance parameters m_F and m_B are set to diag[30², 75²].
• The same threshold θ = 0.4 is used for the association score matrices S^1, S^2, S^3 and S^4 to determine the association results.
• For the FCT trackers, the search radius for drawing positive samples in the on-line appearance-based classifier is set to α = 4, generating 45 positive samples. The inner and outer radii for the negative samples are set to β = 8 and ζ = 30, respectively, to randomly select 50 negative samples. The initial learning rate λ of the classifier is set to 0.9. The size of the random matrix is set to 100.
• For the Kalman filter model, the process (Q) and measurement (R) noise covariance matrices are fixed, with R = diag(0.1, 0.1).
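The constant-velocity Kalman filter used throughout the framework can be sketched as follows. This is an illustrative Python version with the measurement noise R = diag(0.1, 0.1) from the settings above; the process-noise value Q here is an assumption for illustration only.

```python
import numpy as np

# Constant-velocity state [x, y, vx, vy]; time step of one frame.
F = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
R = np.diag([0.1, 0.1])   # measurement noise, as in the parameter settings
Q = np.eye(4) * 1e-2      # process noise: illustrative value

def kf_predict(x, P):
    # Propagate the state and covariance one frame forward.
    return F @ x, F @ P @ F.T + Q

def kf_update(x, P, z):
    # Correct the prediction with a position measurement z = [x, y].
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P
```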

Comparison to state-of-the-art frameworks
To demonstrate the tracking performance of our proposed framework, we compared it to the MOT approaches of [14] and [10] on the selected datasets. All approaches, including ours, adopt the same detection configuration, and a window size of 5 frames is used to remove unreliable shorter tracklets. For both [14] and [10], we used the publicly available codes provided by the authors.

Evaluation metrics
The popular evaluation metrics defined in [46], as listed in Table 3, are used for evaluating the performance: the precision of the intersection area over the union area of bounding boxes (PR), the number of trajectories in the ground truth (GT), the ratio of mostly tracked trajectories (MT), the ratio of mostly lost trajectories (ML), the ratio of partially tracked trajectories (PT), and the number of identity switches (IDS).
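The trajectory-level ratios can be sketched as below; an illustrative Python fragment using the conventional 80%/20% coverage thresholds from [46], where `tracked_fractions` holds, for each ground-truth trajectory, the fraction of its span that was successfully tracked.

```python
def mt_ml_pt(tracked_fractions, mt_th=0.8, ml_th=0.2):
    # A ground-truth trajectory is Mostly Tracked (MT) if covered for at
    # least 80% of its span, Mostly Lost (ML) if at most 20%, and Partially
    # Tracked (PT) otherwise; ratios are over the GT trajectory count.
    gt = len(tracked_fractions)
    mt = sum(f >= mt_th for f in tracked_fractions) / gt
    ml = sum(f <= ml_th for f in tracked_fractions) / gt
    return mt, 1.0 - mt - ml, ml
```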

Comparison of data association
The qualitative comparison between different versions of the proposed system on the EgTest02 sequence is given in Table 4. The two considered versions, S1 and S2, are defined as follows:
• S1: the framework without tracklet analysis and detection refinement. It corresponds to the approach of [14] and adopts the method of [14] to estimate the tracklet state: the position and velocity of the matched tracklets are updated with the associated detections, the un-matched tracklets are updated using the KF motion-based predictions, and the size of each object is updated by averaging the associated detections of the recent past frames.
• S2: the full proposed HATA framework, as illustrated in Figure 1.
Comparing the results of the frameworks S1 and S2, one can notice the effect of the tracklet analysis and detection refinement processes in the proposed framework S2. Notice from Table 4 that the system S1 performs well for the MT and ML measures, while the high false-alarm rate and unreliable detections introduce a high IDS measure, due to the inaccurate location and size of the detections, which affect the association between tracklets and detections. As expected, the proposed framework S2 improves the performance on most metrics and efficiently reduces the IDS measure compared to S1. Figure 8 illustrates the tracking results of S1 and S2 using the threshold T^3_θ on the EgTest02 sequence. As shown in Figure 8, the targets with ID-2 and ID-3 in frame #390 have an accurate location and size using the framework S2, even with inaccurate detection inputs. This is due to the use of the FCT tracker to correct the state of the tracklet (see Eq. (20)). Similarly, S2 performs well in frame #460 with the help of the tracklet analysis and detection refinement processes, which efficiently avoid false new tracklet generation (ID-11 in system S1). This also happens in frame #532.

Comparisons to other MOT frameworks
A quantitative comparison between our proposed framework and state-of-the-art algorithms is given in Table 5. Both [14] and [10] achieve good results with reliable detections, but perform poorly under inaccurate detections. Instead, our algorithm improves the performance on the considered evaluation metrics (ML, MT and IDS). The qualitative tracking results of our approach are shown in Figure 9 and Figure 10.
Results using the VIVID dataset: Figure 9 illustrates the tracking results on the 8 sequences of the VIVID dataset. For the EgTest01 sequence, all considered approaches perform well because of the reliable detections. Our proposed framework achieves the best results when the appearance and motion of the vehicles vary during the loop-around period (frames #28, #172 and #323). In the EgTest02 sequence, two sets of vehicles pass by each other on a runway and one set is occluded by the other between frames #443, #482 and #670. Both [14] and [10] produce ID switches for most of the tracked targets, while HATA correctly identifies most of the tracklets. HATA also performs well on the EgTest03 sequence. In the EgTest04 sequence, only HATA solves the ID switch problem when the vehicle with ID-3 is occluded by trees in frame #721. In the EgTest05 sequence, HATA handles well the occlusions in frames #590 and #701 and the illumination changes when the targets pass in and out of the shadowed wooded area. Figures 9(f)-(h) illustrate the tracking results on the thermal IR sequences PkTest01, PkTest02 and PkTest03. In the PkTest01 sequence, only HATA correctly identifies the vehicle that is frequently occluded by trees between frames #128 and #278. In the PkTest02 sequence, our algorithm keeps tracking the vehicles that stop at the intersection in frame #561 and restart moving after frame #654. As for visible data, HATA solves the occlusion and illumination variation problems in IR data, as shown in frames #833 and #1229. In the PkTest03 sequence, the vehicles are frequently occluded by trees after frame #298, and HATA robustly preserves the correct ID for each tracked target in frames #374 and #386.
Results using the SAIIP dataset: Figure 10 illustrates the tracking results on the SAIIP dataset. For the SpTest01 sequence, all the moving objects are well detected (Figure 10(a)), and HATA efficiently tracks all the detected objects. The false alarms are
removed when the bounding box size is smaller than a pre-defined threshold T_fal = 5 × 5. This strategy is also adopted for the SpTest02, SpTest03 and SpTest04 sequences. The SpTest02 sequence is more challenging than the SpTest01 sequence because the vehicles slow down. HATA solves the motion-less problem, as shown in frames #564 and #709 of Figure 10(b).
The proposed method was implemented in MATLAB on a PC with an Intel Core 2.40 GHz CPU and 32 GB RAM, without parallel or GPU processing. The average speed of the proposed method is about 18 FPS on the VIVID dataset and 15 FPS on the SAIIP dataset, excluding the detection step. The results show the improved performance of the proposed method compared to state-of-the-art methods. However, it is worth noting the limitations of our proposed algorithm. Generally, the limitation comes from three aspects: (i) unreliable object initialization caused by motion-less or occluded objects; (ii) reduced performance of the FCT-based tracker when objects abruptly change their appearance, which causes unreliable tracklet state analysis and hence unreliable matching; (iii) the fixed parameters of the detector, which are not suitable for other types of datasets.

Conclusions
In this paper, an on-line multi-object tracking method has been proposed for airborne videos to solve the association problems caused by unreliable object detections. To robustly track objects in complex scenarios, we proposed an efficient hierarchical association framework based on tracklet confidence and FCT-based appearance tracking for multiple object tracking in airborne videos. The proposed framework handles well tracklet generation, progressive trajectory construction, tracklet drifting and fragmentation. Each association stage of the hierarchical framework solves a different assignment problem, achieving reliable performance at 16 frames per second in MATLAB. The obtained results demonstrate the effectiveness of our framework compared to state-of-the-art methods. In the future, we will investigate approaches combining the proposed motion compensation-based detector with an on-line multi-object detection approach to reduce the false-alarm rate of detections, and consider a deep learning approach for better object re-identification after long-term occlusions.

Figure 3.
Figure 3. Illustration of detection results. The bounding boxes in red and dotted green are the detection results and the ground truth, respectively.

Figure 4.
Figure 4. Illustration of the Stage-1 association. The red bounding boxes are the detection results. The green bounding boxes are the appearance-based predictions of the FCT tracker. The un-matched objects are marked with yellow dotted circles and yellow color. (a)-(d): matched objects having a high tracklet confidence; (e)-(h): matched objects having a low tracklet confidence.

Figure 5.
Figure 5. Fragmented tracklet under long-term occlusions. (a) Two tracked objects ID-3 and ID-4; (b) the object ID-3 is partially occluded, and (c) heavily occluded by trees; (d) the lost object ID-3 is switched to ID-6 when it reappears after the occlusion.
The detection ratio (DTR) is defined as r_D = N^D_O / N^T_O and the false-alarm ratio (FAR) as r_F = (N^A_O - N^T_O) / N^A_O, where N^D_O represents the effective number of detected objects, N^T_O the number of true objects, and N^A_O the total number of detections. A detection with bounding box B_D is considered successful if SR = Area(B_D ∩ B_GT) / Area(B_D ∪ B_GT) ≥ T_SR (T_SR = 0.5 in our experiments) for a ground-truth bounding box B_GT. To analyze the influence of the threshold T_θ of the considered motion compensation-based detectors, we define different values T^v_θ = 10 · θ_v, with θ_v = {0.5, 0.75, 1, 1.25, 1.5}.

Figure 6.
Figure 6. Scenes from the public DARPA VIVID dataset (first two rows) and the SAIIP dataset (last row).

Figure 7.

Figure 8.
Figure 8. Detection and tracking results. First row: the detection results. Second row: the bounding box for each detection. Third row: the tracking results using the framework S1. Fourth row: the tracking results using the framework S2.
Both the SpTest03 and SpTest04 sequences are captured around a crossroad where the vehicles slow down, stop or change direction. In the SpTest03 sequence, as shown in Figure 10(c), HATA correctly identifies the object with ID-4 when it changes direction in frame #122. Moreover, HATA achieves a long-term trajectory for the object with ID-1 in frame #245. In the SpTest04 sequence, many vehicles pass through the crossroad. As shown in Figure 10(d), HATA correctly identifies the objects with ID-3 and ID-7 in frame #98, and the objects with ID-3 and ID-10 in frame #119.

Figure 10.
Figure 10. The results on 4 sequences from the SAIIP dataset.

Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 13 July 2018 doi:10.20944/preprints201807.0238.v1
The parameters of the appearance model are re-initialized every 5 frames to avoid large scale variations in both the x and y directions.

In the above equation, ζ^2_s(T^i_t) is an operator which returns a possible occluder tracklet T^k_t, or ∅ to indicate that the occluder is an environmental obstacle. A tracklet T^k_t is defined as an occluder of T^i_t if the overlap ratio ζ_p(c^i_t, T^k_t) (defined in Eq. (23)) between the bounding box of the FCT-based tracker prediction c^i_t of T^i_t and the bounding box of the tracklet T^k_t is no less than a given overlapping threshold th_o, i.e. ζ_p(c^i_t, T^k_t) ≥ th_o. The function dist(d

Table 2.
Comparison of detection results with different detection thresholds T^v_θ.

Table 4.
Comparison of tracking results on the EgTest02 sequence with different detection thresholds T^v_θ.

Table 5.
Tracking results on the selected datasets. The best performing method is shown in bold.