Integration of the 3D Environment for UAV Onboard Visual Object Tracking

Single visual object tracking from an unmanned aerial vehicle (UAV) poses fundamental challenges such as object occlusion, small-scale objects, background clutter, and abrupt camera motion. To tackle these difficulties, we propose to integrate the 3D structure of the observed scene into a detection-by-tracking algorithm. We introduce a pipeline that combines a model-free visual object tracker, a sparse 3D reconstruction, and a state estimator. The 3D reconstruction of the scene is computed with an image-based Structure-from-Motion (SfM) component that enables us to leverage a state estimator in the corresponding 3D scene during tracking. By representing the position of the target in 3D space rather than in image space, we stabilize the tracking during ego-motion and improve the handling of occlusions, background clutter, and small-scale objects. We evaluated our approach on prototypical image sequences, captured from a UAV with low-altitude oblique views. For this purpose, we adapted an existing dataset for visual object tracking and reconstructed the observed scene in 3D. The experimental results demonstrate that the proposed approach outperforms methods using plain visual cues as well as approaches leveraging image-space-based state estimations. We believe that our approach can be beneficial for traffic monitoring, video surveillance, and navigation.


Introduction
In recent years, the use of unmanned aerial vehicles (UAVs) has grown together with the number of applications they serve, such as video surveillance, traffic monitoring, aerial photography, wildlife protection, cinematography, target following, disaster response, and even delivery. Initially confined to the military field, UAVs have gradually become widespread in the civil and commercial fields, allowing new applications to emerge that incorporate, or eventually will incorporate, visual object tracking as a core component.
Single visual object tracking is a long-studied computer vision problem relevant for many real-world applications. Its goal is to estimate the location of an object in an image sequence, given its initial location in the first frame. A tracking pipeline that integrates a state estimator is referred to as detection-by-tracking; a pipeline without one is referred to as tracking-by-detection [1]. Despite solving challenging tasks [...]
Figure caption (fragment): [...] [4] in red (ground-level perspective), and the AU-AIR-Track dataset in blue (unmanned aerial vehicle (UAV) perspective). A detailed description of our dataset AU-AIR-Track is given in Section 4.1.
To better deal with the challenges of low-altitude UAV views, we propose a modular detection-by-tracking pipeline coupled with a 3D reconstruction of the environment. The core contributions of this work are as follows: (1) We propose a framework combining three main components: a visual object tracker, which learns the appearance model of the object and infers the position of the object in the image; a 3D reconstruction of the static environment, which allows us to associate pixel positions with a corresponding 3D location; and a state estimator (i.e., a particle filter), which estimates the position and velocity of the object in the 3D reconstruction. (2) We show that incorporating 3D information into the tracking pipeline has several benefits. A 3D transition model increases the realism of the state estimator predictions by reflecting the corresponding object dynamics. The 3D camera poses allow us to compensate for ego-motion, and the depth information improves the handling of object occlusions (see Figure 2). The proposed approach allows us to shift from tracking in 2D image space to tracking in 3D scene space (see Figure 3). (3) We improve the handling of false associations (i.e., distractors) by using a multimodal state estimator. (4) We create a new dataset called AU-AIR-Track, designed for visual object tracking from a UAV perspective. The dataset includes 90 annotated objects as well as annotated occlusions and two 3D reconstructions of the static scenes observed from the UAV. (5) We demonstrate the effectiveness of our pipeline through quantitative results and qualitative analysis.
The paper is structured as follows. Section 2 provides an overview of related work on single visual object tracking and the corresponding benchmarks. Section 3 explains the complete system and its individual components. Section 4 describes the edited dataset and presents the metrics and evaluation protocol used to assess the performance of our approach. Section 5 analyzes quantitative as well as qualitative results and presents the benefits of the design choices made. Finally, we discuss possible modifications to our approach in Section 6 and conclude in Section 7.

Related Work
In recent years, great progress regarding single visual object tracking has been made, owing to the abundance of benchmarks available [2,4-13]. Most of them are designed for evaluating tracking algorithms on ground-level perspectives, resulting in state-of-the-art visual object trackers following the tracking-by-detection paradigm. Currently, three tracking designs prevail on those benchmarks: the discriminative correlation filters [14-20]; the Siamese-based approach [21-26]; and recently, trackers inspired by correlation filters that employ a small convolutional neural network for learning the appearance model of the object [27,28]. The main difference between the three designs lies in how they learn an appearance model of the object. The latter style is explored in this paper.
These tracking algorithms are tailored to ground-level perspectives, but onboard UAV perspectives present particular challenges. For instance, the object occupies a relatively small portion of the image, resulting in a less accurate learned appearance model. This lowers the discrimination capability when objects similar to the tracked object are encountered. In particular, tracking an object from a UAV with an oblique view of the scene can present more occlusion situations than a top-down view. To analyze the performance of tracking algorithms in UAV scenarios, several benchmarks have been introduced [3,29-31]; they range from low to high altitudes and offer either an oblique or a top-down view from the UAV. Most trackers participating in the SOT UAV benchmarks are state-of-the-art single visual object trackers from ground-level perspective benchmarks, used with no or only minor adaptations. However, none of the adapted trackers participating in the SOT UAV benchmarks attempt to utilize 3D information. This can be explained by the lack of such information in the datasets. In contrast, the recently introduced AU-AIR dataset [32] is oriented toward object detection from a UAV viewpoint. It offers sequences that capture typical traffic on a roundabout, as a surveillance drone for traffic monitoring would, and contains sequences that are suitable for reconstructing the observed scene in 3D: they exhibit sufficient translational movement, no excessive flying around, and enough structures in the environment.
In contrast to the SOT UAV benchmark approaches mentioned previously, there are other application domains, such as autonomous driving, where objects are tracked in a 3D reference system through a detection-by-tracking paradigm. Typically, the generated 2D image detections serve as measurements and are mapped from image space into the ego-motion-compensated reference system of the car. An example application for traffic monitoring is presented in [33], where the authors propose a pipeline including Multiple Object Tracking (MOT), stereo cues, visual odometry, optical scene flow, and a Kalman filter [34] to enhance tracking performance on the KITTI benchmark [35,36]. A follow-up to this study [37] reconstructed the static scene and the object in 3D, allowing the shift from tracking in the 2D image space toward tracking in the 3D scene space. In addition, the reconstructed object is associated with a velocity, inferred from the optical flow of the object, which is afterward associated with tracklets, thus enabling the authors to tackle occlusion situations and missing detections.
Most related to our work is the approach presented in [38], where the authors developed an MOT pipeline for UAV scenarios that also benefits from a 3D scene reconstruction for estimating the object location in 3D. In contrast to our work, the authors rely on the tracking-by-detection paradigm by leveraging RetinaNet [39] for detecting objects in the image sequence. They generate tracklets at the image level by integrating visual cues and temporal information to reduce false or missing detections. By projecting the image-based positions of detected objects onto the estimated ground plane (inferred through visual odometry and multiview stereo), the framework is able to assess their 3D positions. However, by using an object detector, the authors are only able to track object classes known to the object detector. In contrast, model-free trackers, i.e., SOTs, are able to track arbitrary objects.
We are convinced that tracking applications such as single object visual tracking from a UAV would also benefit from a shift towards a detection-by-tracking paradigm by incorporating 3D information.
In this paper, we apply a model-free single object visual tracker, meaning that the tracker can only track a single object and starts with a blank appearance model, i.e., without an offline or pretrained appearance model. Regardless of the method used for training the tracker (offline or online), an appearance model of the object is used to locate the object in the image space. Here, we consider the state-of-the-art visual trackers ATOM [27] and DiMP [28] for appearance modeling.
A 3D representation of the observed scene is essential for enabling the 2D to 3D mapping. To this end, a Structure-from-Motion (SfM) or visual Simultaneous Localization and Mapping (SLAM) approach can be leveraged. SfM is a photogrammetric technique that estimates the 3D structures of a scene based on a set of images taken from different viewpoints [40-45]. Visual SLAM, similarly to SfM, reconstructs 3D camera poses and scene structures by leveraging specific properties of features in ordered image sets [46-48]. In addition, UAVs can improve the robustness of the reconstruction by associating an Inertial Measurement Unit (IMU) with the corresponding SfM or visual SLAM algorithm [49-55]. In this work, however, we use COLMAP [40], an established SfM software, to extract camera poses in the scene space and to reconstruct the static scene.
We estimate the state of the object in 3D space by relying on a state estimator, i.e., a particle filter [56]. For evaluating our approach, the publicly available dataset AU-AIR [32] is employed. We carefully edit this dataset further, resulting in the AU-AIR-Track dataset, to best reflect prototypical occlusion situations from a low-altitude oblique UAV perspective.

Single Visual Object Tracking Pipeline for UAV
The designed framework is intended to be modular, allowing us to easily substitute different components or add other methods. The essential architecture of our approach is presented in Figure 4. On an incoming frame, the visual tracker defines a search area that is based on the previously estimated position and size of the object. The visual tracker then produces a similarity score map and estimates an initial position and size of the object $(x_i^o, y_i^o, w_i, h_i)_t$ in the current frame $i$ at time step $t$. The 3D Context component estimates the location of the object in the 3D scene space through the similarity score map, the depth map approximation, and a state estimator, i.e., a particle filter. This allows the framework to distinguish the object from distractors and also to identify occlusions. Finally, the 3D position estimated by the framework is projected back into the image space as $(x_i^n, y_i^n)_t$, corresponding to the final estimated position of the object. It should be noted that no semantic information from the scene (e.g., the road) is used to facilitate the tracking process. Section 3.1 describes the visual tracker component. An overview of the mapping from the 2D image space to the 3D scene space is given in Section 3.2. The particle filter is described in Section 3.3. Lastly, we outline the details of our framework in Section 3.4.

Visual Appearance Modeling of the Object
In this work, we rely on two state-of-the-art visual trackers, ATOM [27] and DiMP [28]. The chosen trackers achieve top ranks against numerous participants on general visual tracking benchmarks [2,4,6,7,10,12,29]. Figure 5 shows the main components of a standard pipeline for single object tracking, leveraged by methods such as ATOM and DiMP.
ATOM includes three main components: (1) a Feature Extractor, a pretrained neural network (e.g., ResNet18 or ResNet50 [57]) that extracts salient features; (2) a Classification component, composed of a two-layer convolutional neural network, which is trained during the tracking process to learn an appearance model of the object and proposes an initial estimate of the location and size of the object in the image as a bounding box; and (3) a Bounding Box regressor component, based on the IoU-Net [58] (trained offline), which refines the initially proposed bounding box.
DiMP is a successor of ATOM and builds on the same elements. The main difference lies in the extension of the Classification component. A new strategy for the initialization of the appearance model is used, which provides the appearance model with better weights at the start. The online learning process for updating the appearance model is also refined for faster and more stable convergence. Based on the extracted features, a similarity score map is inferred and used to estimate an initial bounding box of the object. Afterward, a refined bounding box is estimated through the bounding box regressor, i.e., IoU-Net.
For each incoming frame, a search area is created to delimit the possible positions of the object in the frame. The position and size of the search area depend on the previously estimated position and size of the object. Based on the features extracted from the search area and the appearance model of the object, a similarity score map is inferred, reflecting the resemblance of the extracted features to the learned appearance model. Following the maximum-likelihood approach, the location of the highest score in the similarity score map is designated as the position of the object. The object estimation module, i.e., IoU-Net, is used to identify the best-fitting bounding box and thus to refine the estimated position. To cope with changes in the appearance of the object (e.g., illumination variations, in-plane rotation, motion blur, background clutter), the appearance model of the object has to be adapted during tracking. ATOM and DiMP achieve this adaptation by updating the appearance model regularly. The updates occur every 10 valid frames for ATOM and every 15 valid frames for DiMP. A valid frame is a frame in which the object has been identified correctly, i.e., with a high similarity score. The new appearance model is obtained by retraining the classification component with the search areas of those valid frames. The tracker deals with distractors by recognizing multiple peaks in the similarity score map and immediately updating the appearance model of the object with a high learning rate.
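The difference between picking the single highest score and recognizing several peaks can be illustrated with a minimal sketch. It is not the ATOM or DiMP implementation: the function names, the 8-neighbourhood peak test, and the `min_ratio` cutoff are assumptions made for this example only.

```python
def argmax_2d(score_map):
    """Return (row, col) of the highest score: the maximum-likelihood position."""
    best = (0, 0)
    for r, row in enumerate(score_map):
        for c, value in enumerate(row):
            if value > score_map[best[0]][best[1]]:
                best = (r, c)
    return best


def local_peaks(score_map, min_ratio=0.5):
    """Return local maxima scoring at least min_ratio times the global maximum.

    More than one returned peak hints at a distractor inside the search area.
    min_ratio is an illustrative assumption, not a value from ATOM or DiMP.
    """
    rows, cols = len(score_map), len(score_map[0])
    top_r, top_c = argmax_2d(score_map)
    cutoff = min_ratio * score_map[top_r][top_c]
    peaks = []
    for r in range(rows):
        for c in range(cols):
            value = score_map[r][c]
            if value < cutoff:
                continue
            # Compare against the (up to) eight surrounding cells.
            neighbours = [
                score_map[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(value > n for n in neighbours):
                peaks.append((r, c))
    return peaks


score_map = [
    [0.1, 0.2, 0.1, 0.0, 0.0],
    [0.2, 0.8, 0.2, 0.1, 0.1],
    [0.1, 0.2, 0.1, 0.3, 0.2],
    [0.0, 0.1, 0.3, 0.9, 0.3],
    [0.0, 0.1, 0.2, 0.3, 0.2],
]
print(argmax_2d(score_map))    # position a maximum-likelihood tracker would pick
print(local_peaks(score_map))  # all strong candidates, including possible distractors
```

In this toy map the two peaks have similar scores, so a small change in appearance could flip the maximum from one to the other; this is the situation that a multimodal state estimator is meant to handle.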

Point Cloud Reconstruction of the Environment
We employ Structure-from-Motion (SfM) for achieving the mapping between the 2D image space and the 3D reference system, i.e., scene space. Owing to our modular design, the mapping between the 2D image space and the 3D scene space can be replaced with an alternative photogrammetric technique. For this paper, we create a point cloud representation of the observed environment, which is sufficient for the demonstration of our approach.
We leverage COLMAP [40,59], an established approach for SfM (in our experiments we used COLMAP 3.6, publicly available at https://colmap.github.io/ (accessed on 14.10.2020)). For a given image set that contains overlapping images of the same scene taken from different viewpoints, COLMAP automatically extracts the corresponding camera poses and reconstructs the scene structures in 3D. To achieve this, the library follows common SfM steps: (1) a correspondence search, in which salient features are extracted from the image set and matched across the images, and incorrectly matched features are filtered out; and (2) a reconstruction of the scene as a point cloud by performing image registration, feature triangulation, and bundle adjustment. Figure 6a presents a point cloud reconstruction produced by COLMAP on an image sequence of AU-AIR-Track. The initial reconstruction of the scene contains noise near the camera frame and statistical outliers. Points near the camera frame are falsely triangulated, whereas outliers are correctly positioned points that are too sparse to reliably represent structures of the scene. To reduce the number of incorrect points in the reconstruction, we filter out both. We discard near-camera points from the point cloud by comparing their Euclidean distance to the camera with a threshold. For statistical outliers, we proceed as follows: (1) We define a neighborhood of 10 neighbors and compute the average Euclidean distance $d_i$ of each point $i$ to its neighbors. (2) We define a standard deviation threshold $\sigma_{lim}$ and compute the overall average of all $d_i$ as $d_{avg}$. Points with an average distance $d_i \notin [d_{avg} - \sigma_{lim}, d_{avg} + \sigma_{lim}]$ are identified as outliers and discarded from the point cloud. Figure 6b,c display the removal of points close to the camera frame and of statistical outliers.
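The two filtering steps can be sketched as follows. This is a brute-force illustration under stated assumptions, not the code used for the experiments: a single camera position stands in for all camera frames, and `min_camera_dist` and `sigma_lim` are hypothetical parameter names with illustrative values.

```python
import math


def filter_point_cloud(points, camera_center, min_camera_dist, k=10, sigma_lim=1.0):
    """Drop near-camera points, then statistical outliers.

    points: list of (x, y, z) tuples; camera_center: (x, y, z).
    min_camera_dist and sigma_lim are illustrative thresholds (assumptions).
    """
    # Discard points closer to the camera than the distance threshold.
    kept = [p for p in points if math.dist(p, camera_center) >= min_camera_dist]
    if len(kept) <= k:
        return kept

    # Step 1: mean Euclidean distance d_i of every point to its k nearest neighbours.
    mean_dists = []
    for i, p in enumerate(kept):
        dists = sorted(math.dist(p, q) for j, q in enumerate(kept) if j != i)
        mean_dists.append(sum(dists[:k]) / k)

    # Step 2: keep points with d_i inside [d_avg - sigma_lim, d_avg + sigma_lim].
    d_avg = sum(mean_dists) / len(mean_dists)
    return [p for p, d in zip(kept, mean_dists)
            if d_avg - sigma_lim <= d <= d_avg + sigma_lim]


# Toy cloud: a dense 5 x 5 patch, one isolated point, one point next to the camera.
grid = [(float(x), float(y), 0.0) for x in range(5) for y in range(5)]
cloud = grid + [(50.0, 50.0, 0.0), (0.0, 0.0, 9.5)]
filtered = filter_point_cloud(cloud, camera_center=(0.0, 0.0, 10.0),
                              min_camera_dist=1.0, k=10, sigma_lim=3.5)
print(len(filtered))
```

The pairwise search is quadratic in the number of points; for a real reconstruction a spatial index such as a k-d tree would be the usual choice.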
Moreover, a ground plane $(x_g, y_g)$ is estimated by using the random sample consensus (RANSAC) [60] method (see Figure 7). Points below the ground plane $(x_g, y_g)$ are discarded. We compute depth map approximations based on the resulting scene reconstruction, i.e., the filtered point cloud with the estimated ground plane. The depth map approximations encode, for every pixel, the distance between the position of the UAV and the visible point in the scene for the corresponding viewpoint, i.e., camera pose. For every frame in the AU-AIR-Track dataset, a corresponding depth map approximation is constructed by relying on the pinhole camera model, thus enabling the mapping between the 2D image space and the 3D scene space. Figure 8 displays an example of a depth map approximation with the corresponding frame.
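A ground plane fit with RANSAC can be sketched as below. This is a generic textbook variant, assumed for illustration; the inlier `threshold`, the number of `iterations`, and the function names are not taken from the paper.

```python
import math
import random


def plane_from_points(p1, p2, p3):
    """Return (a, b, c, d) with unit normal so that a*x + b*y + c*z + d = 0, or None."""
    u = [p2[i] - p1[i] for i in range(3)]
    v = [p3[i] - p1[i] for i in range(3)]
    n = (u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0])
    norm = math.sqrt(sum(comp * comp for comp in n))
    if norm < 1e-12:  # the three points are (nearly) collinear
        return None
    a, b, c = (comp / norm for comp in n)
    d = -(a * p1[0] + b * p1[1] + c * p1[2])
    return a, b, c, d


def ransac_plane(points, threshold=0.05, iterations=200, seed=0):
    """Fit a plane with RANSAC; threshold and iterations are illustrative values."""
    rng = random.Random(seed)
    best_plane, best_inliers = None, -1
    for _ in range(iterations):
        plane = plane_from_points(*rng.sample(points, 3))
        if plane is None:
            continue
        a, b, c, d = plane
        # Count points whose distance to the candidate plane is within the threshold.
        inliers = sum(abs(a * x + b * y + c * z + d) <= threshold
                      for x, y, z in points)
        if inliers > best_inliers:
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers


# Toy scene: 100 ground points at z = 0 plus 10 points belonging to a structure.
ground = [(float(x), float(y), 0.0) for x in range(10) for y in range(10)]
structure = [(float(i), 3.0, 1.0 + i) for i in range(10)]
plane, inliers = ransac_plane(ground + structure)
print(plane, inliers)
```

The sign of the returned normal is arbitrary, so deciding which side of the plane counts as "below" requires an extra convention, for example orienting the normal toward the camera positions.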

Particle Filter for Modeling the Object in 3D
In order to overcome the drawbacks of the common maximum-likelihood approach used in SOTs, we switch to detection-by-tracking by applying an additional state estimator. To robustly handle multiple high similarity scores in the similarity score map, a state estimator with a multimodal representation of the probability density function is favored. To that end, a particle filter [56] is used, which estimates the state of the object (i.e., its position and velocity) in the 3D scene for every time step $t$. The particle filter estimates the posterior density function of the object $p(s_t)$ based on a transition model $f$, the current observation $z_t$, and the prior density function $p(s_{t-1})$. The general idea is shown in Equation (1). In order to approximate the probability density function $p(s_t)$, the particle filter uses particles, each of which denotes a hypothesis on the state. Particles from the prior probability density function $p(s_{t-1})$ are propagated through the transition model $f$. Weights $w_t^i$ at time step $t$ are assigned to the $n$ particles with $i \in I = \{1, ..., n\}$, mirroring how strongly each particle matches the current observation $z_t$. Let $p(s_t \mid z_{1:t})$ represent the probability density function of the posterior state given all observations $z_{1:t}$ up to time step $t$ in Equation (2). Let $s_t^i$ denote the hypothesis on the state of the $i$-th particle, $s_t$ the state estimate at time $t$, and $\delta$ the Dirac delta function. The weights are normalized such that $\sum_{i=1}^{n} w_t^i = 1$.
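With the symbols defined above, the weighted-particle approximation of the posterior is commonly written in the following textbook form. It is given here as a reference and is assumed, not confirmed, to correspond to Equation (2):

```latex
p(s_t \mid z_{1:t}) \approx \sum_{i=1}^{n} w_t^i \, \delta\!\left(s_t - s_t^i\right),
\qquad \sum_{i=1}^{n} w_t^i = 1 .
```

Each Dirac impulse places probability mass $w_t^i$ at the hypothesis $s_t^i$, so several well-separated groups of heavy particles represent several modes at once.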
A resampling is performed after the weighting of the particles whenever $n_{eff}$, presented in Equation (3), is below a certain threshold. The resampling allows the particle filter to discard low-weighted particles and create new particles based on the stronger-weighted ones, allowing a refined approximation of $p(s_t \mid z_{1:t})$. In our case, the observation $z_t$ is the similarity score map produced by ATOM or DiMP. We apply a constant velocity model for the transition model $f$ and fix the velocity $v_{z_g}$ along the $z_g$ axis to $v_{z_g} = 0$, where $z_g$ is perpendicular to the estimated ground plane $(x_g, y_g)$ of the reconstructed scene. As a result, particles can only move on the ground.
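A minimal sketch of the prediction step and of the resampling criterion is given below. It assumes the common definition of the effective sample size, $n_{eff} = 1 / \sum_i (w_t^i)^2$, which may differ in detail from Equation (3); the particle layout, the function names, and the noise level are likewise assumptions for illustration.

```python
import random


def predict(particles, dt, noise_std, rng):
    """Constant-velocity prediction on the ground plane.

    Each particle is (x_g, y_g, vx_g, vy_g). The velocity along z_g is fixed to 0,
    so the height never changes and is not stored. Gaussian noise on x_g and y_g
    accounts for the uncertainty of the transition model.
    """
    return [
        (x + vx * dt + rng.gauss(0.0, noise_std),
         y + vy * dt + rng.gauss(0.0, noise_std),
         vx, vy)
        for x, y, vx, vy in particles
    ]


def normalize(weights):
    """Scale the weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def effective_sample_size(weights):
    """n_eff = 1 / sum_i (w_i)^2 for normalized weights (common definition)."""
    return 1.0 / sum(w * w for w in weights)


rng = random.Random(0)
particles = [(0.0, 0.0, 1.0, 2.0)] * 4
# dt = 0.2 s matches sequences sampled at 5 frames per second.
moved = predict(particles, dt=0.2, noise_std=0.0, rng=rng)
uniform = normalize([1.0, 1.0, 1.0, 1.0])
print(moved[0])
print(effective_sample_size(uniform))               # every particle contributes equally
print(effective_sample_size([1.0, 0.0, 0.0, 0.0]))  # a single particle dominates
```

A frequently used rule is to resample when $n_{eff}$ falls below half the number of particles; the threshold used in the experiments is not stated here, so that value would be a further assumption.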

Tracking Cycle and Occlusion Handling
During initialization, the visual tracker learns an appearance model of the object based on the initial bounding box. At the same time, the position of the object in the image is projected onto the estimated ground plane of the 3D reconstruction, i.e., the scene space. To this end, we rely on the pinhole camera model for estimating the depth value of the object in the scene space. Supposing that the object does not leave the ground in the real world, we can assume that the object is bound to move only on the ground plane $(x_g, y_g)$. [...] (4) and simplifying it, we obtain Equation (6), allowing us to infer the depth value for the object based on its image coordinates $(x_i^o, y_i^o)_t$. Now that the depth value $z_c$ is estimated, we can determine the missing coordinates $x_c^o$ and $y_c^o$ of $t_t^c$ based on Equation (5). By applying a rigid body transformation $H_w^c$ from the camera to the world frame of reference, we obtain the position of the object in the scene space as $t_t^w$. The projected position $t_t^w$ for the first frame $t = 1$, expressed in 3D coordinates, is considered as the initial position for the state $s_{-1}$.
In the following frame, the visual tracker component computes a bounding box delimiting the position and size of the object in the frame. Additionally, we extract the current search area and the similarity score map. Particles are generated uniformly on the ground of the 3D reconstruction but are restricted to the projected surface of the search area. Particles are then weighted according to the similarity score map of the visual tracker component. A resampling step follows, which narrows down the possible locations of the object by regenerating new particles where previous high-weighted particles were located. As a consequence, particles are mostly located around a high similarity response, giving us an estimated position of the object in 3D.
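The projection of an image position onto the ground plane can be sketched as a ray-plane intersection under the pinhole camera model. This is a generic formulation and not necessarily identical to Equations (4)-(6); the intrinsics tuple, the plane parameterization, and the function name are assumptions made for the example.

```python
def backproject_to_ground(u, v, intrinsics, plane_cam, cam_to_world):
    """Intersect the viewing ray of pixel (u, v) with the ground plane.

    intrinsics: (fx, fy, cx, cy) of a pinhole camera.
    plane_cam: (a, b, c, d) with a*x + b*y + c*z + d = 0 in camera coordinates.
    cam_to_world: (R, t), R a 3x3 nested list and t a 3-vector.
    Returns the depth z_c and the 3D point in world coordinates, or None if the
    ray is parallel to the plane or the intersection lies behind the camera.
    """
    fx, fy, cx, cy = intrinsics
    ray = ((u - cx) / fx, (v - cy) / fy, 1.0)  # point on the viewing ray at depth 1
    a, b, c, d = plane_cam
    denom = a * ray[0] + b * ray[1] + c * ray[2]
    if abs(denom) < 1e-12:
        return None
    z_c = -d / denom
    if z_c <= 0:
        return None
    p_cam = tuple(z_c * r for r in ray)
    R, t = cam_to_world
    # Rigid body transformation from the camera frame to the world frame.
    p_world = tuple(sum(R[i][j] * p_cam[j] for j in range(3)) + t[i]
                    for i in range(3))
    return z_c, p_world


# Toy setup: camera 10 m above the ground, looking straight down.
intrinsics = (1000.0, 1000.0, 960.0, 540.0)
plane_cam = (0.0, 0.0, 1.0, -10.0)            # ground at z_c = 10 in camera coordinates
R = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, -1.0]]
t = (0.0, 0.0, 10.0)
z_c, p_world = backproject_to_ground(1060.0, 540.0, intrinsics, plane_cam, (R, t))
print(z_c, p_world)
```

The nadir view is used only to keep the numbers easy to check; for the oblique views considered in this work, the plane parameters in camera coordinates change with every camera pose.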
The current 3D position $(x_w, y_w, z_w)_t$ along with the previous 3D position of state $s_{-1}$ are used to determine an initial velocity. As a result, an initial state $s_0$ for the object is estimated with velocity and position.
In the following frame, the particle filter can be used to predict the position of the object in the scene space as $(x_w^p, y_w^p, z_w^p)_t$. To deal with the uncertainties in the transition model, i.e., the constant velocity model, a Gaussian noise term is added along $x_g$ and $y_g$.
After initialization, our tracking framework enters an online tracking cycle. Figure 4 presents the essential architecture of the framework during online tracking. On an incoming frame, the visual tracker component defines a search area and produces a similarity score map along with an estimated bounding box for the object on frame $t$. During the 3D Context computation step, we estimate the 3D position of the object in the scene space, distinguish the object from distractors, and recognize occlusions. The new 3D position of the object $(x_w, y_w, z_w)_t$ expressed in the scene space is projected back onto the frame $t$, corresponding to the new estimated position $(x_i^n, y_i^n)_t$ of the object in image space. Figure 9 displays the different building blocks of the 3D Context component, which contains the particle filter that estimates the object state, i.e., the 3D position and 3D velocity in the scene space. In the first step, the particle filter predicts particles on an incoming frame. The predicted particles are clustered in the image space. We then identify the cluster representing the object based on how close each cluster is to the previously estimated position of the object in the scene space.
The occlusion identifier component identifies occluded particles in the object cluster through the depth map approximation. To this end, we compare the depth value of each predicted particle, as seen from the perspective of the UAV, with the depth value from the depth map approximation. A particle is identified as hidden by a structure if there is a discrepancy between the predicted depth and the observed depth. We consider the object occluded if at least 50% of the particles in the object cluster are occluded. Similarly, the object is automatically considered hidden if the similarity score map is flat and widely spread over the search area. Should the object be identified as occluded, the tracking framework relies on the prediction provided by the particle filter, $(x_w^p, y_w^p, z_w^p)_t$. To identify the reappearance of the object, a high similarity score close to the expected position is required. When more than 50% of the predicted particles are not labeled as occluded, we consider that the object is potentially visible again. A specific threshold on the similarity score determines whether the object is truly visible again: a similarity score greater than or equal to this threshold must be reached before the particle filter is updated with the observation, i.e., the similarity score map. If distractors are present when the object reappears, then the group of particles closest to the prediction is considered to represent the object. If the object is considered occluded, the 3D coordinates of the prediction provided by the particle filter, $(x_w^p, y_w^p, z_w^p)_t$, are used as the presumed position of the object in the scene space and are reprojected into the image as $(x_i^n, y_i^n)_t$. If the object is not identified as occluded, an update and a resampling are performed on the particles before clustering them a second time. The cluster representing the object is then determined.
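The depth comparison can be sketched as follows. The 50% ratio comes from the text, whereas the depth `tolerance`, the one-sided test (scene surface closer than the particle), and the function name are assumptions made for this illustration.

```python
def is_occluded(predicted_depths, observed_depths, tolerance=0.5, ratio=0.5):
    """Decide whether the object cluster is occluded.

    predicted_depths: depth of each particle of the object cluster, seen from the camera.
    observed_depths: depth-map value at the pixel each particle projects to.
    A particle counts as hidden when the scene surface is closer to the camera
    than the particle by more than `tolerance` (an illustrative value).
    """
    if not predicted_depths:
        return False
    hidden = sum(1 for pred, obs in zip(predicted_depths, observed_depths)
                 if pred - obs > tolerance)
    return hidden >= ratio * len(predicted_depths)


# Four particles at 20 m; a structure at 12 m covers two of them, then one.
print(is_occluded([20.0, 20.0, 20.0, 20.0], [12.0, 12.0, 20.0, 20.0]))
print(is_occluded([20.0, 20.0, 20.0, 20.0], [12.0, 20.0, 20.0, 20.0]))
```

Because the depth maps are approximations built from a sparse reconstruction, the tolerance has to absorb reconstruction noise; its value would need tuning per scene.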
Lastly, the 3D coordinates $(x_w^u, y_w^u, z_w^u)_t$ of the cluster, modeling the object in the scene space, are reprojected into the image as $(x_i^n, y_i^n)_t$.
When the object is visible, the second step is to update the belief of the particle filter with the observation, i.e., the similarity score map. Before updating, however, a small percentage of the particles (10%) is uniformly redistributed across the projected search area on the ground. This redistribution ensures that we maintain a multimodal distribution. Without redistributing a portion of the particles, particles would clump around the object and only a small portion of the similarity score map would be considered for updating the weights. The weights of the particles are updated by projecting their 3D positions into the image space, allowing us to weight them according to their 2D location in the similarity score map. We then resample particles through stratified resampling [61]. By using a particle filter, we can model multimodal tracking, preventing instantaneous switching from the object to distractors, as illustrated in Figure 10.
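The redistribution and the stratified resampling can be sketched as below. The stratified scheme is the standard one (one uniform draw per equal-width stratum of [0, 1)); the rectangular search area, the zero velocity given to redistributed particles, and the function names are assumptions made for the example.

```python
import random


def redistribute(particles, search_area, fraction, rng):
    """Replace a random fraction of the particles by uniform samples in the search area.

    search_area: (x_min, x_max, y_min, y_max) on the ground plane, assumed here to
    be an axis-aligned rectangle; redistributed particles get zero velocity.
    """
    x_min, x_max, y_min, y_max = search_area
    out = list(particles)
    for idx in rng.sample(range(len(out)), int(fraction * len(out))):
        out[idx] = (rng.uniform(x_min, x_max), rng.uniform(y_min, y_max), 0.0, 0.0)
    return out


def stratified_resample(particles, weights, rng):
    """Stratified resampling with normalized weights.

    The interval [0, 1) is split into n equal strata and one position is drawn
    uniformly inside each; every position selects the particle whose cumulative
    weight interval contains it.
    """
    n = len(particles)
    positions = [(i + rng.random()) / n for i in range(n)]
    resampled, cumulative, j = [], weights[0], 0
    for pos in positions:
        while pos >= cumulative and j < n - 1:
            j += 1
            cumulative += weights[j]
        resampled.append(particles[j])
    return resampled


rng = random.Random(0)
cloud = [(100.0, 100.0, 1.0, 1.0)] * 100
spread = redistribute(cloud, search_area=(0.0, 10.0, 0.0, 10.0), fraction=0.1, rng=rng)
moved_count = sum(p != (100.0, 100.0, 1.0, 1.0) for p in spread)

# With all weight on one hypothesis, resampling keeps only that hypothesis.
survivors = stratified_resample(["a", "b", "c", "d"], [0.0, 0.0, 1.0, 0.0], rng)
print(moved_count, survivors)
```

Compared with drawing all positions independently, the stratified draw guarantees that positions are spread over the whole cumulative weight range, which lowers the variance of the resampled set.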
After particles are resampled, the framework clusters them based on their position in the image space. We identify the cluster describing the object by comparing the position $(x_w^c, y_w^c, z_w^c)$ of every cluster $c$ against the predicted position of the object $(x_w^p, y_w^p, z_w^p)_t$. The cluster closest to the predicted position is considered to describe the object state. Thus, the updated position of the object is based on this identified cluster as $(x_w^u, y_w^u, z_w^u)_t$. The coordinates in the scene space are then projected into the image space as $(x_i^n, y_i^n)_t$ and used for updating the appearance model of the visual tracker component. This avoids adding incorrect training samples, i.e., distractors.
Figure 10. The small green bounding box represents the estimated bounding box of the unmodified visual object tracker and the red bounding box represents the estimated bounding box of a visual object tracker coupled with a particle filter. The large green bounding box corresponds to the search area of the visual tracker. In this scenario, the similarity score map has three peaks. Due to its maximum-likelihood approach, the visual tracker mistakes a distractor for the object, whereas the visual tracker coupled with a particle filter manages to stay on the object, even though the highest similarity score is attached to a distractor.

Dataset and Evaluation Metrics
In Section 4.1, we present our edited AU-AIR-Track dataset. In Section 4.2, we define the metrics used for evaluating the trackers on the dataset.

AU-AIR-Track Dataset
Using our approach, we want to tackle occlusion occurrences, false associations, and ego-motion using a 3D reconstruction of the static scene. To that end, we need a UAV dataset that provides visual object tracking annotations with 3D reconstructions of the scene. As stated before, current UAV datasets [3,29-31] with visual tracking annotations do not provide 3D information. Therefore, we created the AU-AIR-Track dataset (AU-AIR-Track and AU-AIR [32] are available at https://github.com/bozcani/auairdataset (accessed on 14.10.2020)), which includes the following: bounding box annotations with identification numbers, occlusion annotations, 3D reconstructions of the scene with the corresponding depth map approximations, and camera poses.
AU-AIR-Track is distilled from the AU-AIR dataset [32], which provides real-world sequences suitable for traffic surveillance and reflects prototypical outdoor situations captured from a UAV. AU-AIR contains sequences taken from a low flight altitude ranging from 10 to 30 m, under different camera angles ranging from approximately 45 to 90 degrees. For each frame, the dataset provides recording time stamps, Global Positioning System (GPS) coordinates, altitude information, IMU data, and the velocity of the UAV. The criteria in favor of AU-AIR are its low-altitude flights with an oblique point of view towards the scene, the multiple occlusions caused by the tree in the roundabout, and the duration over which a scene is observed.
The AU-AIR-Track dataset consists of two sequences, designated as 0 and 1, with a total of 90 annotated objects: sequence 0 contains 887 frames and 63 annotated objects, and sequence 1 contains 512 frames and 27 annotated objects. Both sequences have been extracted at 5 frames per second and their resolution is 1920 × 1080 and 1922 × 1079 pixels, respectively. Figures 11 and 12 display a few images from both sequences, which present only oblique views of the scene taken from a nonstationary UAV. As a result, the main challenges captured in AU-AIR-Track are the constant camera motion, the low image resolution, the presence of distractors, and most importantly, frequent object occlusions. Since AU-AIR annotations are designed for object detection, we adapted them in AU-AIR-Track for visual object tracking. Figure 13 presents the original AU-AIR annotations and the adapted annotations for AU-AIR-Track (where only moving objects are annotated). Figure 14 shows the distribution of the ground-truth bounding box locations for both sequences. The value of each pixel denotes the probability that a bounding box covers that pixel over an entire sequence. It can be seen that most objects follow the underlying scene structure, i.e., the road. Of the 63 objects in sequence 0, 45 undergo an occlusion, as do 25 of the 27 objects in sequence 1. As stated before, AU-AIR-Track provides the 3D reconstructions of both sequences (see Figure 15). The 3D information available in our dataset comprises the sparse reconstructions with their respective ground plane estimations, camera poses, fundamental matrices, and transformation matrices.

Evaluation Metrics
Similar to long-term tracking, the tracked object can disappear and reappear. Thus, no manual reinitialization is done when the tracker loses the object. To measure the performance, we utilize common long-term metrics (tracking precision $Pr$, tracking recall $Re$, and tracking F1-score $F(\tau_\theta)$) at a given confidence threshold $\tau_\theta$, introduced in [62] and used in [2,6].
Let $G_t$ be the ground-truth object bounding box and $A_t$ the bounding box estimated by the tracker at frame $t$. Further, let $\theta_t$ denote the prediction confidence, which in our case is the maximal score given by the tracker regarding its confidence in the presence of the object in the current frame $t$. If the object is partially or fully occluded, we set $G_t = \emptyset$; similarly, if the tracker predicts an object with a confidence $\theta_t$ below $\tau_\theta$, we set $A_t = \emptyset$. Furthermore, let $n_g$ denote the number of frames with $G_t \neq \emptyset$, where $t_g \in N_g = \{1, \ldots, n_g\}$, and let $n_p$ denote the number of frames with $A_t \neq \emptyset$ and $G_t \neq \emptyset$, where $t_p \in N_p = \{1, \ldots, n_p\}$. Lastly, $\Omega(A_t(\theta_t), G_t)$ describes the intersection-over-union (IoU) between $G_t$ and $A_t$.
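Tracking precision and tracking recall at a given threshold $\tau_\theta$ are defined as in [62]:

$$Pr(\tau_\theta) = \frac{1}{n_p} \sum_{t_p \in N_p} \Omega\left(A_{t_p}(\theta_{t_p}), G_{t_p}\right) \tag{7}$$

$$Re(\tau_\theta) = \frac{1}{n_g} \sum_{t_g \in N_g} \Omega\left(A_{t_g}(\theta_{t_g}), G_{t_g}\right) \tag{8}$$

The combination of tracking precision $Pr(\tau_\theta)$ and tracking recall $Re(\tau_\theta)$ as a single score is defined as the tracking F1-score $F(\tau_\theta)$ [62]:

$$F(\tau_\theta) = \frac{2 \, Pr(\tau_\theta) \, Re(\tau_\theta)}{Pr(\tau_\theta) + Re(\tau_\theta)} \tag{9}$$

Similarly to the long-term challenges presented in [2,6], the final tracking F1-score is used to rank the different tracking algorithms.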
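To make the metric definitions concrete, the following is a minimal sketch of how precision, recall, and F1-score could be computed for one object at one threshold. It assumes boxes in `(x, y, w, h)` format and represents an empty box by `None`; precision is averaged over frames where both a prediction and a visible ground truth exist, and recall over frames with a visible ground truth, following the index sets defined in the text. The function names are illustrative and do not come from an existing evaluation toolkit.

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes; None means empty."""
    if a is None or b is None:
        return 0.0
    iw = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    ih = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    inter = max(iw, 0.0) * max(ih, 0.0)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def precision_recall_f1(preds, confs, gts, tau):
    """Tracking precision, recall and F1-score at confidence threshold tau.

    preds: predicted boxes per frame; confs: confidences per frame;
    gts: ground-truth boxes per frame, None when the object is occluded.
    """
    # A prediction below the threshold is treated as empty.
    kept = [p if c >= tau else None for p, c in zip(preds, confs)]
    overlaps = [iou(a, g) for a, g in zip(kept, gts)]
    # Assumed index sets: precision over frames with prediction and visible
    # ground truth, recall over frames with visible ground truth.
    p_set = [o for o, a, g in zip(overlaps, kept, gts)
             if a is not None and g is not None]
    g_set = [o for o, g in zip(overlaps, gts) if g is not None]
    pr = sum(p_set) / len(p_set) if p_set else 0.0
    re = sum(g_set) / len(g_set) if g_set else 0.0
    f1 = 2 * pr * re / (pr + re) if pr + re > 0 else 0.0
    return pr, re, f1

box = (0, 0, 10, 10)
pr, re, f1 = precision_recall_f1(
    preds=[box, box, box],
    confs=[0.9, 0.2, 0.8],
    gts=[box, box, None],   # object occluded in the third frame
    tau=0.5,
)
```

In this toy example the second prediction is discarded because its confidence is below the threshold, so the tracker misses one of the two visible frames and recall drops to 0.5 while precision stays at 1.0.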
The evaluation protocol is as follows: the trackers are evaluated on all objects present in AU-AIR-Track. The annotated first frame of the object is used to initialize the tracker. From there, the tracker outputs a predicted bounding box for every subsequent frame where the object is annotated; no reset is allowed, even during occlusions. Tracking precision, tracking recall, and tracking F1-score are computed according to Equations (7)-(9). To reduce statistical fluctuations caused by the classification component of the visual tracker, which describes the appearance of the object through learned weights, we evaluate every tracker five times on both sequences of AU-AIR-Track. For every evaluation $e$ and for every object $i$ present in a sequence, we take the maximum tracking F1-score for that object, $f^{\max}_{e,i}$. Considering the maximum tracking F1-score, regardless of $\tau_\theta$, allows us to examine how the tracker would work without human intervention.
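Taking the maximum F1-score regardless of the threshold amounts to sweeping the confidence threshold and keeping the best value. The sketch below illustrates this for a single object in a single run, assuming the per-frame IoU values have already been computed; the function name `max_f1` and the choice of the observed confidences as candidate thresholds are assumptions made for illustration.

```python
import numpy as np

def max_f1(ious, confs, gt_visible):
    """Best F1-score over all candidate thresholds, and the threshold reaching it.

    ious: per-frame IoU between prediction and ground truth (0 where undefined);
    confs: per-frame prediction confidence;
    gt_visible: per-frame flag, False while the object is occluded.
    """
    ious = np.asarray(ious, dtype=float)
    confs = np.asarray(confs, dtype=float)
    gt_visible = np.asarray(gt_visible, dtype=bool)
    best_f, best_tau = 0.0, None
    for tau in np.unique(confs):            # candidate thresholds, ascending
        reported = confs >= tau             # predictions kept at this threshold
        overlap = np.where(reported, ious, 0.0)
        both = reported & gt_visible
        pr = overlap[both].mean() if both.any() else 0.0
        re = overlap[gt_visible].mean() if gt_visible.any() else 0.0
        f = 2 * pr * re / (pr + re) if pr + re > 0 else 0.0
        if f > best_f:
            best_f, best_tau = f, tau
    return best_f, best_tau

best_f, best_tau = max_f1(
    ious=[0.8, 0.6, 0.1, 0.0],
    confs=[0.9, 0.7, 0.3, 0.5],
    gt_visible=[True, True, True, False],
)
```

Here the low-confidence, low-overlap prediction in the third frame is best discarded, so the maximum is reached once the threshold rises above its confidence.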
For computing the final F1-score $f_{\text{final}}$ of a tracker on a sequence of AU-AIR-Track: (1) we average maximum tracking F1-scores $f^{\max}$

Results
In this section, we demonstrate the effectiveness of our approach through quantitative results in Section 5.1 and a qualitative analysis in Section 5.2.

Quantitative Results
In this section, we use the following terms: (1) "original", which refers to the unmodified visual object trackers ATOM and DiMP presented in [27,28]; (2) "2D variant", denoting ATOM and DiMP coupled with a particle filter working in the 2D image space, i.e., ATOM-2D and DiMP-2D; and (3) "3D variant", referring to ATOM and DiMP utilizing 3D information combined with a particle filter operating in the 3D scene space, i.e., ATOM-3D and DiMP-3D.

Based on Table 1, we observe that the original variants of ATOM and DiMP attain the lowest scores and the least stable results. As stated before, the original methods are designed for short-term tracking from a ground-level perspective by relying on visual cues. The original variants have no specific module for handling partial or full occlusions apart from applying a threshold to the similarity score map. If the similarity score map has no peak above the set threshold and is widely spread, the tracker is able to recognize that the object is missing but is not able to predict its next position. The original variants are also more prone to switching from the object to a false association, i.e., a distractor, because of the maximum-likelihood approach. Since the original trackers rely only on the learned appearance model, they are extremely dependent on the number of pixels that encode the object. As a result, the trackers lose most of the tracked objects that are described by a small number of pixels, i.e., small-scale objects.
ATOM-2D and DiMP-2D also recognize occlusion only through visual cues, by setting a minimum required similarity score as a threshold. An occlusion is identified when the similarity score falls below this threshold. During occlusion, the position of the object cannot be inferred visually but can be estimated (to some extent) through the predictions of the particle filter. Relying solely on visual cues for identifying occlusion is limited, since the tracker may misinterpret a fast appearance change as an occlusion. Another limitation is that only occlusions without ego-motion can be handled, because the particle filter estimates positions in the 2D image space. Overall, the tracking F1-score increases compared to the original variants, but this gain is essentially due to better recognition and handling of distractors and small-scale objects, enabled by a multimodal state estimator. With the multimodal property, we allow groups of particles to form where high responses in the similarity score map are found. The different groups of particles are clustered depending on their locations, enabling the 2D trackers to consider and distinguish the object and distractors. In contrast, the original trackers, which use a maximum-likelihood approach, can only handle the highest response in the similarity score map. Using a state estimator also offers the benefit of being less dependent on the number of pixels used for encoding the object.
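The idea of grouping particles by location to separate the object from distractors can be illustrated as follows. This is a minimal sketch that assumes a simple greedy, radius-based grouping; the clustering procedure actually used in the trackers is not specified here, and `cluster_particles` and `radius` are hypothetical names.

```python
import numpy as np

def cluster_particles(positions, weights, radius):
    """Greedily group weighted 2D particles into spatial modes.

    The heaviest unassigned particle seeds a group that absorbs every
    unassigned particle within `radius`. Returns a list of
    (weighted_center, total_weight) sorted by decreasing total weight.
    """
    positions = np.asarray(positions, dtype=float)
    weights = np.asarray(weights, dtype=float)
    unassigned = np.ones(len(positions), dtype=bool)
    modes = []
    while unassigned.any():
        candidates = np.flatnonzero(unassigned)
        seed = candidates[np.argmax(weights[candidates])]
        dist = np.linalg.norm(positions - positions[seed], axis=1)
        members = unassigned & (dist <= radius)
        w = weights[members]
        if w.sum() > 0:
            center = np.average(positions[members], axis=0, weights=w)
        else:
            center = positions[members].mean(axis=0)
        modes.append((center, float(w.sum())))
        unassigned &= ~members
    return sorted(modes, key=lambda m: -m[1])

# Two groups of particles: one around the object, one around a distractor.
modes = cluster_particles(
    positions=[[10, 10], [11, 10], [50, 50], [51, 51]],
    weights=[0.4, 0.3, 0.2, 0.1],
    radius=5.0,
)
```

Each returned mode is a candidate location; a maximum-likelihood tracker would only ever see the single highest response, whereas the grouped particles keep both hypotheses alive.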
The 3D variants, ATOM-3D and DiMP-3D, achieve the best performance on the AU-AIR-Track dataset. ATOM-3D and DiMP-3D can identify occlusions not only based on visual cues but also through depth information obtained from the 3D reconstruction of the scene (depth map approximations). This allows them to recognize a hidden object more reliably than the previous variants. Using a particle filter in the 3D scene space enables the usage of a 3D transition model, which describes real-world motions more adequately than a particle filter in the 2D image space. This results in improved stability of the corresponding predictions w.r.t. ego-motion. Thus, the predictions of the state estimator are more accurate than in the 2D variants when the object is hidden, allowing them to potentially estimate the position of the object for a longer period. Additionally, as in ATOM-2D and DiMP-2D, the particle filter enhances the ability of the tracker to distinguish the object from distractors and makes it less dependent on the number of pixels describing the object. Based on these results, ATOM-3D and DiMP-3D display better performance than the original and 2D variants.
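The benefit of estimating the state in the 3D scene space can be sketched with a prediction step followed by a projection into the image. The example below assumes a constant-velocity transition model with random acceleration noise and a standard pinhole camera with intrinsics `K` and pose `(R, t)`; both are illustrative assumptions, and the transition model used in the actual pipeline may differ.

```python
import numpy as np

def predict(states, dt, accel_std, rng):
    """Propagate particles [x, y, z, vx, vy, vz] with a constant-velocity model."""
    states = states.copy()
    accel = rng.normal(0.0, accel_std, size=(len(states), 3))
    states[:, :3] += states[:, 3:] * dt + 0.5 * accel * dt ** 2
    states[:, 3:] += accel * dt
    return states

def project(points, K, R, t):
    """Project 3D scene points into the image: x ~ K (R X + t)."""
    cam = points @ R.T + t
    uv = cam @ K.T
    return uv[:, :2] / uv[:, 2:3]

rng = np.random.default_rng(0)
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
particles = np.array([[0.0, 0.0, 10.0, 1.0, 0.0, 0.0]])

# Noise-free prediction for one frame at 5 frames per second.
predicted = predict(particles, dt=0.2, accel_std=0.0, rng=rng)

# The same 3D prediction seen from two camera poses (ego-motion in between).
uv_before = project(predicted[:, :3], K, np.eye(3), np.zeros(3))
uv_after = project(predicted[:, :3], K, np.eye(3), np.array([1.0, 0.0, 0.0]))
```

The 3D state is unaffected by the change of camera pose; only its projection moves in the image. A filter operating in image space would have to absorb this shift as apparent object motion.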

Qualitative Analysis
In this section, we discuss selected qualitative tracking scenarios to verify the overall viability of the different methods. To illustrate the results, we examine the variants of DiMP in depth. In every figure, the lower part shows a close-up of the region delimited by a red dash-dot rectangle in the corresponding upper image. Figures 16 and 17 display scenarios where the object is lost by the original DiMP variant but not by DiMP-2D and DiMP-3D. In the first scenario, the original variant, which uses a maximum-likelihood approach, mistakes the distractor for the object because of the low number of pixels encoding the object. In contrast, by leveraging a particle filter, DiMP-2D and DiMP-3D handle the presence of distractors better and are less prone to fail on small-scale objects. The second scenario presents an occlusion situation, where the object is hidden by the tree in the roundabout. Both DiMP-2D and DiMP-3D recognize the occlusion and can rely on their respective particle filters to predict the position of the hidden object until its reappearance.

Figure 16. Comparison of the evaluated trackers. DiMP-2D and DiMP-3D are able to handle the presence of distractors by taking advantage of the multimodal representation provided by a particle filter. In contrast, using the maximum-likelihood approach, DiMP switches to a distractor.

Figure 18 shows three consecutive frames where only DiMP-3D is unaffected by the ego-motion and can track the object robustly because the position of the object is expressed in the 3D scene space. Figure 19 illustrates another scenario where both DiMP-2D and DiMP-3D recognize the object as hidden, but only DiMP-3D predicts robust and reasonable positions for the hidden object during camera motion. Relying on an image-based state estimator is only viable when ego-motion is minimal, as in Figure 17.
In Figure 20, only DiMP-3D can identify the object undergoing occlusion, owing to the depth information provided by the depth map approximations in addition to the visual cues, whereas DiMP and DiMP-2D switch to a distractor because they rely solely on visual cues.

Figure 18. Comparison between DiMP-2D and DiMP-3D. During camera motion, DiMP-2D is unable to track the object, whereas DiMP-3D is able to stabilize the tracking by expressing the object position in the 3D scene space. The light-blue grid is drawn to help visualize ego-motion.

Figure 19. Comparison between DiMP-2D and DiMP-3D. Only DiMP-3D is able to robustly track the object during ego-motion while the object is occluded. The light-blue grid is drawn to help visualize ego-motion.

Although DiMP-3D and ATOM-3D achieve the best results, there are cases where both fail. On rare occasions, an occlusion is not identified correctly because a strong distractor is present while the object is partially hidden. This could be prevented by devising a different strategy for recognizing occlusions and reappearances of the object. Another point limiting the performance of DiMP-3D is illustrated in Figure 21, where the object slows down at the intersection for a long period. While the object is not moving, the particle filter continuously updates the estimated velocity to be consistent with the observations (velocity near zero). When the object accelerates, the particle filter cannot match the speed instantly because of the transition model.

Figure 21. Comparison between DiMP-2D and DiMP-3D. During the acceleration phase of the object, the particle filter estimates the velocity of the object with a delay due to the transition model adopted.

Discussion
Besides the challenges arising from the specific characteristics of single visual object tracking from UAVs, the use of computer vision approaches onboard a UAV additionally faces the problem of finding an adequate compromise between computational complexity and real-time capability under the extreme resource limitations of the platform. Although this is outside the scope of this paper, we explore in this section alternative design choices for integrating the current pipeline onto a UAV. Owing to the modular design of our pipeline, we can replace individual components with variants that are less cost-intensive.
Regarding the visual object tracker component, both ATOM [27] and DiMP [28] have real-time capabilities but rely on a Graphics Processing Unit (GPU) for inferring the position of the object. To reduce both the memory footprint and the run time, the original feature extractor, i.e., ResNet-18 or ResNet-50 [57], could be replaced with MobileNet [63], which is specifically designed for embedded vision applications and mobile devices. Alternatively, it is possible to replace ATOM and DiMP with another type of tracker. Since both trackers update the appearance model of the object on-the-fly, they require more GPU capacity than methods that do not update the appearance model, such as Siamese SOT [21,23-25].
In our work, we focused on image-based scene reconstruction by leveraging an SfM-based method [40]. To attain real-time performance, a Visual SLAM-based method that can incorporate IMU information [49-55] is preferred for robustly reconstructing the 3D environment on-the-fly. A concern for the sparse reconstruction might be the storage and the processing time needed when the UAV is observing a large area. To reduce the required storage space, the point cloud can be reconstructed or partially loaded, depending on the current UAV position in the scene [64].
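As a simple illustration of position-dependent loading, the sketch below keeps only the points of a sparse reconstruction that lie within a given distance of the UAV. This is an assumed, deliberately naive selection rule and not the method of [64]; `points_near` and `radius` are hypothetical names.

```python
import numpy as np

def points_near(points, uav_position, radius):
    """Return the subset of an (N, 3) point cloud within `radius` of the UAV."""
    points = np.asarray(points, dtype=float)
    dist = np.linalg.norm(points - np.asarray(uav_position, dtype=float), axis=1)
    return points[dist <= radius]

cloud = np.array([[0.0, 0.0, 0.0],
                  [5.0, 0.0, 0.0],
                  [50.0, 0.0, 0.0]])
local = points_near(cloud, uav_position=[0.0, 0.0, 10.0], radius=12.0)
```

For large scenes, a spatial index or a tiled storage layout would avoid scanning the whole cloud at every update.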
Since the multimodal representation of the probability density function is indispensable for identifying distractors in the search area, a particle filter [56] is utilized in our detection-by-tracking pipeline. Thus, using a particle filter rather than other state estimators such as the Kalman filter [34] is crucial for the proposed pipeline, despite being computationally more demanding. Although numerous particle filter implementations do not perform well with a high number of particles, this is not necessarily a general limitation of the approach [65]. To reduce the computation time, the authors of [66] propose resampling methods that are faster than common resampling approaches.
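For reference, the resampling step of a particle filter can be implemented in linear time with systematic resampling, a widely used scheme. The sketch below is given only to illustrate what this step does; it is not claimed to be the method proposed in [66] or the one used in our pipeline.

```python
import numpy as np

def systematic_resample(weights, rng):
    """Return particle indices drawn by systematic resampling.

    A single random offset places n evenly spaced pointers on the
    cumulative weight distribution; each pointer selects one particle.
    """
    weights = np.asarray(weights, dtype=float)
    n = len(weights)
    pointers = (rng.random() + np.arange(n)) / n
    cumulative = np.cumsum(weights / weights.sum())
    cumulative[-1] = 1.0  # guard against floating-point round-off
    return np.searchsorted(cumulative, pointers, side="right")

rng = np.random.default_rng(0)
uniform_idx = systematic_resample([0.25, 0.25, 0.25, 0.25], rng)
peaked_idx = systematic_resample([0.0, 0.0, 1.0, 0.0], rng)
```

With uniform weights every particle survives exactly once, whereas a single dominant weight causes that particle to be duplicated across the whole set.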

Conclusions
In this paper, we propose an approach to improve UAV onboard single visual object tracking. To this end, we combine information extracted from a visual tracker with 3D cues of the observed scene. The 3D reconstruction allows us to estimate the state in the 3D scene space rather than in the 2D image space. Therefore, we can define a 3D transition model that closely reflects the real dynamics of the object.
The potential of the approach is shown on challenging real-world sequences, illustrating typical occlusion situations captured from a low-altitude UAV. The experiments demonstrate that the presented framework has several advantages and is viable for UAV onboard visual object tracking. We can effectively handle object occlusions, small object sizes, and the presence of distractors, and we reduce tracking errors caused by ego-motion.
Part of our future work will be to exploit a dense reconstruction rather than a sparse one; to explore different state estimators; to add more context to the scene, such as the layout of the road in the reconstruction; and to integrate real-world coordinates through georeferencing.