Article

High-Order Temporal Context-Aware Aerial Tracking with Heterogeneous Visual Experts

1 School of Information Communication Engineering, Beijing Information Science and Technology University, Beijing 100192, China
2 School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(13), 2237; https://doi.org/10.3390/rs17132237
Submission received: 9 May 2025 / Revised: 21 June 2025 / Accepted: 27 June 2025 / Published: 29 June 2025

Abstract

Visual tracking from the unmanned aerial vehicle (UAV) perspective lies at the core of many low-altitude remote sensing applications. Most aerial trackers follow the “tracking-by-detection” paradigm or its temporal-context-embedded variants, where only the visual appearance cue is exploited for representation learning and for estimating the spatial likelihood of the target. However, the variation of the target appearance among consecutive frames is inherently unpredictable, which degrades the robustness of the temporal context-aware representation. To address this concern, we advocate an extra visual motion cue exhibiting predictable temporal continuity for a complete temporal context-aware representation and introduce a dual-stream tracker involving explicit heterogeneous visual tracking experts. Our technical contributions are three-fold: (1) a high-order temporal context-aware representation integrates motion and appearance cues over a temporal context queue, (2) bidirectional cross-domain refinement enhances feature representation through cross-attention-based mutual guidance, and (3) consistent decision-making allows for anti-drifting localization via dynamic gating and failure-aware recovery. Extensive experiments on four UAV benchmarks (UAV123, UAV123@10fps, UAV20L, and DTB70) show that our method outperforms existing aerial trackers in terms of success rate and precision, particularly in occlusion and fast-motion scenarios. Such superior tracking stability highlights its potential for real-world UAV applications.

Graphical Abstract

1. Introduction

Aerial tracking is devoted to estimating the states of a target (e.g., position and scale) over videos captured by an onboard camera on an unmanned aerial vehicle (UAV) platform. Benefiting from the unique bird’s-eye view and wide field of view, aerial tracking has found wide utility in a plethora of earth observation applications [1,2], such as environmental monitoring [3], disaster management [4,5], and precision agriculture. Because of the inevitable mobility of the UAV, there is an inherent need for aerial trackers that are robust to appearance variations of the target caused by viewpoint changes, frequent occlusions, fast motion, and illumination shifts, to name but a few.
One drastic way to overcome appearance variation is the straightforward exploration of the “tracking-by-detection” paradigm, where discriminative appearance representations are learned to differentiate target candidates from complex terrain clutter frame by frame. Early efforts on discriminative correlation filter (DCF)-based trackers [6,7,8,9] are treated as major milestones; their progress is attributed to approximating geometric translation of appearance by circularly shifting the target template and then simplifying the Moore–Penrose inverse-based learning stage with element-wise operations in the Fourier domain. Such shallow appearance classifiers with fast online parameter adaptation admit deployment on resource-constrained UAV platforms. Further, motivated by modern data-driven representation learning, the Siamese-based tracker and its variants [10,11] establish over-parametric metrics regarding target appearance, where visual similarity between the initial exemplar and a larger search image is densely evaluated with cross-correlation. Although deep trackers tend to be criticized for time-consuming feedforward computation, their “uniform” convolution operations can be executed in parallel with the help of onboard hardware accelerators, thus balancing the tradeoff between offline appearance discriminability and online inference efficiency.
An alternative but even more rational tracking strategy is to work with temporal contexts of the visual appearance. Instead of performing “instance detection” for each frame independently, temporal context-aware trackers can benefit from consecutive-frame information and progressively update appearance parameters to handle target appearance variations or adjacent similar distractors over time. For instance, DCF-based methods [12,13] integrate temporal regularization, enforcing appearance consistency over time and adaptively updating templates to accommodate tracking drifts. Memory-augmented trackers [14,15] instantiate external memories to store historical target states, mitigating appearance shifts by recalling past appearances and long-term patterns. Also, the empirical success of transformer-based detectors has inspired recent top-performing trackers [16,17] to leverage the best of both worlds: deep-learned appearance features and the attention-based long-range dependency of spatial–temporal contexts.
While effective, the aforementioned aerial trackers still suffer from the inherent unpredictability of target appearance among consecutive frames. Different from the geometric motion cue, which follows physical rules and exhibits temporal continuity, the target appearance, as a semantic description from the aerial perspective, is more sensitive to challenging aerial attributes: (1) severe viewpoint variations from UAV maneuvers cause drastic appearance changes, (2) illumination conditions fluctuate significantly in outdoor environments, (3) partial or complete occlusions temporarily eliminate robust appearance cues, and (4) appearance-similar objects in cluttered backgrounds create ambiguity, as shown in Figure 1. Although motion cues (e.g., the optical flow field) have been incorporated into several temporal context-aware trackers [18,19,20], most of these explorations are restricted to auxiliary inference for appearance association. Such under-utilization of temporal contexts results from two technical issues: a naive temporal invariance assumption and the inherent heterogeneity between the appearance and motion cues.
To address these concerns, we present a high-order temporal context-aware representation for jointly exploiting heterogeneous visual cues in a Siamese-based aerial tracking framework. Beyond appearance consistency constraints between adjacent frames, we explicitly predict and recalibrate the similarity map in the latent representation space with the appearance and motion cues of targets over a temporal context queue. The specific contributions of this paper are summarized as follows:
  • High-Order Temporal Context-Aware Representation: We propose a dual-stream framework that enriches motion and appearance features with temporal contexts. Motion features, extracted from optical flow and motion-history images (MHIs), capture dynamic trajectories, while appearance features, refined through temporal-adaptive convolutions, adapt to appearance variation. Both enhanced cues improve robustness against drifting or disruption commonly encountered in aerial tracking.
  • Heterogeneous Feature Representation Refinement: We leverage a bidirectional cross-domain interaction that equally refines the motion and appearance features. Motion guides appearance to ensure temporal consistency, while appearance calibrates motion to suppress noise, creating a cohesive representation resilient to occlusions and complex backgrounds.
  • Consistent Decision Fusion with Smooth Constraint: We introduce a decision module that dynamically judges the refined feature pairs with velocity-based gating and enforces consistent inference with smooth spatial–temporal constraints so as to avoid unreasonable decision switches between the heterogeneous visual experts.
Extensive evaluations on UAV benchmarks demonstrate that our method achieves a 0.616 success rate on UAV123, outperforming pure appearance-based trackers like SiamRPN [21] and SiamAPN++ [11] by 0.035 and 0.037, respectively, in general scenarios. In challenging occlusion scenarios, our approach maintains a 0.563 success rate, surpassing TCTrack [16] by 0.056. The motion stream contributes significantly to the performance gain, particularly in fast-motion cases, where we achieve a 0.725 success rate compared to 0.637 without motion modeling, validating the necessity of explicit motion context integration.
The remainder of this paper is organized as follows. Section 2 introduces the related works. Section 3 details our proposed framework and methodology. The performance comparison with other SOTA trackers is reported in Section 4, including experimental setup, quantitative results across four UAV benchmarks, and qualitative analysis and ablation studies. Discussion is presented in Section 5. Finally, conclusions and future work are presented in Section 6.

2. Related Works

2.1. Visual Tracking in Aerial Remote Sensing

Early explorations of aerial tracking typically employed a correlation filter (CF) [6,8] or its variants due to their remarkable computational efficiency (i.e., moderate FLOPs and memory consumption). Technically, CF trackers approximate the conventional dense sampling strategy by circularly shifting the target template and thus learn the appearance filter in the Fourier domain using the property of the circulant matrix. However, due to the lack of a robust feature representation of the target appearance, the learned correlation filter, as a shallow binary decision hyperplane, often struggles to discriminate the target from complex backgrounds.
Recently, the advancement of deep representation learning has revolutionized visual tracking with the Siamese-based paradigm. SiamFC [10] pioneered this direction by formulating tracking as appearance similarity learning between a template and a search region. Subsequent works like SiamRPN++ [21] and SiamAPN++ [11] incorporated region proposal networks and attention mechanisms to enhance localization accuracy. While these approaches demonstrated effectiveness on aerial footage, their exclusive reliance on appearance representation inevitably makes them sensitive to distractors (i.e., similar objects) adjacent to the true target.

2.2. Temporal Context in Visual Tracking

Beyond the straightforward “tracking-by-detection” paradigm, temporal contexts have attracted considerable attention and have been integrated into existing visual tracking frameworks. For instance, STRCF [12] and ARCF [22] introduced spatial–temporal regularization in correlation filters to enforce consistency among consecutive frames. For the purpose of efficient convex optimization, they typically focus on penalizing abrupt appearance changes (i.e., smooth constraints) rather than capturing explicit temporal dynamics.
Recent deep-learning-based trackers have attempted to conduct sequence modeling through various strategies. Template update mechanisms, such as UpdateNet [13], employ a dedicated network to predict template updates based on historical tracking results. This approach utilizes cumulative learning, where new templates are computed as weighted combinations of previous ones. However, when visual cues become unreliable due to changes in viewpoint or occlusion, such cumulative updates still result in tracking drift. Memory-augmented tracking methods, such as MemTrack [14] and STMTrack [15], introduce external memory mechanisms. MemTrack maintains memory banks to store historical target representations, enabling the retrieval of past appearances during challenging scenarios. STMTrack extends this concept with space–time memory networks that jointly model spatial and temporal contexts across multiple time scales. These trackers can maintain the target representation during temporary visual interruptions and use the memory to help recover from occlusion in UAV scenes. Recent transformer-based trackers, such as TCTrack [16], leverage multi-level temporal convolutions and attention mechanisms to capture long-range dependencies. While these methods demonstrate effectiveness in temporal context modeling, they primarily focus on appearance-based temporal consistency rather than explicit motion–appearance fusion.
The aforementioned trackers motivate us to utilize extra visual cues with high-order constraints against intractable appearance variations.

2.3. Motion Cue in Visual Tracking

Compared with the variable appearance, the motion cue can be treated as a more predictable and reliable temporal context for visual tracking. Unfortunately, unlike the multi-object tracking task, where explicit motion modeling is routine for target association, single-object trackers rarely fully leverage the motion cue. Conventionally, early efforts employed Kalman or particle filters to represent the motion cue, yet these are inadequate for complex aerial dynamics. For single-object tracking, FlowTrack [18] incorporates optical flow as an additional input stream along with RGB data (i.e., raw appearance). More recently, approaches such as TrackLSTM [19] and ATOM [23] have advanced temporal modeling by employing recurrent networks or Intersection-over-Union (IoU) prediction along the temporal dimension, respectively. TrackLSTM demonstrated how LSTMs can implicitly learn complex temporal dependencies from feature sequences, offering improved robustness to short-term occlusions. ATOM, by predicting IoU over time, refines bounding box estimation with temporal awareness. These methods predominantly perform implicit motion analysis rather than explicitly exploiting motion as an independent predictive signal.
However, most of these methods treat motion as an individual or auxiliary cue, limiting robustness to occlusions or fast motion. The rapid viewpoint changes from UAV maneuvers demand trackers capable of handling drastic appearance variations, where motion-based cues can provide trajectory consistency when appearance cues become unreliable. In scenarios characterized by fast target movement or low frame rates, appearance information between consecutive frames can be scarce or heavily blurred. Motion cues can effectively predict the target’s trajectory and infer its likely position, compensating for sparse appearance data and mitigating the impact of motion blur and rapid ego-motion of the UAV. This concern inspires us to treat motion and appearance equally as complementary temporal contexts, enabling robust tracking when either cue degrades.

2.4. Heterogeneous Feature Fusion in Visual Tracking

The heterogeneous nature of appearance and motion representations makes it challenging to handle both cues in one unified framework. Straightforward concatenation [24] or weighted summation [25] often dilutes motion contributions in appearance-dominant scenarios. Attention-based fusion, as in SiamAttn [26] and SiamGAT [27], emphasizes spatial contexts but neglects the interaction between motion and appearance. Transformer-based approaches like TransT [28] and STARK [17] only enhance the feature integration module to improve tracking robustness. Additionally, MixFormer [29] effectively captures the dynamic changes of the target over long time series through multi-scale feature fusion and spatio-temporal relationship modeling. TATrack [30] introduces an attention-based multimodal feature fusion strategy, adaptively adjusting the influence of different modalities. However, these methods still focus on appearance without dedicated motion representations.
In contrast to previous approaches, we treat appearance and motion as equally important cues and enable them to interact with a bidirectional cross-domain interaction module, where the query, key, and value tuple could be inherently heterogeneous. When appearance becomes unreliable during occlusions, our motion stream can guide target localization. Conversely, when motion becomes ambiguous due to background distractors, appearance features provide discriminative filtering. As such, the bidirectional reasoning mechanism, combined with our adaptive temporal enhancement strategy, enables more robust tracking in challenging aerial scenarios where both appearance and motion may undergo variations.

3. Proposed Method

In this section, we instantiate a Siamese-based aerial tracking framework illustrated in Figure 2. It consists of three modules: high-order temporal-aware representation, heterogeneous feature refinement, and consistent decision fusion with smooth constraint. Each of the modules will be detailed in the following subsections.

3.1. High-Order Temporal-Aware Representation

This module encodes both motion and appearance cues across consecutive frames to lay the foundation for the feature representation of targets. Specifically, by analyzing a short sequence of preceding frames with dual-branch architectures, it could capture high-order temporal contexts on both dynamic motion patterns and appearance variations. Such high-order temporal-aware representation enhances the tracker’s robustness to occlusions, motion blur, and viewpoint variations from the aerial view.

3.1.1. Motion Representation

To capture the target’s motion trajectory, we first represent the motion pattern with optical flow on the motion branch. Given two consecutive frames $I_{t-1}$ and $I_t$, the Farneback algorithm [31] yields a dense optical flow field at each pixel:
$$\mathbf{v}(x, y) = \mathrm{Farneback}(I_{t-1}, I_t)$$
where $\mathbf{v}(x, y)$ represents the pixel-wise optical flow vector, capturing the displacement at location $(x, y)$. This fine-resolution motion cue allows the tracker to distinguish the movement of the target from background clutter.
Then, to integrate longer-term motion contexts, we construct a motion-history image (MHI) by accumulating recent motion responses:
$$\mathrm{MHI}_t(x, y) = \begin{cases} \tau, & \text{if } |\mathbf{v}(x, y)| > \theta \\ \max\left(0,\ \mathrm{MHI}_{t-1}(x, y) - \delta\right), & \text{otherwise} \end{cases}$$
where $\tau$ represents the current timestamp, $\theta$ is a threshold for identifying noticeable motion, and $\delta$ is the temporal decay factor. This formulation amplifies and attends to recent motion responses while suppressing historical ones.
The obtained optical flow and MHI are then fed into a conventional convolutional neural network (CNN) to extract high-order temporal-aware motion features $f_{\mathrm{mot}} \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of channels and $H \times W$ denotes the spatial dimensions.
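To make the motion branch concrete, the following is a minimal sketch of the flow/MHI computation built on OpenCV’s Farneback implementation and a small PyTorch encoder. The flow parameters, the threshold $\theta$, the decay $\delta$, and the intermediate channel widths are illustrative assumptions; only the 384-channel output and the 11 × 11→5 × 5→3 × 3 cascade are taken from Section 4.3.

```python
import cv2
import numpy as np
import torch
import torch.nn as nn

def update_mhi(prev_gray, curr_gray, mhi, tau, theta=1.0, delta=1.0):
    """Dense Farneback flow between two grayscale frames, then the MHI update."""
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    mag = np.linalg.norm(flow, axis=2)                       # |v(x, y)|
    # MHI_t = tau where motion is noticeable, otherwise decay the previous value.
    mhi = np.where(mag > theta, float(tau), np.maximum(0.0, mhi - delta))
    return flow, mhi

# Illustrative motion encoder over the stacked flow (2 ch) and MHI (1 ch); only the
# 384-channel output and the 11x11 -> 5x5 -> 3x3 cascade follow the paper.
motion_cnn = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=2), nn.ReLU(),
    nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(),
    nn.Conv2d(256, 384, kernel_size=3, stride=1), nn.ReLU(),
)

def motion_features(flow, mhi):
    x = np.concatenate([flow, mhi[..., None]], axis=2)        # H x W x 3
    x = torch.from_numpy(x).float().permute(2, 0, 1)[None]    # 1 x 3 x H x W
    return motion_cnn(x)                                      # f_mot
```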

3.1.2. Appearance Representation

In parallel, the appearance branch encodes the target’s spatial layout. To be specific, we define an initial feature map $f_{\mathrm{app}}^{0} \in \mathbb{R}^{C \times H \times W}$, extracted with a base convolutional layer, to represent the target appearance (e.g., local texture and structural patterns) within the current frame $I_t$.
To accommodate temporal appearance variations, we utilize a temporal-adaptive convolution (TadaConv) [32] for processing the following frames. It modulates the convolution kernel dynamically based on inter-frame differences:
$$K_t = K_{\mathrm{base}} + \mathrm{Conv}_{\mathrm{adapt}}(I_t - I_{t-1})$$
where $K_{\mathrm{base}}$ is a fixed kernel and $\mathrm{Conv}_{\mathrm{adapt}}$ produces a residual correction conditioned on the latest appearance change. Thus, the feature map of subsequent frames, $f_{\mathrm{app}} = K_t * f_{\mathrm{app}}^{0} \in \mathbb{R}^{C \times H \times W}$, remains sensitive to appearance shifts among consecutive frames, such as shadow shifts or pose deformation, enabling reliable appearance-based feature discrimination over time.
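The kernel modulation can be sketched as follows. This is a simplified stand-in for TAdaConv [32] under the assumption that the adaptation branch reduces the frame difference to one residual scalar per output channel; it is not the authors’ exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAdaptiveConv(nn.Module):
    """Base kernel plus a residual correction predicted from the frame difference."""
    def __init__(self, feat_ch=64, k=3):
        super().__init__()
        self.k_base = nn.Parameter(torch.randn(feat_ch, feat_ch, k, k) * 0.01)
        # Conv_adapt: maps I_t - I_{t-1} to one residual scalar per output channel
        # (an assumed simplification of the adaptation branch).
        self.conv_adapt = nn.Sequential(
            nn.Conv2d(3, feat_ch, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, f_app0, frame_t, frame_prev):
        # K_t = K_base + Conv_adapt(I_t - I_{t-1})
        delta = self.conv_adapt(frame_t - frame_prev)          # B x feat_ch x 1 x 1
        k_t = self.k_base + delta.mean(dim=0).view(-1, 1, 1, 1)
        # f_app = K_t * f_app^0: convolve the base feature map with the adapted kernel.
        return F.conv2d(f_app0, k_t, padding=1)
```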

3.1.3. Temporal Context-Aware Enhancement

This stage refines the current frame features (i.e., $f_{\mathrm{app}}^{t}$ and $f_{\mathrm{mot}}^{t}$, both in $\mathbb{R}^{C \times H \times W}$) by incorporating historical cues from the previous $L$ frames. All the cues are collected as $F_{\mathrm{app}}^{\mathrm{history}} = \{f_{\mathrm{app}}^{t-1}, f_{\mathrm{app}}^{t-2}, \ldots, f_{\mathrm{app}}^{t-L}\}$ and $F_{\mathrm{mot}}^{\mathrm{history}} = \{f_{\mathrm{mot}}^{t-1}, f_{\mathrm{mot}}^{t-2}, \ldots, f_{\mathrm{mot}}^{t-L}\}$, forming the temporal contexts that capture appearance changes and motion consistency.
To refine the temporal contexts effectively, we employ a multi-head attention mechanism that enables the model to dynamically reweight relevant patterns from the past. Specifically, the enhanced appearance and motion features are computed as follows:
$$f_{\mathrm{app}}^{*} = \mathrm{MultiHeadAttention}\left(f_{\mathrm{app}}^{t},\ F_{\mathrm{app}}^{\mathrm{history}},\ F_{\mathrm{app}}^{\mathrm{history}}\right)$$
$$f_{\mathrm{mot}}^{*} = \mathrm{MultiHeadAttention}\left(f_{\mathrm{mot}}^{t},\ F_{\mathrm{mot}}^{\mathrm{history}},\ F_{\mathrm{mot}}^{\mathrm{history}}\right)$$
Additionally, the attention mechanism is defined as follows:
$$\mathrm{MultiHeadAttention}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}$$
$$\mathrm{head}_i = \mathrm{Softmax}\!\left(\frac{(Q W_i^{Q})(K W_i^{K})^{\top}}{\sqrt{d_k}}\right)(V W_i^{V})$$
where $d_k = C / h$ and $h$ denotes the number of heads. Each attention head specializes in distinct appearance and motion cues, thereby enriching the current features with long-term contextual semantics.
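As a shape-level illustration, the enhancement step can be written with PyTorch’s nn.MultiheadAttention, treating each spatial location of the current feature map as a query token and the concatenated historical features as keys and values. The flattening layout below is an assumption about how the temporal context queue is organized.

```python
import torch
import torch.nn as nn

def temporal_enhance(f_curr, f_history, num_heads=8):
    """f_curr: B x C x H x W current features; f_history: list of L tensors of the same shape."""
    B, C, H, W = f_curr.shape                                  # C must be divisible by num_heads
    # In the tracker this module would be created once and trained; it is built
    # here only to illustrate the query/key/value layout.
    attn = nn.MultiheadAttention(embed_dim=C, num_heads=num_heads, batch_first=True)

    q = f_curr.flatten(2).transpose(1, 2)                      # B x (HW) x C, query = current frame
    kv = torch.cat([f.flatten(2).transpose(1, 2) for f in f_history], dim=1)  # B x (L*HW) x C
    enhanced, _ = attn(q, kv, kv)                              # key = value = temporal context queue
    return enhanced.transpose(1, 2).view(B, C, H, W)           # f* enriched with history
```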

3.2. Heterogeneous Feature Refinement

Given the high-order temporal context-aware representations of motion and appearance, this module involves the interaction and refinement between the heterogeneous features $f_{\mathrm{app}}^{*}$ and $f_{\mathrm{mot}}^{*}$. Unlike conventional fusion representations that yield a single-modality output, we are devoted to obtaining two refined representations $f_{\mathrm{mot}}^{\mathrm{refined}}$ and $f_{\mathrm{app}}^{\mathrm{refined}}$ with bidirectional feature interaction, where the motion and appearance features guide each other.

3.2.1. Feature Projection Embedding

Due to the inconsistent feature distributions, it would be intractable to align the heterogeneous appearance and motion features. For the purpose of effective cross-domain interaction, we introduce two learnable linear projection matrices that embed the heterogeneous features into a shared Hilbert space, where the embeddings admit computational equivalence. The embedding process is formulated as
$$f_{\mathrm{app}}^{\mathrm{proj}} = W_{\mathrm{app}} f_{\mathrm{app}}^{*} + b_{\mathrm{app}}$$
$$f_{\mathrm{mot}}^{\mathrm{proj}} = W_{\mathrm{mot}} f_{\mathrm{mot}}^{*} + b_{\mathrm{mot}}$$
where $W_{\mathrm{app}}, W_{\mathrm{mot}} \in \mathbb{R}^{C \times C}$ denote learnable projection matrices, $b_{\mathrm{app}}, b_{\mathrm{mot}} \in \mathbb{R}^{C}$ represent bias vectors, and $C$ specifies the dimension of the unified embedding space. The projections $f_{\mathrm{app}}^{\mathrm{proj}}, f_{\mathrm{mot}}^{\mathrm{proj}} \in \mathbb{R}^{C \times H \times W}$ are subsequently reshaped into sequences in $\mathbb{R}^{(HW) \times C}$ for the following bidirectional cross-domain interaction refinement.

3.2.2. Bidirectional Cross-Domain Interaction

We employ a bidirectional cross-domain interaction that allows the motion features $f_{\mathrm{mot}}^{\mathrm{proj}}$ and appearance features $f_{\mathrm{app}}^{\mathrm{proj}}$ to guide each other within a unified embedding space. The specific attention mapping, as illustrated in Figure 3, consists of two parallel branches: appearance-guided motion refinement and motion-guided appearance refinement.
Appearance-Guided Motion Refinement. In this branch, we treat the appearance projections $f_{\mathrm{app}}^{\mathrm{proj}}$ as the query (i.e., $Q$) and the motion projections $f_{\mathrm{mot}}^{\mathrm{proj}}$ as the key and value (i.e., $K$ and $V$), enabling appearance to guide the refinement of motion:
$$f_{\mathrm{mot}}^{\mathrm{refined}} = \mathrm{Attention}\left(Q = f_{\mathrm{app}}^{\mathrm{proj}},\ K = f_{\mathrm{mot}}^{\mathrm{proj}},\ V = f_{\mathrm{mot}}^{\mathrm{proj}}\right)$$
where the attention function is defined as
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{C}}\right) V$$
Equation (10) compares each appearance feature vector with all motion features, identifying regions where semantic content and motion patterns are most correlated. Unlike weighted summation or concatenation, the cross-attention mechanism introduces a learnable alignment between the feature spaces. Appearance features, being structurally stable and semantically expressive, provide spatial priors that steer the attention toward motion patterns consistent with the target’s visual identity.
Motion-Guided Appearance Refinement. Conversely, we designate $f_{\mathrm{mot}}^{\mathrm{proj}}$ as the query (i.e., $Q$) and $f_{\mathrm{app}}^{\mathrm{proj}}$ as the key and value (i.e., $K$ and $V$), allowing motion features to guide appearance ones:
$$f_{\mathrm{app}}^{\mathrm{refined}} = \mathrm{Attention}\left(Q = f_{\mathrm{mot}}^{\mathrm{proj}},\ K = f_{\mathrm{app}}^{\mathrm{proj}},\ V = f_{\mathrm{app}}^{\mathrm{proj}}\right)$$
Since motion captures the continuity of object trajectories even under changing viewpoints or lighting conditions, it provides robust guidance for target association. Through mapping motion queries to appearance regions, the model could suppress visually misleading features (e.g., clutter or distractors) and attend to the trajectory-consistent targets.
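A compact sketch of the projection and the two cross-attention branches is given below; the token layout (B × HW × C) and module organization are assumptions, and only the scaled dot-product form defined above is reproduced.

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Shared-space projection followed by mutual cross-attention between the two cues."""
    def __init__(self, c=256):
        super().__init__()
        self.proj_app = nn.Linear(c, c)     # W_app, b_app
        self.proj_mot = nn.Linear(c, c)     # W_mot, b_mot
        self.scale = c ** 0.5

    def cross_attn(self, q, k, v):
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(C)) V
        w = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        return w @ v

    def forward(self, f_app, f_mot):
        # f_app, f_mot: B x (HW) x C token sequences from the temporal enhancement stage.
        app = self.proj_app(f_app)
        mot = self.proj_mot(f_mot)
        f_mot_refined = self.cross_attn(app, mot, mot)   # appearance-guided motion refinement
        f_app_refined = self.cross_attn(mot, app, app)   # motion-guided appearance refinement
        return f_app_refined, f_mot_refined
```

In practice, the two refined sequences would be reshaped back to $C \times H \times W$ before entering the decision fusion stage.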

3.3. Consistent Decision Fusion with Smooth Constraint

This decision-making module dynamically integrates the refined feature representation of motion and appearance under smooth optimization constraint so as to avoid unreasonable decision switches between the two heterogeneous visual experts. It follows a three-stage pipeline: dynamic feature gating for context-aware feature selection, adaptive decision fusion for temporal consistency, and failure-aware recovery against disruptions (e.g., long-term occlusions and interruptions), as shown in Figure 4.

3.3.1. Dynamic Feature Gating

To dynamically judge the appearance and motion cues in UAV tracking, we introduce a gating network that adaptively modulates the contributions of the appearance features $f_{\mathrm{app}}^{\mathrm{refined}}$ and motion features $f_{\mathrm{mot}}^{\mathrm{refined}}$. The gating is conditioned on two factors: the normalized target velocity $V_t = \frac{|p_t - p_{t-1}|}{\Delta t}$ and the appearance confidence $C_t = \max(S_{\mathrm{init}})$. The induced gating weights, $\alpha_t$ and $\beta_t$, are expressed as
$$g_t^{\mathrm{mot}} = \sigma\left(W_m [V_t,\ \nabla V_t,\ C_{t-1}] + b_m\right)$$
$$g_t^{\mathrm{app}} = \sigma\left(W_a [C_t,\ \nabla C_t,\ V_{t-1}] + b_a\right)$$
$$\alpha_t = \frac{\exp(g_t^{\mathrm{app}})}{\exp(g_t^{\mathrm{app}}) + \exp(g_t^{\mathrm{mot}})}$$
$$\beta_t = 1 - \alpha_t$$
where $\sigma(\cdot)$ is the sigmoid activation, $W_m$, $W_a$, $b_m$, and $b_a$ are learnable linear weights and biases, and $\nabla$ denotes the temporal gradient. The proposed dynamic feature gating mechanism (a minimal sketch is given after the list below) offers three core advantages:
  • Robust context-driven feature selection. The motion gating response is more sensitive to short-term velocity changes, enabling swift adaptation to rapid movements, while the appearance gating favors stable scenes, ensuring visual consistency when motion is less informative.
  • Temporal smoothness. Incorporating temporal gradients suppresses erratic cue shifts when the video contents change slowly (e.g., cruise or hover mode).
  • Historical complementarity optimization. Historical confidence and velocity weighting enable the model to offset short-term degradation in either stream by relying on the complementary one.
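The gating can be sketched as follows, with the temporal gradients approximated by first-order differences; the module layout is an assumption rather than the exact implementation.

```python
import torch
import torch.nn as nn

class DynamicGating(nn.Module):
    """Gating weights from velocity/confidence cues and their temporal gradients."""
    def __init__(self):
        super().__init__()
        self.w_m = nn.Linear(3, 1)   # W_m, b_m over [V_t, grad V_t, C_{t-1}]
        self.w_a = nn.Linear(3, 1)   # W_a, b_a over [C_t, grad C_t, V_{t-1}]

    def forward(self, v_t, v_prev, c_t, c_prev):
        grad_v, grad_c = v_t - v_prev, c_t - c_prev             # first-order temporal gradients
        g_mot = torch.sigmoid(self.w_m(torch.stack([v_t, grad_v, c_prev], dim=-1)))
        g_app = torch.sigmoid(self.w_a(torch.stack([c_t, grad_c, v_prev], dim=-1)))
        alpha = torch.exp(g_app) / (torch.exp(g_app) + torch.exp(g_mot))  # appearance weight alpha_t
        return alpha, 1.0 - alpha                                # (alpha_t, beta_t)
```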

3.3.2. Smooth Decision Fusion

Sudden fluctuation in the gating weights often results in erratic predictions of target states. Here, we introduce an adaptive decision fusion strategy that constrains the positions within physically plausible trajectories.
More specifically, the fused representation at time t integrates both current modality-specific features and historical fused ones:
$$f_{\mathrm{fused}}^{t} = \alpha_t f_{\mathrm{mot}}^{\mathrm{refined}} + \beta_t f_{\mathrm{app}}^{\mathrm{refined}} + \sum_{k=1}^{K} \gamma_k f_{\mathrm{fused}}^{t-k}$$
where $\gamma_k = \lambda e^{-k/\tau}$ (with $\lambda = 0.5$, $\tau = 3$) controls the temporal decay rate, assigning greater weight to the latest frames while gradually attenuating the influence of historical ones. The fused features $f_{\mathrm{fused}}^{t}$ are utilized to generate the target response map $S_t(p)$, which encodes spatial confidence scores for candidate locations.
To stabilize the predicted position, we formulate tracking as a regularized optimization problem. At each timestep, the predicted position $p_t$ is obtained by solving the following:
$$p_t = \arg\max_{p} \sum_{i=0}^{N-1} w_i S_{t-i}(p) - \eta \lVert p - p_{t-1} \rVert_2^{2}$$
where $S_{t-i}(p)$ denotes the response map at time $t-i$, and the weights $w_i = 2^{-i} / \sum_{k} 2^{-k}$ favor recent frames. Additionally, the spatial regularization term $\eta \lVert p - p_{t-1} \rVert_2^{2}$ suppresses abrupt positional variation.
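The localization step can be sketched as an exhaustive search over the aggregated response map; the normalization of $w_i$ and of the spatial penalty are assumed readings of the formulation above rather than the exact implementation.

```python
import numpy as np

def smooth_localize(response_maps, p_prev, eta=0.7):
    """p_t = argmax_p sum_i w_i * S_{t-i}(p) - eta * ||p - p_{t-1}||^2.

    response_maps: list of H x W score maps, index 0 = current frame.
    p_prev: previous position as (row, col).
    """
    n = len(response_maps)
    w = 2.0 ** -np.arange(n)
    w /= w.sum()                                               # w_i = 2^{-i} / sum_k 2^{-k}
    combined = sum(wi * s for wi, s in zip(w, response_maps))  # temporal aggregation

    h, wdt = combined.shape
    ys, xs = np.mgrid[0:h, 0:wdt]
    # Penalty in normalized coordinates so eta is scale-independent (an assumption).
    penalty = ((ys - p_prev[0]) / h) ** 2 + ((xs - p_prev[1]) / wdt) ** 2
    score = combined - eta * penalty
    return np.unravel_index(np.argmax(score), score.shape)     # p_t as (row, col)
```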

3.3.3. Failure-Aware Recovery

In the case of drastic appearance variations, we introduce two motion-guided strategies that allow tracking to continue:
  • Confidence-Aware Fallback: If the appearance cue is unreliable (e.g., $C_t < 0.4$), we preserve the gating weight from the previous frame ($\alpha_t \leftarrow \alpha_{t-1}$) and estimate the target’s position using a short-term average of recent velocities, $p_t = p_{t-1} + \frac{1}{3}\sum_{i=1}^{3} V_{t-i}$. This strategy assumes that the target continues along a consistent path over a short period, effectively bridging visual disruptions.
  • Velocity-Guided Prediction: If the appearance confidence drops below 0.2 (e.g., severe occlusion), we update the position based on extrapolated motion using $p_t = p_{t-1} + \hat{V}_t$. Here, $\hat{V}_t$ is a weighted average of the past three velocities, computed as $\mathrm{EMA}(V_{t-3:t-1}, \alpha = 0.9)$. This method extends the motion trend to estimate the target’s location when the appearance cue is unavailable, as sketched in the code example below.
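A minimal sketch of both recovery rules follows; the thresholds come from the text above, while the EMA recursion and data layout are illustrative assumptions.

```python
import numpy as np

def recover_position(p_prev, velocities, conf, alpha_prev):
    """Failure-aware recovery; returns (predicted position, gating weight to reuse).

    velocities: list of recent per-frame displacement vectors, newest last.
    """
    v3 = np.asarray(velocities[-3:], dtype=float)              # last three velocities
    if conf < 0.2:                                             # severe occlusion
        v_hat = v3[0]
        for v in v3[1:]:                                       # simple EMA, alpha = 0.9
            v_hat = 0.9 * v + 0.1 * v_hat
        return np.asarray(p_prev) + v_hat, alpha_prev          # velocity-guided prediction
    if conf < 0.4:                                             # unreliable appearance cue
        return np.asarray(p_prev) + v3.mean(axis=0), alpha_prev  # short-term average velocity
    return None, None                                          # appearance reliable; no recovery
```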

4. Experimental Results

In this section, we report the details of our experiment. To fully prove the effectiveness of our aerial tracker, we evaluate our method on four well-established aerial tracking benchmarks, i.e., UAV123 [33], UAV20L [33], UAV123@10fps [33], and DTB70 [34].

4.1. Dataset

UAV123 [33] consists of 123 sequences collected under diverse real-world conditions, including urban roads, industrial sites, highways, sports fields and rural areas, as shown in Figure 5a. It captures diverse challenges including lighting fluctuations, abrupt target motion, partial occlusion, object shape deformation, complex background, and size transformation. It features a wide variety of targets such as pedestrians, vehicles, cyclists, and groups and is representative of common UAV applications like surveillance, traffic monitoring, and infrastructure inspection.
UAV123@10fps [33] contains 123 aerial video sequences captured by UAV-mounted cameras at 10 frames per second (FPS), simulating real-world motion and processing constraints. It covers a wide range of scenarios including urban, rural, and aquatic environments and features various target types such as vehicles and pedestrians, as shown in Figure 5b. With over 113K frames annotated with bounding boxes and attribute tags, UAV123@10fps provides a challenging benchmark for evaluating tracker robustness under reduced frame rates and energy-efficient operational requirements.
UAV20L [33] consists of 20 long-duration UAV-captured sequences, each averaging nearly 3000 frames, with the longest exceeding 5000 frames. It covers a wide range of scenarios including expansive forests, rugged mountains, coastal regions, agricultural lands, and urban outskirts, as shown in Figure 5c. As the sequence length increases, challenges such as object deformation, scale variation, illumination changes, and viewpoint shifts become more pronounced. It focuses on long-term tracking challenges such as persistent occlusion, target reappearance, and background clutter, simulating real-world missions that require sustained attention over extended durations.
DTB70 [34] is a drone-tracking benchmark with 70 sequences capturing vehicles, pedestrians, and animals in diverse scenes, as shown in Figure 5d. It includes 41,000 annotated frames and poses challenges such as fast motion, occlusion, and camera viewpoint changes, making it suitable for assessing tracker robustness under UAV-specific constraints such as motion blur and real-time processing. Emphasizing environmental and object diversity, it covers scenarios involving marine vessels, animals, and vehicles, captured across waterfronts, farmlands, urban intersections, and other dynamic outdoor scenes.

4.2. Evaluation Metrics

The aerial tracker is assessed by a one-pass evaluation across all the previously mentioned datasets, with precision and success rate as the primary evaluation criteria. The precision plot presents the percentage of bounding boxes whose center location error (CLE) is less than a given threshold. CLE is used to measure the Euclidean distance between the estimated object location and the ground-truth bounding box center. Additionally, the success rate is employed to measure the IoU of the estimated area and the ground-truth bounding box. The success plot indicates the ratio of the number of frames whose IoU is greater than the given threshold to the total number of frames. The area under the curve (AUC) is applied to rank the success rate of trackers.
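For reference, the standard one-pass evaluation quantities described above can be computed as follows; this is a generic sketch of the protocol (with the customary 20-pixel precision threshold and 21 IoU thresholds), not code from the paper.

```python
import numpy as np

def center_error(pred, gt):
    """CLE between box centers; boxes are (x, y, w, h)."""
    pc = np.array([pred[0] + pred[2] / 2.0, pred[1] + pred[3] / 2.0])
    gc = np.array([gt[0] + gt[2] / 2.0, gt[1] + gt[3] / 2.0])
    return float(np.linalg.norm(pc - gc))

def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[0] + a[2], b[0] + b[2]), min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def precision_at(center_errors, threshold=20.0):
    """Precision: fraction of frames whose CLE is below the pixel threshold."""
    return float((np.asarray(center_errors) < threshold).mean())

def success_auc(ious, thresholds=np.linspace(0, 1, 21)):
    """Success values over IoU thresholds and their AUC (mean over thresholds)."""
    ious = np.asarray(ious)
    success = np.array([(ious > t).mean() for t in thresholds])
    return success, float(success.mean())
```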

4.3. Implementation Details

Our framework is implemented based on a temporally enhanced AlexNet backbone [35]. It integrates dynamic temporal convolutions (kernel size = (3, 1, 1)) for adaptive feature calibration and a motion branch with cascaded convolutions (11 × 11→5 × 5→3 × 3) to generate 384-channel motion descriptors. The training process employs a three-stage curriculum spanning 100 epochs on VID [36], LaSOT [37], and GOT-10K [38] datasets. These datasets are chosen for their complementary characteristics that align with UAV tracking requirements. VID provides diverse object categories with temporal annotations, essential for learning robust appearance representations across different target types commonly encountered in aerial footage (vehicles, persons, animals, and objects). LaSOT offers long-term tracking scenarios (average 2500 frames per sequence) that are crucial for learning motion pattern consistency and temporal context modeling required for extended UAV missions. GOT-10K ensures cross-category generalization through its one-shot evaluation protocol, preventing overfitting to specific object classes and enabling robust performance on novel targets encountered in real-world UAV applications. This learning strategy progressively builds from basic appearance discrimination (VID) through temporal consistency modeling (LaSOT) to general robustness (GOT-10K), aligning with the hierarchical nature of our heterogeneous feature fusion architecture.
We initialize learning rates at 0.005 with log-space decay to 0.00005, using the SGD optimizer (momentum = 0.9, weight_decay = 0.0001) and a batch size of 32 distributed across two NVIDIA GeForce RTX 2080Ti GPUs. In the first 10 epochs, we freeze the first three convolutional blocks, and training is focused on the motion head with a higher learning rate ($5 \times 10^{-3}$). Subsequently, the core temporal adaptation layers are unfrozen with layer-specific learning rate ratios, introducing temporal dropout ($p = 0.1$) and flow perturbation ($\mathcal{N}(0, 0.1)$) to enhance temporal robustness.
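A minimal sketch of the stated optimizer and log-space learning-rate schedule is given below; the dummy model stands in for the tracker network, and the stage-wise freezing and layer-specific ratios are omitted.

```python
import numpy as np
import torch

# Dummy stand-in for the tracker network, purely for illustration.
model = torch.nn.Conv2d(3, 8, kernel_size=3)

optimizer = torch.optim.SGD(model.parameters(), lr=0.005,
                            momentum=0.9, weight_decay=0.0001)

# Log-space decay from 5e-3 to 5e-5 over 100 epochs.
epochs = 100
lrs = np.logspace(np.log10(0.005), np.log10(0.00005), epochs)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda e: lrs[min(e, epochs - 1)] / 0.005)

for epoch in range(epochs):
    # ... training loop over batches would go here ...
    scheduler.step()
```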

4.4. Quantitative Evaluation

To evaluate the effectiveness of the proposed aerial tracker, we conducted extensive experiments on four UAV benchmarks: UAV123, UAV20L, UAV123@10fps, and DTB70. A total of 38 state-of-the-art (SOTA) trackers were included for fair comparison, covering spatio-temporal context methods (e.g., TCTrack [16] and DeepSTRCF [12]), Siamese-based methods (e.g., SiamAPN++ [11] and SiamDW [39]), DCF-based approaches (e.g., SRDCF [9] and KCF [8]), feature fusion trackers (e.g., ECO_gpu [6] and CCOT [40]), and motion-aware models (e.g., LCT [41] and ARCF [22]). The results are obtained by re-implementing the official codes with their corresponding hyperparameters.
As shown in Figure 6, our method consistently outperforms all compared trackers, achieving precision scores of 0.812, 0.737, 0.787, and 0.828 and success rates of 0.616, 0.552, 0.601, and 0.629 on the UAV123, UAV20L, UAV123@10fps, and DTB70 benchmarks, respectively. To provide a comprehensive quantitative comparison, Table 1 presents detailed precision and success rates for all compared methods on the DTB70 benchmark. These results surpass those of all competing methods and demonstrate the effectiveness of our high-order temporal context-aware representation, which equally encodes motion patterns and appearance features within a unified architecture. This representation enables robust target perception under diverse aerial scenarios, effectively mitigating challenges such as occlusion, motion blur, temporal discontinuity, and background interference, and significantly enhancing tracking stability and accuracy.
To further evaluate the robustness of our tracker under challenging conditions, we conduct additional tests on four representative scenarios selected from the UAV20L dataset, namely, full occlusion, partial occlusion, background clutter, and out-of-view. As shown in Figure 7, our method consistently achieves the best performance in both success rate and precision across all cases. Specifically, in full occlusion scenarios, our method achieves a 0.353 success rate compared to SiamAPN++’s 0.344, demonstrating the effectiveness of motion-guided recovery when appearance cues are completely unavailable. For partial occlusion, we maintain a 0.543 success rate with 0.735 precision, outperforming appearance-only methods like SiamAPN++ with its 0.506 success rate. The fusion of high-order motion patterns with spatial features allows the tracker to infer plausible target positions even in the absence of reliable visual observations. Additionally, in more complex conditions where both appearance and motion cues are compromised, such as background clutter and out-of-view cases, our approach still exhibits superior performance compared to other trackers. In background clutter scenarios, our approach achieves a 0.334 success rate, surpassing TCTrack by 0.053. For out-of-view cases, our failure-aware recovery enables a 0.593 success rate. This is owing to the fact that our appearance–motion fusion mechanism enables the model to maintain tracking robustness despite severe distractors or the sudden disappearance of the target.

4.5. Qualitative Evaluation

As illustrated in Figure 8, we conduct a qualitative comparison between our aerial tracker and other leading methods on the UAV123 dataset.
To illustrate the effectiveness of our method under challenging tracking conditions, such as complex backgrounds, frequent short-term occlusions, and target coverage by other objects, we select (a) Bike3, (b) Car9, and (c) Group2 as representative sequences, respectively. One can see that our method demonstrates robust tracking performance across these scenarios.
In the Bike3 sequence, the bicycle moves through a cluttered scene with pedestrians, poles, and other similar objects. These distractions could mislead appearance-based aerial trackers. Our method leverages MHIs to capture the target motion patterns, allowing us to distinguish the moving target from static distractions. As shown in Figure 8a, our tracker accurately tracks the target, while others drift toward the background.
In the Car9 sequence, the car is frequently occluded by billboards and obstacles. Our velocity-adaptive gating increases the influence of motion features during occlusion, using recent velocity to estimate position. Additionally, the spatio-temporal refinement ensures smooth tracking. As shown in Figure 8b, our tracker maintains accurate localization, avoiding the drift seen in appearance-only methods.
In the Group2 sequence, the person is partially hidden by branches, leaving limited appearance cues. Our method addresses this by refining motion and appearance features through bidirectional attention. As shown in Figure 8c, our tracker accurately recovers the bounding box, avoiding drift and scale errors common in appearance-based methods.

4.6. Ablation and Hyperparameter Sensitivity Study

To validate the effectiveness of each component in our proposed tracking framework, we conduct ablation and hyperparameter sensitivity studies on the UAV123, UAV20L, and DTB70 benchmarks. The analysis consists of four aspects: (1) component-wise validation of the overall results, (2) sensitivity analysis of MHI parameters, (3) exploration of cross-attention head numbers, and (4) evaluation of different smooth constraint parameters.

4.6.1. Component Ablation of Overall Framework

We analyze the contributions of (1) motion feature and MHI, (2) bidirectional cross-attention mechanisms, and (3) smooth constraint. Starting with a baseline model (appearance-only Siamese tracker), we progressively incorporate our proposed modules to quantify their individual and combined effects. Table 2 presents the results on UAV123 and DTB70 datasets. Adding our motion feature stream with MHI-based temporal encoding improves success rates by 0.016 and 0.015 on UAV123 and DTB70, respectively, highlighting the importance of explicit motion modeling. The bidirectional cross-attention mechanism further enhances performance by facilitating effective information exchange between appearance and motion streams. Finally, the spatio-temporal refinement module contributes an additional gain of 0.009 and 0.015 in success rates, demonstrating the importance of trajectory consistency constraints.

4.6.2. Sensitivity to MHI Parameters

Then, we investigate different configurations of our motion feature extraction module, specifically analyzing the impact of MHI parameters and temporal window size. Table 3 shows the effects of different decay parameters ( τ ) and temporal windows (L) on tracking performance under challenging scenarios.
The results demonstrate that motion-history encoding significantly improves tracking performance in challenging scenarios, particularly for fast motion (+0.114) and occlusion (+0.157) when using optimal parameters ( τ = 3, L = 3). A decay factor of τ = 3 achieves the best balance between short-term responsiveness and long-term motion pattern preservation. Increasing the temporal window to L = 5 slightly decreases performance, likely due to the introduction of irrelevant historical motion when the direction of target movement changes.

4.6.3. Sensitivity to Cross-Attention Head Number

Additionally, we evaluate different bidirectional cross-domain interaction configurations in our bidirectional fusion module to understand how appearance and motion features should optimally interact. Table 4 shows that h = 8 achieves best performance in complex background and viewpoint changes.

4.6.4. Sensitivity to Smooth Constraint Parameters

To analyze the impact of different parameters in our smooth constraint module during long-term tracking, we specifically conducted experiments on UAV20L, focusing on the exponential decay coefficient ( λ ) and regularization strength ( η ) for trajectory smoothness constraints. The results in Table 5 demonstrate how these parameters affect tracking performance in long-term tracking scenarios.
The results confirm that the constraints significantly enhance tracking robustness, particularly in challenging scenarios like occlusion recovery (+0.091) and motion blur (+0.057). The optimal configuration uses λ = 0.5 for temporal decay and η = 0.7 for spatial regularization, balancing historical trajectory information with current frame observations. Excessive regularization ( η = 0.9) reduces adaptability to sudden motion changes, while insufficient constraints ( η = 0.5) provide inadequate stability during occlusions.

5. Discussion

In this paper, we propose a high-order temporal context-aware tracker that incorporates a failure-aware recovery mechanism to alleviate common tracking failures. However, certain scenarios still pose considerable challenges to the failure-aware recovery mechanism. One such case arises when multiple similar-looking objects move at comparable speeds (e.g., vehicles in traffic). As seen in Figure 9a, both motion and appearance cues become ambiguous, making it difficult to maintain identity consistency. Since our velocity-based prediction relies on the assumption of consistent motion patterns, it may fail to distinguish between targets when such cues are unreliable or conflicting. Another limitation occurs under prolonged or complete occlusion. Without updated visual feedback, motion-based predictions rely solely on outdated cues. Over time, small errors accumulate, leading to temporal drift and unreliable recovery, especially when the target reappears at a position far from the predicted trajectory. This limitation is illustrated in Figure 9b, where the predicted position gradually drifts and fails to reconnect with the actual target once it re-emerges. These cases underscore the limitations of relying solely on local motion history. To address such challenges, future work will explore more robust long-term recovery strategies, such as incorporating global re-detection or re-identification modules.
In addition, the high-order temporal context-aware tracker, by integrating heterogeneous visual cues (appearance and motion) and temporal context-aware representation, provides a robust foundation for modeling dynamic targets with varying appearances in challenging aerial environments. Our design has enabled us to achieve highly effective single-object tracking (SOT), consistently maintaining target localization and minimizing drift amidst complex factors like viewpoint changes, occlusions, and rapid motion. However, the current framework is specifically tailored for SOT and lacks the fundamental components required for multiple object tracking (MOT). Extending our method to MOT directly presents significant challenges, as it not only necessitates sophisticated spatial reasoning across multiple objects but also demands robust mechanisms for maintaining inter-frame identity consistency and effectively handling complex inter-target occlusions and interactions. Future work will, therefore, focus on adapting our core modules within a comprehensive MOT framework, potentially incorporating tracking-by-detection paradigms or graph-based association models to address these crucial multiple-object complexities.
Lastly, we acknowledge that our tracker also introduces a non-negligible computational burden. While our aerial tracker achieves promising performance in challenging scenarios, it incurs a computational cost of 24.6 GFLOPs, exceeding that of leading trackers such as SiamDW (12.9 GFLOPs) [39] and SiamGAT (17.28 GFLOPs) [27]. This complexity primarily stems from the additional optical flow computation and the convolutional operations involved in heterogeneous feature fusion. Nevertheless, the computational costs of all the aforementioned trackers remain within the same order of magnitude. Additionally, the computational cost of our method can be accommodated by edge computing modules. Taking the NVIDIA Jetson AGX Orin as an example, its FP32 computing throughput reaches 5.3 TFLOPs, which is two orders of magnitude higher than our model’s FLOPs. In other words, the proposed method consumes only 0.5 percent of the available computational resources, leaving a substantial margin for a reliable implementation. In summary, we consider this computational burden worthwhile for real-time applications. Future work will focus on further reducing the computational burden for deployment on more resource-constrained aerial platforms without sacrificing accuracy.

6. Conclusions

In this work, we proposed a high-order temporal context-aware tracker to address the instability of visual tracking on UAV platforms, particularly under challenging conditions such as cluttered backgrounds, severe occlusions, and drastic appearance variations. Specifically, we equally encoded appearance and motion cues and leveraged motion-history images to learn high-order temporal representations. Additionally, a bidirectional cross-attention mechanism was introduced to refine these heterogeneous features, yielding a cohesive and complementary representation. Then, a velocity-based gating strategy, coupled with consistent inference under smooth spatial–temporal constraints, was designed to guide decision fusion and ensure stable localization. Finally, we conducted extensive experiments on four UAV benchmarks to validate the effectiveness and robustness of the proposed method.

Author Contributions

Conceptualization, S.Z.; methodology, X.F.; software, X.F.; validation, S.Z. and X.F.; formal analysis, S.Z.; investigation, Z.W.; resources, Y.Z.; data curation, Y.Z.; writing—original draft preparation, S.Z. and X.F.; writing—review and editing, S.Z., X.F. and Z.W.; visualization, X.F.; supervision, W.W.; project administration, S.Z.; funding acquisition, S.Z. and W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the National Natural Science Foundation of China (NSFC) under Grant 62201068, the School Foundation of BISTU under Grant 9162424116 and the Beijing Institute of Technology Young Faculty Academic Startup Program under Grant 3050012222525.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Gaffey, C.; Bhardwaj, A. Applications of unmanned aerial vehicles in cryosphere: Latest advances and prospects. Remote Sens. 2020, 12, 948. [Google Scholar] [CrossRef]
  2. Sishodia, R.P.; Ray, R.L.; Singh, S.K. Applications of remote sensing in precision agriculture: A review. Remote Sens. 2020, 12, 3136. [Google Scholar] [CrossRef]
  3. Manfreda, S.; McCabe, M.F.; Miller, P.E.; Lucas, R.; Pajuelo Madrigal, V.; Mallinis, G.; Ben Dor, E.; Helman, D.; Estes, L.; Ciraolo, G.; et al. On the use of unmanned aerial systems for environmental monitoring. Remote Sens. 2018, 10, 641. [Google Scholar] [CrossRef]
  4. Jorge, V.A.; Granada, R.; Maidana, R.G.; Jurak, D.A.; Heck, G.; Negreiros, A.P.; Dos Santos, D.H.; Gonçalves, L.M.; Amory, A.M. A survey on unmanned surface vehicles for disaster robotics: Main challenges and directions. Sensors 2019, 19, 702. [Google Scholar] [CrossRef] [PubMed]
  5. Hamdi, Z.M.; Brandmeier, M.; Straub, C. Forest damage assessment using deep learning on high resolution remote sensing data. Remote Sens. 2019, 11, 1976. [Google Scholar] [CrossRef]
  6. Danelljan, M.; Bhat, G.; Shahbaz Khan, F.; Felsberg, M. ECO: Efficient Convolution Operators for Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6638–6646. [Google Scholar]
  7. Galoogahi, H.K.; Fagg, A.; Lucey, S. Learning background-aware correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1135–1143. [Google Scholar]
  8. Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern. Anal. Mach. Intell. 2015, 37, 583–596. [Google Scholar] [CrossRef] [PubMed]
  9. Danelljan, M.; Häger, G.; Khan, F.S.; Felsberg, M. Learning spatially regularized correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4310–4318. [Google Scholar]
  10. Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-convolutional siamese networks for object tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 850–865. [Google Scholar]
  11. Cao, Z.; Fu, C.; Ye, J.; Li, B.; Li, Y. SiamAPN++: Siamese attentional aggregation network for real-time UAV tracking. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 1–7. [Google Scholar]
  12. Li, F.; Tian, C.; Zuo, W.; Zhang, L.; Yang, M.H. Learning spatial-temporal regularized correlation filters for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4904–4913. [Google Scholar]
  13. Zhang, L.; Gonzalez-Garcia, A.; van de Weijer, J.; Danelljan, M.; Khan, F.S. Learning the model update for siamese trackers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4010–4019. [Google Scholar]
  14. Yang, T.; Chan, A.B. Learning dynamic memory networks for object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 152–167. [Google Scholar]
  15. Fu, Z.; Liu, Q.; Fu, Z.; Wang, Y. STMTrack: Template-free visual tracking with space-time memory networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13774–13783. [Google Scholar]
  16. Cao, Z.; Huang, Z.; Pan, L.; Zhang, S.; Liu, Z.; Fu, C. TCTrack: Temporal contexts for aerial tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14778–14788. [Google Scholar]
  17. Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning spatio-temporal transformer for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10448–10457. [Google Scholar]
  18. Zhu, Z.; Wu, W.; Zou, W.; Yan, J. End-to-end flow correlation tracking with spatial-temporal attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 548–557. [Google Scholar]
  19. Moudgil, A.; Gandhi, V. Long-term tracking in the wild: A benchmark. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 670–686. [Google Scholar]
  20. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468. [Google Scholar]
  21. Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. SiamRPN++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4277–4286. [Google Scholar]
  22. Huang, Z.; Fu, C.; Li, Y.; Lin, F.; Lu, P. Learning aberrance repressed correlation filters for real-time UAV tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2891–2900. [Google Scholar]
  23. Danelljan, M.; Bhat, G.; Khan, F.S.; Felsberg, M. ATOM: Accurate tracking by overlap maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4660–4669. [Google Scholar]
  24. Zhu, X.; Xiong, Y.; Dai, J.; Yuan, L.; Wei, Y. Deep feature flow for video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2349–2358. [Google Scholar]
  25. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  26. Yu, Y.; Xiong, Y.; Huang, W.; Scott, M.R. Deformable siamese attention networks for visual object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6728–6737. [Google Scholar]
  27. Guo, D.; Shao, Y.; Cui, Y.; Wang, Z.; Zhang, L.; Shen, C. Graph attention tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9543–9552. [Google Scholar]
  28. Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8126–8135. [Google Scholar]
  29. Cui, Y.; Jiang, C.; Wang, L.; Wu, G. Mixformer: End-to-end tracking with iterative mixed attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13608–13618. [Google Scholar]
  30. He, K.; Zhang, C.; Xie, S.; Li, Z.; Wang, Z. Target-aware tracking with long-term context attention. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 773–780. [Google Scholar]
  31. Farnebäck, G. Two-Frame Motion Estimation Based on Polynomial Expansion. In Image Analysis: 13th Scandinavian Conference, SCIA 2003 Halmstad, Sweden, 29 June–2 July 2003; Springer: Berlin/Heidelberg, Germany, 2003; pp. 363–370. [Google Scholar]
  32. Huang, Z.; Zhang, S.; Pan, L.; Qing, Z.; Tang, M.; Liu, Z.; Ang, M.H., Jr. TAda! Temporally-Adaptive Convolutions for Video Understanding. arXiv 2021, arXiv:2110.06178. [Google Scholar]
  33. Mueller, M.; Smith, N.; Ghanem, B. A benchmark and simulator for UAV tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 445–461. [Google Scholar]
  34. Li, S.; Yeung, D.Y. Visual object tracking for unmanned aerial vehicles: A benchmark and new motion models. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar]
  35. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25. [Google Scholar] [CrossRef]
  36. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  37. Fan, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Bai, H.; Xu, Y.; Liao, C.; Ling, H. Lasot: A high-quality benchmark for large-scale single object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5374–5383. [Google Scholar]
  38. Huang, L.; Zhao, X.; Huang, K. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1562–1577. [Google Scholar] [CrossRef] [PubMed]
  39. Zhang, Z.; Peng, H. Deeper and wider siamese networks for real-time visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4591–4600. [Google Scholar]
  40. Danelljan, M.; Robinson, A.; Shahbaz Khan, F.; Felsberg, M. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 472–488. [Google Scholar]
  41. Ma, C.; Yang, X.; Zhang, C.; Yang, M.H. Long-term correlation tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5388–5396. [Google Scholar]
  42. Wang, N.; Song, Y.; Ma, C.; Zhou, W.; Liu, W.; Li, H. Unsupervised deep tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1308–1317. [Google Scholar]
  43. Ma, C.; Huang, J.B.; Yang, X.; Yang, M.H. Hierarchical convolutional features for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3074–3082. [Google Scholar]
  44. Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8971–8980. [Google Scholar]
  45. Wang, N.; Zhou, W.; Tian, Q.; Hong, R.; Wang, M.; Li, H. Multi-cue correlation filters for robust visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4844–4853. [Google Scholar]
  46. Fu, C.; Cao, Z.; Li, Y.; Ye, J.; Feng, C. Siamese anchor proposal network for high-speed aerial tracking. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 510–516. [Google Scholar]
  47. Zhu, G.; Wang, J.; Wu, Y.; Lu, H. Collaborative Correlation Tracking. In Proceedings of the British Machine Vision Conference, Swansea, UK, 7–10 September 2015; pp. 184.1–184.12. [Google Scholar]
  48. Danelljan, M.; Häger, G.; Khan, F.S.; Felsberg, M. Discriminative scale space tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1561–1575. [Google Scholar] [CrossRef] [PubMed]
  49. Bertinetto, L.; Valmadre, J.; Golodetz, S.; Miksik, O.; Torr, P.H. Staple: Complementary learners for real-time tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1401–1409. [Google Scholar]
  50. Danelljan, M.; Hager, G.; Shahbaz Khan, F.; Felsberg, M. Adaptive decontamination of the training set: A unified formulation for discriminative visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1430–1438. [Google Scholar]
  51. Li, Y.; Fu, C.; Ding, F.; Huang, Z.; Lu, G. AutoTrack: Towards high-performance visual tracking for UAV with automatic spatio-temporal regularization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11923–11932. [Google Scholar]
  52. Li, Y.; Zhu, J. A scale adaptive kernel correlation filter tracker with feature integration. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 254–265. [Google Scholar]
  53. Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; Hu, W. Distractor-aware siamese networks for visual object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 101–117. [Google Scholar]
  54. Danelljan, M.; Shahbaz Khan, F.; Felsberg, M.; Van de Weijer, J. Adaptive color attributes for real-time visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1090–1097. [Google Scholar]
  55. Wang, C.; Zhang, L.; Xie, L.; Yuan, J. Kernel cross-correlator. Proc. AAAI Conf. Artif. Intell. 2018, 32. [Google Scholar] [CrossRef]
  56. Li, F.; Yao, Y.; Li, P.; Zhang, D.; Zuo, W.; Yang, M.H. Integrating boundary and center correlation filters for visual tracking with aspect ratio variation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2001–2009. [Google Scholar]
  57. Li, X.; Ma, C.; Wu, B.; He, Z.; Yang, M.H. Target-aware deep tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1369–1378. [Google Scholar]
  58. Lukezic, A.; Vojir, T.; Čehovin Zajc, L.; Matas, J.; Kristan, M. Discriminative correlation filter with channel and spatial reliability. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6309–6318. [Google Scholar]
  59. Zhang, T.; Xu, C.; Yang, M.H. Multi-task correlation particle filter for robust object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4335–4343. [Google Scholar]
Figure 1. Illustration of scenarios where the appearance cue is unreliable. (a) Drastic viewpoint variations, (b) illumination fluctuations in outdoor environments, (c) occlusion scenarios where the appearance cue is partially or completely lost, and (d) background clutter with similar distractors.
Figure 2. Overview of our proposed aerial tracking framework (# denotes the number of video frames; * denotes features processed by MultiHead Attention; the red box marks the target's position). It comprises three crucial components. (1) High-order temporal-aware representation captures dynamic motion patterns and appearance variations over consecutive frames, enabling robust feature encoding. (2) Heterogeneous feature refinement enhances the interaction between motion and appearance through bidirectional cross-domain attention. (3) Consistent decision fusion with a smoothness constraint adaptively balances the heterogeneous features via dynamic gating and adaptive fusion, enforcing temporal coherence and enabling failure-aware recovery.
Figure 3. Diagram of our bidirectional cross-domain interaction (* represents the features processed by MultiHead Attention.) (a) Appearance-guided motion refinement and (b) motion-guided appearance refinement.
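The interaction in Figure 3 can be pictured with a short PyTorch sketch: each direction is a cross-attention call in which one stream supplies the queries and the other supplies the keys and values, followed by a residual update. The class name, token layout, embedding width, and the residual/normalization choices below are illustrative assumptions, not the exact architecture of our tracker.

```python
import torch
import torch.nn as nn

class BidirectionalCrossRefinement(nn.Module):
    """Mutual refinement of motion and appearance tokens via cross-attention.

    Hypothetical sketch: one attention block per direction, with residual
    connections and layer normalization (design choices assumed here).
    """

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # (a) appearance-guided motion refinement: motion queries attend to appearance
        self.app_to_mot = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # (b) motion-guided appearance refinement: appearance queries attend to motion
        self.mot_to_app = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_mot = nn.LayerNorm(dim)
        self.norm_app = nn.LayerNorm(dim)

    def forward(self, motion_tokens, appearance_tokens):
        # motion_tokens, appearance_tokens: (B, N, dim)
        refined_mot, _ = self.app_to_mot(query=motion_tokens,
                                         key=appearance_tokens,
                                         value=appearance_tokens)
        refined_app, _ = self.mot_to_app(query=appearance_tokens,
                                         key=motion_tokens,
                                         value=motion_tokens)
        # residual update keeps each stream's own evidence
        motion_out = self.norm_mot(motion_tokens + refined_mot)
        appearance_out = self.norm_app(appearance_tokens + refined_app)
        return motion_out, appearance_out


# usage with dummy token sequences (e.g., a flattened 16x16 feature map)
if __name__ == "__main__":
    block = BidirectionalCrossRefinement(dim=256, num_heads=8)
    mot = torch.randn(1, 256, 256)
    app = torch.randn(1, 256, 256)
    mot_out, app_out = block(mot, app)
    print(mot_out.shape, app_out.shape)
```

Swapping which stream provides the queries is what distinguishes appearance-guided motion refinement (a) from motion-guided appearance refinement (b).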
Figure 4. Consistent decision fusion module ("…" indicates that several frames have elapsed). This module fuses motion and appearance features through dynamic feature gating, smooth decision fusion, and failure-aware fallback to locate the target.
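As a rough illustration of the fusion logic in Figure 4, the NumPy sketch below gates two expert response maps by their peak confidences and falls back to the previous position when both experts are weak. The confidence measure, the softmax gate, and the threshold value are assumptions made for the sketch rather than the exact formulation used in our tracker.

```python
import numpy as np

def fuse_response_maps(resp_app, resp_mot, prev_center, conf_threshold=0.3):
    """Illustrative decision fusion: confidence-weighted gating of two
    response maps plus a failure-aware fallback to the previous location.

    resp_app, resp_mot : HxW response maps of the appearance / motion experts
    prev_center        : (row, col) target center from the previous frame
    conf_threshold     : assumed confidence level below which both experts
                         are considered unreliable
    """
    # peak values serve as simple per-expert confidence scores (assumption)
    conf_app = float(resp_app.max())
    conf_mot = float(resp_mot.max())

    # dynamic gate: softmax over the two confidences
    weights = np.exp([conf_app, conf_mot])
    weights = weights / weights.sum()

    fused = weights[0] * resp_app + weights[1] * resp_mot
    center = np.unravel_index(np.argmax(fused), fused.shape)

    # failure-aware fallback: if both experts are weak, keep the previous
    # location instead of committing to a spurious peak
    if max(conf_app, conf_mot) < conf_threshold:
        center = prev_center
    return center, fused
```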
Figure 5. Sample frames illustrating various scenarios from different datasets. (a) UAV123, (b) UAV123@10fps, (c) UAV20L, and (d) DTB70.
Figure 6. Overall performance of all trackers on four well-known aerial tracking benchmarks, i.e., UAV123, UAV123@10fps, UAV20L, and DTB70. Our tracker achieves superior performance compared to other SOTA trackers.
Figure 7. Evaluation results under different challenging attributes on UAV20L, including full occlusion, partial occlusion, background clutter, and out of view. Our tracker outperforms all leading trackers in both precision and success rate.
Figure 8. Visual tracking results on UAV123. The figure consists of three subsequences, corresponding to challenges posed by (a) background clutter, (b) frequent short-term occlusion, and (c) coverage of the target by other objects.
Figure 9. Representative failure cases of the proposed tracker. (a) Identity switch caused by multiple similar-looking objects with comparable motion patterns. (b) Tracking drift under prolonged occlusion due to accumulated errors in motion-based prediction.
Table 1. Comparison of precision and success rates on DTB70 benchmark. Results presented in bold indicate superior performance.
Trackers | Succ. | Prec. | Trackers | Succ. | Prec.
Ours | 0.629 | 0.828 | UDT [42] | 0.422 | 0.605
TCTrack [16] | 0.621 | 0.815 | ARCF_H [22] | 0.416 | 0.618
SiamAPN++ [11] | 0.594 | 0.792 | CF2 [43] | 0.415 | 0.645
SiamRPN [44] | 0.586 | 0.794 | MCCT_H [45] | 0.405 | 0.606
SiamAPN [46] | 0.585 | 0.786 | BACF [7] | 0.402 | 0.601
CCOT [40] | 0.517 | 0.771 | CoKCF [47] | 0.378 | 0.592
DeepSTRCF [12] | 0.505 | 0.736 | SRDCF [9] | 0.363 | 0.707
ECO_gpu [6] | 0.502 | 0.724 | fDSST [48] | 0.357 | 0.536
SiamDW [39] | 0.492 | 0.727 | Staple_CA [49] | 0.351 | 0.535
MCCT [45] | 0.484 | 0.707 | SRDCFdecon [50] | 0.351 | 0.506
AutoTrack [51] | 0.478 | 0.698 | SAMF_CA [52] | 0.346 | 0.609
DASiamRPN [53] | 0.474 | 0.684 | SAMF [52] | 0.340 | 0.524
ARCF [22] | 0.472 | 0.671 | CN [54] | 0.315 | 0.514
UDTplus [42] | 0.462 | 0.665 | KCC [55] | 0.291 | 0.442
IBCCF [56] | 0.460 | 0.671 | LCT [41] | 0.282 | 0.464
TADT [57] | 0.459 | 0.691 | KCF [8] | 0.280 | 0.470
ECO_HC [6] | 0.453 | 0.645 | DCF [58] | 0.280 | 0.469
CSRDCF [58] | 0.438 | 0.648 | DSST [48] | 0.276 | 0.521
STRCF [12] | 0.437 | 0.651 | Staple [49] | 0.265 | 0.368
MCPF [59] | 0.432 | 0.666 |  |  |
Table 2. Ablation studies on major components of our framework on UAV123 and DTB70 datasets. Results presented in bold indicate superior performance.
Motion Feature Extraction | Heterogeneous Feature Refinement | Consistent Decision Fusion | UAV123 Succ. | UAV123 Prec. | DTB70 Succ. | DTB70 Prec.
× | × | × | 0.575 | 0.765 | 0.586 | 0.794
✓ | × | × | 0.591 | 0.783 | 0.601 | 0.809
✓ | ✓ | × | 0.607 | 0.801 | 0.614 | 0.816
✓ | ✓ | ✓ | 0.616 | 0.812 | 0.629 | 0.828
Table 3. Performance comparison with different MHI parameters under challenging scenarios on UAV123 dataset. Results presented in bold indicate superior performance.
Configuration | Fast Motion Succ. | Fast Motion Prec. | Occlusion Succ. | Occlusion Prec. | Overall Succ. | Overall Prec.
MHI (τ = 0, L = 3) | 0.459 | 0.665 | 0.370 | 0.595 | 0.594 | 0.785
MHI (τ = 1, L = 3) | 0.537 | 0.752 | 0.438 | 0.687 | 0.602 | 0.796
MHI (τ = 3, L = 3) | 0.573 | 0.786 | 0.527 | 0.735 | 0.616 | 0.812
MHI (τ = 5, L = 3) | 0.551 | 0.761 | 0.489 | 0.702 | 0.606 | 0.801
MHI (τ = 3, L = 5) | 0.565 | 0.773 | 0.503 | 0.718 | 0.609 | 0.805
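For readers unfamiliar with motion history images, the following NumPy sketch shows how the two parameters swept in Table 3 typically act in the classical MHI formulation: τ sets how many frames a motion imprint persists before decaying away, and L sets the length of the temporal queue over which frame differences are accumulated. The frame-difference threshold and the normalization are illustrative assumptions, not values from our tracker.

```python
import numpy as np
from collections import deque

def update_mhi(mhi, prev_gray, cur_gray, tau=3, diff_thresh=25):
    """One motion-history-image update step (classical formulation).

    Pixels whose frame difference exceeds diff_thresh are stamped with tau;
    all other pixels decay by one, so a motion imprint persists for tau
    frames. diff_thresh is an illustrative value, not from the paper.
    """
    motion_mask = np.abs(cur_gray.astype(np.int16) -
                         prev_gray.astype(np.int16)) > diff_thresh
    return np.where(motion_mask, float(tau), np.maximum(mhi - 1.0, 0.0))

def build_motion_cue(frames_gray, tau=3, L=3):
    """Accumulate an MHI over the last L buffered frames (temporal queue)."""
    queue = deque(list(frames_gray)[-(L + 1):], maxlen=L + 1)
    mhi = np.zeros_like(queue[0], dtype=np.float32)
    for prev, cur in zip(list(queue)[:-1], list(queue)[1:]):
        mhi = update_mhi(mhi, prev, cur, tau=tau)
    return mhi / max(tau, 1)  # normalize the motion cue to [0, 1]
```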
Table 4. Different configurations of cross-attention head numbers under challenging scenarios on UAV123 dataset. Results presented in bold indicate superior performance.
Head Number | Background Clutter Succ. | Background Clutter Prec. | Viewpoint Change Succ. | Viewpoint Change Prec. | Overall Succ. | Overall Prec.
4 heads | 0.327 | 0.551 | 0.605 | 0.758 | 0.601 | 0.807
8 heads | 0.398 | 0.581 | 0.632 | 0.821 | 0.616 | 0.812
12 heads | 0.362 | 0.563 | 0.617 | 0.793 | 0.603 | 0.804
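The head counts in Table 4 correspond to the num_heads argument of a standard multi-head attention layer; the embedding width must be divisible by the head count, so the width is chosen accordingly. The dimensions below are assumptions for illustration, not our exact configuration.

```python
import torch.nn as nn

# embed_dim must be divisible by num_heads, e.g., 256 = 8 x 32 head channels,
# whereas 12 heads requires a width such as 384 or 768.
cross_attn_8 = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
cross_attn_12 = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
```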
Table 5. Experiment results of different smooth constraint parameters with challenging scenes on UAV20L dataset. Results presented in bold indicate superior performance.
Parameters | Occlusion Succ. | Occlusion Prec. | Motion Blur Succ. | Motion Blur Prec. | Overall Succ. | Overall Prec.
λ = 0, η = 0 | 0.446 | 0.633 | 0.500 | 0.710 | 0.519 | 0.703
λ = 0.3, η = 0.7 | 0.492 | 0.687 | 0.537 | 0.761 | 0.535 | 0.728
λ = 0.5, η = 0.7 | 0.537 | 0.737 | 0.557 | 0.821 | 0.552 | 0.737
λ = 0.7, η = 0.7 | 0.514 | 0.713 | 0.525 | 0.783 | 0.539 | 0.721
λ = 0.5, η = 0.5 | 0.509 | 0.706 | 0.542 | 0.804 | 0.547 | 0.731
λ = 0.5, η = 0.9 | 0.493 | 0.682 | 0.507 | 0.771 | 0.523 | 0.718
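The exact roles of λ and η are defined in the methodology; purely as a schematic reading of Table 5, the sketch below assumes that λ weights an exponential moving average over consecutive decision maps (temporal smoothing) and that η scales the confidence required before a new peak may overwrite the previous estimate. Both roles, and the acceptance rule itself, are assumptions made for illustration.

```python
import numpy as np

def smooth_decision(prev_response, cur_response, prev_center, lam=0.5, eta=0.7):
    """Illustrative smooth-constraint fusion (parameter roles assumed).

    lam : weight of the historical response in an exponential moving average
    eta : fraction of the previous peak confidence that the smoothed peak must
          reach; otherwise the previous center is retained (failure-aware)
    """
    smoothed = lam * prev_response + (1.0 - lam) * cur_response
    peak_conf = float(smoothed.max())
    center = np.unravel_index(np.argmax(smoothed), smoothed.shape)
    if peak_conf < eta * float(prev_response.max()):
        center = prev_center
    return center, smoothed
```

Under this reading, λ = 0 disables temporal smoothing and η = 0 disables the fallback, which matches the first (weakest) row of Table 5.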
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
