In this section, the proposed TPC-Tracker is validated on both simulators and datasets. The visual tracking algorithm is further deployed on a vertical takeoff and landing fixed-wing platform with a flight speed of up to 20 m/s. Experimental results and analyses demonstrate the effectiveness of our method, showcasing its robustness in high-speed motion scenarios and verifying its practical applicability in real-world airborne systems.
5.5. Comparisons Under Latency-Aware Evaluation
To demonstrate the necessity of latency awareness, we evaluated the performance of 14 common trackers under both normal evaluation and latency-aware evaluation. The baselines are divided into two groups: normal trackers and latency-aware trackers. The former group contains SiamFC [20], SiamRPN [21], SiamMask [23], MixFormerV2 [42], HiT [9], TCTracker [35], ARTracker [43], DeconNet [44], SeqTrack [45], MobileSiam-ST [46], and TATracker-DeiT [1]; the latter contains PVT [11] and PVT++ [12].
We append "@La" to a metric name to indicate that the metric is measured under latency-aware evaluation. For fairness, for all methods the conventional metrics are measured on an NVIDIA RTX 4090, while the latency-aware metrics are measured on an NVIDIA Jetson Orin NX. The best result is shown in red and the second best in blue.
UAV123. UAV123 [16] is constructed for low-altitude UAV remote sensing, which enables it to capture a wide variety of real-world scenarios. The dataset contains 123 video clips covering diverse environments such as urban areas, rural landscapes, and natural settings.
As shown in Table 3, TPC-Tracker-Base stands out among all the trackers, achieving the best AUC@La of 61.42 on the UAV123 dataset. While MixFormerV2 attains a higher AUC of 70.41, which reflects its overall accuracy in offline tracking, its AUC@La is only 52.1. Comparing the AUC with the AUC@La, the decrease rate reflects the inference-latency gap. TPC-Tracker-Tiny shows the smallest decline under latency-aware evaluation relative to its normal AUC, decreasing by only 0.4%.
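For concreteness, the decrease (decline) rate used throughout the comparisons is consistent with the relative drop of a metric under latency-aware evaluation. The following sketch (the function name is ours) reproduces the rounding seen in the reported numbers:

```python
def decline_rate(metric, metric_la):
    """Relative change of a metric under latency-aware evaluation,
    rounded to two decimals as in the tables."""
    return round((metric_la - metric) / metric, 2)

# OSTrack on Guidance-UAV: AUC 63.94 vs. AUC@La 52.44
print(decline_rate(63.94, 52.44))  # -> -0.18
```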
VisDrone-SOT. The VisDrone-SOT [17] dataset consists of 288 video clips comprising 261,908 frames, plus 10,209 static images. The data are captured by various drone-mounted cameras and cover a wide range of scenes and conditions, making the dataset likewise suitable for low-altitude remote sensing.
As indicated in Table 3, the results largely mirror those on UAV123. MixFormerV2 achieves the best AUC and precision scores under normal evaluation, but its latency-aware AUC and precision decline by 27% and 29%, respectively, because it only infers at 7 FPS on the Jetson Orin NX. TPC-Tracker-Base achieves a latency-aware AUC and precision of 70.06% and 70.96%, respectively. Its decrease rate is lower than TPC-Tracker-Tiny's even though its FPS is lower.
Guidance-UAV. The Guidance-UAV [8] dataset, which focuses on visual guidance for UAVs, consists of 16 video clips comprising 3209 frames.
As shown in Table 3, the trackers exhibit varying performance on this dataset. For instance, OSTrack achieves an AUC of 63.94% but an AUC@La of only 52.44%, a decline rate of 0.18, showing that even a strong offline tracker degrades under the complex visual cues and time-sensitive requirements of Guidance-UAV. TPC-Tracker-Base and TPC-Tracker-Large are jointly the best trackers on AUC@La, while TPC-Tracker-Tiny has the lowest decline rate.
Speed. According to the results in Table 3, the speed of trackers without predictors determines their decline rate under latency-aware evaluation. When a tracker's FPS exceeds the video frame rate, the decline rate is only about 5%, because the results lag by just one frame. When the FPS falls below the video frame rate, the smaller the FPS, the larger the decline rate. This prevents some high-accuracy trackers, such as MixFormerV2, from being used for real-time tracking.
Note. The latency-aware metrics and FPS reported here apply only to testing on the NVIDIA Jetson Orin NX 8G; if the device changes, the FPS and the latency-aware metrics change accordingly.
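The one-frame-lag behavior can be made precise: under latency-aware evaluation, each displayed frame is scored against the most recent tracker output that had already finished by its display time. A minimal sketch of that matching rule (function and variable names are ours, timestamps illustrative):

```python
import bisect

def latency_aware_results(display_ms, finish_ms, boxes):
    """For each frame display time, return the latest tracker output whose
    processing finished no later than that time (None before the first)."""
    out = []
    for t in display_ms:
        i = bisect.bisect_right(finish_ms, t) - 1  # last result done by time t
        out.append(boxes[i] if i >= 0 else None)
    return out

# A 30 FPS stream (frames every ~33 ms) with a tracker needing 100 ms/frame:
display = [0, 33, 66, 100, 133, 166]
finish = [t + 100 for t in display]   # each result arrives 100 ms late
print(latency_aware_results(display, finish, list(range(6))))
# -> [None, None, None, 0, 1, 2]
```

A tracker faster than the stream would lag by at most one index here, which is why its latency-aware metrics drop only slightly.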
5.6. Visualization on Predictive Tracking
To further understand TPC-Tracker, we visualize in Figure 8 the tracking results of the latency-aware trackers TPC-Tracker, PVT, and PVT++ on Guidance-UAV. Our tracker is designed to capture not only visual temporal features but also global target motion information, going beyond short-term motion. This capability yields notably more accurate tracking, particularly on sequences with large and abrupt motion.
As shown in Figure 8, in the early stage of the sequences the three predictive trackers all stay close to the ground truth: the object is far from the UAV, so its relative motion within the frame is slow. In the later stage, however, they diverge. PVT is the worst method because it relies solely on a Kalman filter without taking temporal features into account; its prediction is essentially the previous frame's box, which is too small and offset. Our TPC-Tracker stands out among all the methods because it models the entire motion process.
5.8. Ablation Study and Analysis
5.8.1. Components Analysis
To verify the effectiveness of each module in our tracker, we removed the Visual Memory Module (VMM) and the M3 respectively from TPC-Tracker-Base. The results are shown in Table 5. The VMM significantly improves the AUC of target tracking, yet it cannot reduce the decline ratio. The Predictor (M3) cannot improve the AUC, but it enhances the AUC@La by reducing the decline ratio.
In the case of the UAV123 dataset, when only the Predictor is added, while the AUC remains the same as the baseline without any module addition, the AUC@La shows a notable improvement, with the decline ratio decreasing from −0.05 to −0.02. This clearly demonstrates the Predictor’s capacity to better handle the target’s long-term associations and mitigate the performance degradation over time, even if it does not directly boost the initial AUC value.
On the other hand, when only the VMM is incorporated, we observe a significant jump in the AUC from 58.34 to 63.32, validating its crucial role in capturing the visual temporal cues that improve object identification and localization. However, as expected, the decline ratio remains relatively stable at −0.06: while the VMM excels at improving offline accuracy, it cannot counteract the latency-induced performance drop as effectively as the M3.
When both the VMM and the M3 are combined, we achieve the best of both worlds: the AUC@La reaches 61.42 with a decline ratio of just −0.02, alongside a solid AUC of 63.32. This synergy shows how the two modules complement each other, with the VMM laying the foundation for accurate tracking and the Predictor ensuring stability as latency accumulates over the sequence.
A similar trend holds on the Guidance-UAV dataset. Adding the Predictor improves the AUC@La and reduces the decline ratio, while the VMM boosts the overall AUC. Once again, combining both modules yields the best metrics, underlining the importance of carefully designed and integrated components for a tracker intended for diverse UAV remote sensing applications.
5.8.2. Experiments on Different Latency
Cameras on UAVs run at different frame rates depending on the remote sensing task. We therefore conducted experiments on predictive tracking at different playback speeds, i.e., the fps captured by the camera, testing the tracker at 10, 20, 30, 40, 50, and 60 fps.
The procedure is as follows: one thread plays the video at the given fps, while another thread runs TPC-Tracker. The input to TPC-Tracker is the picture being displayed at the current moment, and the result recorded for that picture is the most recent output the tracker had produced when the picture started playing. No deliberate alignment between the two threads is needed.
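The two-thread protocol above can be sketched as follows (names and timings are illustrative, not the actual test harness):

```python
import threading
import time

def async_eval(frames, track_fn, fps):
    """Play frames at `fps` in one thread while the tracker free-runs in
    another; record, for each displayed frame, the latest finished output."""
    state = {"frame": None, "box": None}
    stop = threading.Event()
    recorded = []

    def tracker_loop():
        while not stop.is_set():
            f = state["frame"]
            if f is not None:
                state["box"] = track_fn(f)   # may take longer than 1/fps

    worker = threading.Thread(target=tracker_loop, daemon=True)
    worker.start()
    for f in frames:
        state["frame"] = f
        recorded.append(state["box"])        # previous result, no alignment
        time.sleep(1.0 / fps)
    stop.set()
    worker.join()
    return recorded
```

Because the playback thread never waits for the tracker, a slow `track_fn` simply means each recorded box comes from an older frame, which is exactly the latency that the different-speed experiments measure.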
As shown in Table 6, several clear trends emerge. On the UAV123 dataset, at the relatively low speed of 10 fps the tracker achieves an AUC of 63.32 and an AUC@La of 62.68, a decline rate of only −0.01, indicating stable tracking with minimal deviation. As the speed increased to 20 fps, the AUC remained 63.32 while the AUC@La slightly decreased to 61.43, a decline rate of −0.02.
Even when the playback speed exceeds the tracker's inference speed, the latency-aware metric decreases by only 0.05. This indicates that the prediction module effectively alleviates the latency problem: by making reasonable predictions, it compensates for the gap between the playback speed and the inference speed and keeps performance degradation minimal, demonstrating its importance to the overall adaptability and reliability of the tracking system.
5.9. Real-World Remote Sensing Experiments
We also carried out real-world tests to further validate the practicality and robustness of our algorithm. The test site was a civilian airport, a complex real-world environment with potential interferences such as wind, other flying objects, and ground structures. The experimental platform was a vertical takeoff and landing (VTOL) fixed-wing aircraft, which combines helicopter-like vertical takeoff and landing with the efficient long-distance flight of a fixed-wing aircraft. During the tests, the aircraft flew at an altitude of 70 m with an airspeed of 20 m/s, simulating the high-speed, long-range flight common in practical applications. The network camera on the aircraft used H.264 encoding; to save transmission bandwidth, we set the bit rate to the lowest level. This decision caused some loss during transmission and decoding, resulting in blurred images. Such image-quality degradation is typical of many practical scenarios, making our tests more representative. The target tracking results are presented in Figure 9.
The proposed TPC-Tracker-Base demonstrates superior tracking performance across all metrics in real-world scenarios. As shown in
Table 7, it achieves state-of-the-art results with 62.11% AUC and 58.41% Norm.P, outperforming PVT++ by 1.12% and 1.93%, respectively. Notably, PVT++ itself improves on its baseline PVT by 0.91% AUC and 2.24% Norm.P, indicating the effectiveness of its algorithmic refinements. Despite the blurred images caused by the low-bit-rate encoding and the high-speed motion of the aircraft, TPC-Tracker still accurately predicts and tracks the target in the aerial images captured by the fast-moving UAV. Compared with the earlier simulator tests, the real-world tests introduced more uncertainties: wind affected the stability of the aircraft's flight, and the complex ground environment increased the difficulty of target recognition. TPC-Tracker nevertheless maintained a high level of tracking accuracy, which is crucial for practical applications such as search and rescue, surveillance, and inspection. In conclusion, the simulator and real-world tests together demonstrate that TPC-Tracker performs well not only in a controlled simulation environment but also in complex, harsh real-world scenarios, providing a reliable solution for UAV-based target tracking.