Electronics
  • Article
  • Open Access

14 October 2025

Real-Time Occluded Target Detection and Collaborative Tracking Method for UAVs

1 School of Frontier Interdisciplinary Studies, Hunan University of Technology and Business, Changsha 410205, China
2 School of Intelligent Engineering and Intelligent Manufacturing, Hunan University of Technology and Business, Changsha 410205, China
3 Xiang Jiang Laboratory, Changsha 410205, China
4 School of Artificial Intelligence and Advanced Computing, Hunan University of Technology and Business, Changsha 410205, China
This article belongs to the Special Issue Digital Intelligence Technology and Applications, 2nd Edition

Abstract

To address the failure of unmanned aerial vehicle (UAV) target tracking caused by occlusion and limited field of view in dense low-altitude obstacle environments, this paper proposes a novel framework integrating occlusion-aware modeling and multi-UAV collaboration. A lightweight tracking model based on the Mamba backbone is developed, incorporating a Dilated Wavelet Receptive Field Enhancement Module (DWRFEM) to fuse multi-scale contextual features, significantly mitigating contour fragmentation and feature degradation under severe occlusion. A dual-branch feature optimization architecture is designed, combining the Distilled Tanh Activation with Context (DiTAC) activation function and Kolmogorov–Arnold Network (KAN) bottleneck layers to enhance discriminative feature representation. To overcome the limitations of single-UAV perception, a multi-UAV cooperative system is established. Ray intersection is employed to reduce localization uncertainty, while spherical sampling viewpoints are dynamically generated based on obstacle density. Safe trajectory planning is achieved using a Crested Porcupine Optimizer (CPO). Experiments on the Multi-Drone Multi-Target Tracking (MDMT) dataset demonstrate that the model achieves 84.1% average precision (AP) at 95 Frames Per Second (FPS), striking a favorable balance between speed and accuracy, making it suitable for edge deployment. Field tests with three collaborative UAVs show sustained target coverage in complex environments, outperforming traditional single-UAV approaches. This study provides a systematic solution for robust tracking in challenging low-altitude scenarios.

1. Introduction

Drones are widely employed in various domains, such as power systems, agriculture, disaster relief, logistics, intelligent transportation, and environmental protection [,,]. Their adoption is driven by superior maneuverability, flexible deployment capabilities, and expansive aerial perspectives [,,]. However, during low-altitude flight operations, complex obstacle environments frequently obstruct the drone’s field of view, not only degrading target tracking performance but also posing significant threats to flight safety.
The target tracking mission for drones comprises two distinct phases: target perception and tracking control. During the target perception phase, sensors including visible-light cameras, infrared cameras, and LiDAR collect environmental data [,,]. Leveraging computer vision techniques, this phase accomplishes target detection, recognition, and localization, thereby achieving dynamic perception and state estimation of the target. Single-Object Tracking (SOT), as the core technology of the perception phase, enables continuous localization of specific targets by dynamically modeling appearance variations and motion trajectories [].
During the tracking control phase, flight trajectories and gimbal attitudes are dynamically adjusted based on real-time perception results. Control algorithms including Proportional-Integral-Derivative (PID), Model Predictive Control (MPC), and Linear Quadratic Regulator (LQR) regulate flight velocity and pose to maintain optimal proximity and observational perspective toward the target, ensuring persistent tracking stability [,,]. When executing target tracking missions in low-altitude, obstacle-dense environments, drones frequently lose tracked objects due to targets becoming partially obscured. Although existing single-drone tracking methods can mitigate local occlusion challenges, they remain constrained by the limited perceptual coverage of individual platforms, hindering real-time observational perspective adjustments for highly maneuverable targets [,,]. Particularly during abrupt target maneuvers or within densely cluttered regions, the restricted sensing footprint of single drones inevitably causes targets to persistently fall out of the field of view.
Compared to a single UAV, multi-UAV systems significantly enhance the robustness and continuity of target tracking in complex obstacle-laden environments by leveraging multi-view synchronous perception and collaborative tracking mechanisms.

Key Points of the Article

To address the persistent challenges of target occlusion and limited field-of-view encountered by single drones in low-altitude, obstacle-dense environments, this paper proposes a multi-UAV Collaborative Occlusion-Aware Tracking Method:
  • To address the challenges of contour discontinuity and feature discriminability degradation in heavily occluded environments, an occlusion-robust tracking methodology is proposed. This approach employs a Mamba backbone architecture integrated with a DWRFEM, which amplifies local feature extraction capabilities. Multi-scale contextual information is fused via Wavelet Depthwise Separable Dilated Convolutions (WDSDConv), preserving holistic target contours while enhancing obscured edge detail resolution. Simultaneously, a dual-branch feature refinement framework incorporates DiTAC to elevate nonlinear feature representation, complemented by integrated KAN bottleneck layers for target saliency amplification. This methodology substantially mitigates occlusion-induced feature degradation, enabling sustained tracking robustness under severe occlusion scenarios.
  • To overcome tracking interruptions caused by individual drones’ field-of-view limitations and complete occlusions, a multi-UAV collaborative tracking framework is introduced. First, cooperative localization is established via multi-UAV ray intersection to reduce positioning uncertainty. Subsequently, an adaptive spherical sampling algorithm dynamically generates occlusion-free viewpoint distributions based on obstacle density, ensuring continuous target presence within the collective visual coverage. Finally, flight path optimization integrates the CPO for smooth trajectory generation, while real-time yaw angle adjustments maintain target-centering within the visual field.

3. Occlusion-Robust Target Tracking Methodology

Target occlusion in low-altitude complex obstacle environments frequently induces detection failures and target loss. To address this, we propose an occlusion-robust target tracking methodology based on a Mamba Backbone []. The framework comprises three core components (Figure 1): Mamba Backbone, Bottleneck Layer, and Detection Head. The Mamba Backbone integrates Visual State-Space (VSS) Block modules and Spatial Pyramid Pooling Fast (SPPF) modules. Within each VSS Block, the DWRFEM augments contextual semantic representation through multi-scale feature extraction, strengthening pixel-level feature discrimination. Template and search region features undergo fusion within the Bottleneck Layer, which employs a multi-level pyramid architecture grounded in self-attention mechanisms. This design facilitates hierarchical extraction and integration of global contextual features across spatial resolutions.
Figure 1. Framework of the proposed occlusion-robust target tracking methodology. The red boxes in the figure indicate the tracking detection results.

3.1. Mamba Backbone

The Mamba Backbone constitutes the core architecture of our tracking methodology. It captures long-range spatial dependencies across video frames—including positional and postural variations—enabling state prediction via target templates. This capability enhances tracking stability and temporal coherence while reducing target loss likelihood during transient occlusions. Furthermore, the network’s hierarchical feature extraction integrates local-to-global spatial semantics, improving adaptation to pose deformation, scale variation, and other complex scenarios. These attributes collectively enhance tracking precision and robustness. Architecturally, the Mamba Backbone stacks multiple VSS Blocks with Downsampling modules, culminating in an SPPF module for multi-scale feature fusion (Figure 2).
Figure 2. Mamba backbone network architecture.
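As a structural illustration of this stacking pattern, the following PyTorch sketch alternates placeholder VSS Blocks with downsampling stages and closes with an SPPF-style module; the block counts, channel widths, and the placeholder constructors are assumptions for illustration only, not the paper's exact configuration.

```python
# Minimal structural sketch of the backbone stacking described above.
# Assumptions: block counts, channel widths, and the placeholder block/SPPF
# constructors are illustrative; the real VSS Block and SPPF are defined later.
import torch
import torch.nn as nn

class Downsample(nn.Module):
    """Halve the spatial resolution and change channels with a strided conv."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return self.conv(x)

class MambaBackboneSketch(nn.Module):
    def __init__(self, make_vss_block, make_sppf, widths=(64, 128, 256)):
        super().__init__()
        stages = []
        for c_in, c_out in zip(widths[:-1], widths[1:]):
            stages += [make_vss_block(c_in), Downsample(c_in, c_out)]
        stages += [make_vss_block(widths[-1]), make_sppf(widths[-1])]
        self.stages = nn.Sequential(*stages)

    def forward(self, x):
        return self.stages(x)

# Identity placeholders keep the sketch runnable without the full VSS/SPPF code.
backbone = MambaBackboneSketch(lambda c: nn.Identity(), lambda c: nn.Identity())
features = backbone(torch.randn(1, 64, 256, 256))
```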

3.1.1. VSS Block

The VSS Block serves as the fundamental component of the Mamba backbone network, primarily responsible for visual feature extraction and information propagation. It employs a 2D Selective Scan (SS2D) module to perform multi-directional scanning of input images, capturing contextual information and spatial features (Figure 3). To further enhance the model’s capacity for local feature extraction and contextual integration in visual data, we introduce a DWRFEM based on wavelet convolution []. This module augments depth-wise receptive fields through wavelet transformations, thereby improving feature representation capabilities.
Figure 3. VSS block networks architecture.
The DWRFEM incorporates dual principal pathways (Figure 4). The primary branch performs conventional feature extraction via convolutional operations. The secondary branch decomposes into three parallel sub-branches, each employing WDSDConv. By utilizing distinct dilation rates in the WDSDConv layers, these sub-branches extract features at different spatial granularities, thereby facilitating multi-scale feature fusion.
Figure 4. DWRFEM networks architecture.
The WDSDConv module (Figure 5) substitutes the standard 3 × 3 convolution with a depthwise dilated convolution (DDConv) and a 1 × 1 convolution [41]. DDConv applies a single convolutional filter per input channel with a specified dilation rate $d$. The output for channel $c$ is computed as

$$\hat{Y}_c(x, y) = \sum_{k=1}^{K} \sum_{l=1}^{K} W_c^{(d)}[k, l]\, X_c[x + d \cdot k,\; y + d \cdot l]$$

where $W^{(d)} \in \mathbb{R}^{C_{in} \times K \times K}$ ($C_{in}$ is the number of input channels, and $C_{in} = 1$ in DDConv) is the depthwise kernel, $K$ is the kernel size (default $K = 3$), and $d$ is the dilation rate.
Figure 5. WDSDConv architecture. WT stands for wavelet transform and IWT stands for inverse wavelet transform.
WDSDConv delivers dual improvements: dilated convolution strategically expands the receptive field through kernel sparsity, while depthwise separable convolution reduces computational demands by factorizing operations into depthwise and pointwise components. WDSDConv preserves wavelet transformations’ multi-resolution analysis capabilities while simultaneously enhancing global contextual modeling efficiency and minimizing computational overhead.
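As a hedged sketch of this factorization, the snippet below implements the depthwise dilated convolution from the equation above followed by a 1 × 1 pointwise convolution, with three parallel branches at different dilation rates as in the DWRFEM secondary path; the channel count, the dilation rates (1, 3, 5), and the additive fusion are assumptions, and the wavelet/inverse-wavelet stages are omitted.

```python
# Sketch of the depthwise-separable dilated convolution core of WDSDConv.
# Assumptions: dilation rates (1, 3, 5), 64 channels, and additive fusion are
# illustrative; the wavelet transform (WT/IWT) stages are not shown here.
import torch
import torch.nn as nn

class DepthwiseSeparableDilatedConv(nn.Module):
    def __init__(self, channels, dilation, kernel_size=3):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2          # keep spatial size
        # DDConv: one filter per input channel (groups=channels) with dilation d.
        self.ddconv = nn.Conv2d(channels, channels, kernel_size,
                                padding=pad, dilation=dilation, groups=channels)
        # 1x1 pointwise convolution mixes information across channels.
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.ddconv(x))

# Three parallel sub-branches with distinct dilation rates, fused additively.
branches = nn.ModuleList(DepthwiseSeparableDilatedConv(64, d) for d in (1, 3, 5))
x = torch.randn(1, 64, 32, 32)
fused = sum(branch(x) for branch in branches)
```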

3.1.2. MixPooling SPPF

The SPPF module captures multi-scale features through hierarchical pooling operations, thereby accommodating targets of varying sizes. To enhance generalization performance, we substitute the standard MaxPooling in SPPF with MixPooling [] (Figure 6). This strategic modification significantly improves model robustness when processing diverse input data distributions.
Figure 6. SPPF networks architecture.
MixPooling stochastically selects pooling operators during training, introducing adaptive uncertainty that compels the model to diversify feature representations. This mechanism mitigates over-reliance on specific features by dynamically alternating between maximum and average pooling. The operation is formally defined as
$$P_{ij} = \begin{cases} \max(R_{ij}), & \text{if } \delta = 0 \\ \operatorname{avg}(R_{ij}), & \text{if } \delta = 1 \end{cases}$$

where $R_{ij}$ denotes the elements in the pooling region, and $\delta$ is a stochastic selector determining the pooling modality.
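A minimal sketch of this stochastic selection follows, assuming a Bernoulli selector with probability 0.5 during training and deterministic average pooling at inference; the kernel size, stride, and inference-time choice are assumptions rather than the paper's exact settings.

```python
# Sketch of MixPooling: a stochastic selector delta chooses max- or average-
# pooling per forward pass during training.
# Assumptions: 0.5 selection probability, 5x5 kernel, and average pooling at
# inference time are illustrative defaults.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixPool2d(nn.Module):
    def __init__(self, kernel_size=5, stride=1, padding=2, p_max=0.5):
        super().__init__()
        self.k, self.s, self.pad, self.p_max = kernel_size, stride, padding, p_max

    def forward(self, x):
        if self.training:
            use_max = torch.rand(()) < self.p_max      # stochastic selector delta
            pool = F.max_pool2d if use_max else F.avg_pool2d
        else:
            pool = F.avg_pool2d                        # deterministic at inference
        return pool(x, self.k, self.s, self.pad)
```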

3.2. Bottleneck Layer

The Bottleneck Layer employs a multi-layer pyramidal architecture based on self-attention mechanisms (Figure 7) []. Its primary function involves performing deep global feature extraction and fusion between template image features and search region features, thereby generating more discriminative fused representations.
Figure 7. Bottleneck layer networks architecture.

3.2.1. Feature Extraction with MHA

Self-attention mechanisms have gained widespread adoption in computer vision for their capacity to capture long-range dependencies. By computing cross-position feature correlations, they enhance feature representational power through contextual integration. Within the bottleneck layer architecture, this capability is primarily implemented via two core modules: the Multi-Head Attention (MHA) and Scaled Attention (SA) mechanisms. The MHA operation is formally defined as
$$\operatorname{Attn}(Q, K, V, B) = \operatorname{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}} + B\right) V$$

$$H_i = \operatorname{Hardswish}\!\left(\operatorname{Attn}\!\left(X_{input} W_i^{Q},\, X_{input} W_i^{K},\, X_{input} W_i^{V},\, B_i\right)\right)$$

$$\operatorname{MHA}(X_{input}) = \operatorname{Concat}(H_1, H_2, \ldots, H_N)\, W^{O}$$

where $X_{input}$ denotes the input features, $B_i$ represents the positional bias, $N$ specifies the number of attention heads, and $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, $W^{O}$ correspond to learnable weight matrices.
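The following PyTorch sketch shows bias-augmented multi-head attention in the form of the equations above, with the activation applied to each head's output; the head count, sequence length, shape of the learnable bias table, and the Hardswish placement are assumptions for illustration (the DiTAC substitution described next is not shown).

```python
# Sketch of multi-head attention with an additive positional bias B_i and a
# Hardswish activation on each head's output, mirroring the equations above.
# Assumptions: head count, sequence length, and the full (heads, seq, seq) bias
# table are illustrative choices, not the paper's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiasedMHA(nn.Module):
    def __init__(self, dim, num_heads, seq_len):
        super().__init__()
        self.h, self.dk = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)                 # W^Q, W^K, W^V stacked
        self.proj = nn.Linear(dim, dim)                    # W^O
        self.bias = nn.Parameter(torch.zeros(num_heads, seq_len, seq_len))  # B_i
        self.act = nn.Hardswish()

    def forward(self, x):                                  # x: (batch, seq, dim)
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.h, self.dk).transpose(1, 2) for t in (q, k, v))
        attn = F.softmax(q @ k.transpose(-2, -1) / self.dk ** 0.5 + self.bias, dim=-1)
        heads = self.act(attn @ v)                         # H_i = Hardswish(Attn(...))
        return self.proj(heads.transpose(1, 2).reshape(b, n, -1))
```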
To enhance model performance, we refine both SA and MHA modules by substituting Hardswish activations with the DiTAC function []:
$$\operatorname{DiTAC}(x) = \tilde{x} \cdot F(x)$$

$$\tilde{x} = \begin{cases} T_{\theta}(x), & \text{if } a \le x \le b \\ x, & \text{otherwise} \end{cases}$$

where $x$ denotes the input, $F(x)$ represents the cumulative distribution function of the standard normal distribution, and $T_{\theta}(x)$ constitutes a learnable diffeomorphic transformation defined on the domain $[a, b]$. The DiTAC activation function, as a trainable activation mechanism based on diffeomorphic transformations, provides richer nonlinear characteristics compared to Hardswish. By applying diffeomorphic transformations across varying domains, DiTAC adaptively reconfigures its activation profile to better accommodate diverse feature distributions. This modification not only enhances the model’s feature representational capacities but also effectively mitigates overfitting while improving generalization performance.

3.2.2. Feature Enhancement and Fusion with KAN

To strengthen nonlinear fitting capabilities, we implement KAN [] within the Bottleneck Layer as a substitute for conventional Multi-Layer Perceptron (MLP) modules.
MLP modules utilize fixed activation functions to perform nonlinear transformations at each node. Their output is obtained by applying activation functions to the product of inputs and weight matrices (Figure 8a):
$$\operatorname{MLP}(x) = W_2\, \sigma(W_1 x)$$

where $W_1$ and $W_2$ are linear weight matrices, and $\sigma$ denotes a fixed activation function. In contrast, KAN places learnable activation functions on the edges (weights), enabling dynamic adaptation to data distributions (Figure 8b):

$$\operatorname{KAN}(x) = \psi_2(\psi_1(x))$$

where $\psi_1$ and $\psi_2$ are function matrices comprising learnable activation functions, each parameterized by spline curves. We adopt cubic B-splines (degree $= 3$) and incorporate a base SiLU activation function for stable training:

$$\psi(x) = \operatorname{SiLU}(x) + \sum_{g=1}^{G} \varepsilon_g B_g(x)$$

where $B_g$ denotes the $g$-th B-spline basis function of a fixed order, $\varepsilon_g$ are the trainable coefficients that define the shape of the activation function, and $G$ is the number of basis functions (we set $G = 5$). By incorporating the KAN module, the model enhances its nonlinear fitting capabilities while maintaining computational efficiency.
Figure 8. KAN architecture.
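A hedged sketch of one such KAN layer follows: each edge activation takes the form $\psi(x) = \operatorname{SiLU}(x) + \sum_g \varepsilon_g B_g(x)$ with cubic B-splines and $G = 5$, computed via the Cox–de Boor recursion; the knot range, initialization scale, and per-edge base weights are assumptions rather than the paper's exact implementation.

```python
# Sketch of a KAN layer whose edge activations follow
# psi(x) = SiLU(x) + sum_g eps_g * B_g(x), with cubic B-splines and G = 5.
# Assumptions: the knot range, initialization scale, and per-edge base weights
# are illustrative; this is not the paper's exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def bspline_basis(x, grid, k=3):
    """Cox-de Boor recursion. x: (N, in_dim), grid: (G + k + 1,) knots.
    Returns basis values of shape (N, in_dim, G)."""
    x = x.unsqueeze(-1)
    b = ((x >= grid[:-1]) & (x < grid[1:])).to(x.dtype)
    for d in range(1, k + 1):
        b = ((x - grid[:-(d + 1)]) / (grid[d:-1] - grid[:-(d + 1)]) * b[..., :-1]
             + (grid[d + 1:] - x) / (grid[d + 1:] - grid[1:-d]) * b[..., 1:])
    return b

class KANLinear(nn.Module):
    def __init__(self, in_dim, out_dim, G=5, k=3):
        super().__init__()
        # G basis functions of degree k require G + k + 1 knots.
        self.register_buffer("grid", torch.linspace(-1.5, 1.5, G + k + 1))
        self.base_weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.1)
        self.spline_coef = nn.Parameter(torch.randn(out_dim, in_dim, G) * 0.1)  # eps_g
        self.k = k

    def forward(self, x):                                   # x: (N, in_dim)
        base = F.silu(x) @ self.base_weight.T               # SiLU base term per edge
        basis = bspline_basis(x, self.grid, self.k)         # (N, in_dim, G)
        spline = torch.einsum("nig,oig->no", basis, self.spline_coef)
        return base + spline
```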

4. Multi-Perspective Collaborative Tracking with Multi-UAV

4.1. Collaborative Target Localization with Multi-UAV

This paper employs multi-UAV collaborative localization to determine target positions. Given known UAV positions, the relative angle between the target and each UAV is determined from the bounding box centroid in the first-view perspective and the camera model. Combining UAV position, orientation, and this relative angle yields the ray $l_k$ from the target to UAV $k$ in the world coordinate system. For any two UAVs $k_1$ and $k_2$, the point $p_{k_1 k_2}$ minimizes the summed distances from $p_{k_1 k_2}$ to rays $l_{k_1}$ and $l_{k_2}$:

$$\min_{p_{k_1 k_2}} \; D(l_{k_1}, p_{k_1 k_2}) + D(l_{k_2}, p_{k_1 k_2})$$

where $D(l, p)$ represents the distance from observation position $p$ to ray $l$. This is geometrically equivalent to finding the ‘midpoint’ of the shortest segment connecting the two non-coplanar rays, providing a best-fit intersection point for this UAV pair. The final target position is then determined by fusing all such pairwise estimates:

$$\min_{p_{target}} \sum_{a, b \in \{1, 2, \ldots, m\},\; a < b} D(p_{target}, p_{k_a k_b})$$

where $D(p_1, p_2)$ denotes the distance between two points, and $m$ represents the number of UAVs. This method, known as finding the geometric median, enhances robustness against outliers and measurement noise from any single UAV. For target trajectory prediction, B-spline curves are employed to fit historical observations of target positions, and a Kalman filtering algorithm is utilized to obtain future waypoints.
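The following NumPy sketch illustrates both steps under the assumption that each ray is given by an origin and a direction in world coordinates: the closest-approach midpoint between two rays, and a simple Weiszfeld iteration for the geometric median of the pairwise estimates; the function names and iteration count are illustrative.

```python
# Sketch of the pairwise ray-intersection step and the geometric-median fusion.
# Assumptions: rays are given as (origin, direction) in world coordinates; the
# Weiszfeld iteration count and epsilon guard are illustrative choices.
import numpy as np

def ray_midpoint(o1, d1, o2, d2):
    """Midpoint of the shortest segment between rays o1 + t*d1 and o2 + s*d2."""
    d1, d2 = d1 / np.linalg.norm(d1), d2 / np.linalg.norm(d2)
    w = o1 - o2
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ w, d2 @ w
    denom = a * c - b * b
    if abs(denom) < 1e-9:                      # nearly parallel rays
        t, s = 0.0, e / c
    else:
        t = (b * e - c * d) / denom
        s = (a * e - b * d) / denom
    return 0.5 * ((o1 + t * d1) + (o2 + s * d2))

def geometric_median(points, iters=50):
    """Weiszfeld iteration: the point minimizing the summed distance to all estimates."""
    p = points.mean(axis=0)
    for _ in range(iters):
        dist = np.linalg.norm(points - p, axis=1) + 1e-9
        p = (points / dist[:, None]).sum(axis=0) / (1.0 / dist).sum()
    return p
```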

4.2. Spherical Visibility Sampling with Viewpoint Optimization

In dynamic environments, target motion exhibits high randomness and unpredictability. Abrupt directional changes may cause traditional tracking methods to lose targets. To address this challenge, the multi-UAV system must acquire all observable regions surrounding the target in real-time, enabling dynamic perception and comprehensive coverage. This paper presents an obstacle-aware visible region generation method based on spherical sampling to determine multi-UAV observation points.
Given known map and obstacle data, a sampling sphere $S$ centered at the target position $p_{target} \in \mathbb{R}^3$ is constructed. Its radius $r$ is determined by the UAV’s maximum observation range and an environmental safety margin:

$$r = \min\!\left(d_{\max},\; \min_{o \in O_{obs}} \left\lVert p_{target} - o \right\rVert_2 - \delta_{safe}\right)$$

where $d_{\max}$ denotes the predefined maximum observation distance, $O_{obs}$ represents the obstacle set, and $\delta_{safe}$ is the safety buffer distance (defaulting to $\delta_{safe} = 0.5$ m). Equation (13) dynamically defines a safety-aware sphere that ensures both observation validity and flight safety. The spherical sampling point set $\{s_i\}_{i=1}^{N_s}$ is generated via spherical coordinate parameterization:

$$s_i = p_{target} + r \begin{bmatrix} \sin\theta_i \cos\phi_i \\ \sin\theta_i \sin\phi_i \\ \cos\theta_i \end{bmatrix}, \quad \phi_i \in [0, 2\pi],\; \theta_i \in [0, \pi]$$

where $\phi_i$ is the azimuth angle and $\theta_i$ denotes the zenith angle. Equation (14) generates a uniform set of candidate viewpoints on the surface of the aforementioned sphere. The number of sampling points $N_s$ is adaptively adjusted based on environmental complexity, which is quantitatively measured by the density of obstacles within the sphere:

$$N_s = N_{base}\left(1 + \eta\, \frac{\left|O_{obs} \cap B(p_{target}, r)\right|}{4\pi r^2 / \Delta A_{grid}}\right)$$

where $N_{base}$ is the base sampling count (defaulting to $N_{base} = 100$), $\left|O_{obs} \cap B(p_{target}, r)\right|$ denotes the number of grid cells occupied by obstacles within the sphere, $\Delta A_{grid}$ represents the grid map resolution, and $\eta$ is the density sensitivity coefficient ($\eta = 0.5$). Equation (15) enables the algorithm to dynamically adapt the sampling density, allocating more computational resources for fine-grained sensing in complex environments.
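The sketch below follows Equations (13)–(15) under the assumption that obstacles are represented as an array of occupied grid-cell centers; the uniform random parameterization of the sphere and the default parameter values mirror the text, while the helper names are illustrative.

```python
# Sketch of Equations (13)-(15): safety-aware radius, adaptive sample count,
# and spherical viewpoint sampling.
# Assumptions: obstacles are an (M, 3) array of occupied grid-cell centers and
# the helper names are illustrative; defaults follow the text.
import numpy as np

def sampling_radius(p_target, obstacles, d_max, delta_safe=0.5):
    """Equation (13): clip the sphere radius by the nearest obstacle minus a margin."""
    if len(obstacles) == 0:
        return d_max
    nearest = np.min(np.linalg.norm(obstacles - p_target, axis=1))
    return min(d_max, nearest - delta_safe)

def adaptive_sample_count(p_target, obstacles, r, grid_res, n_base=100, eta=0.5):
    """Equation (15): scale the base count by the obstacle density inside the sphere."""
    if len(obstacles) == 0:
        return n_base
    occupied = np.linalg.norm(obstacles - p_target, axis=1) <= r
    density = occupied.sum() / (4.0 * np.pi * r ** 2 / grid_res)
    return int(n_base * (1.0 + eta * density))

def sample_sphere(p_target, r, n, rng=np.random.default_rng()):
    """Equation (14): candidate viewpoints on the sphere surface (area-uniform)."""
    phi = rng.uniform(0.0, 2.0 * np.pi, n)          # azimuth
    theta = np.arccos(rng.uniform(-1.0, 1.0, n))    # zenith
    dirs = np.stack([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)], axis=1)
    return p_target + r * dirs
```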

4.2.1. Line-of-Sight Reachability Detection

For each sampling point $s_i$, collision detection is performed between the line segment $\overline{p_{target}\, s_i}$ and the obstacles:

$$\operatorname{Visible}(s_i) = \begin{cases} 1, & \text{if } \forall r \in \overline{p_{target}\, s_i},\; r \notin O_{obs} \\ 0, & \text{otherwise} \end{cases}$$

where $O_{obs}$ represents the obstacle set. Equation (16) checks if the entire segment between the target and the candidate viewpoint lies in free space, ensuring an unobstructed view for observation. Line-of-sight reachability is determined via ray-triangle intersection detection:

$$r(t) = p_{target} + t\,(s_i - p_{target}), \quad t \in [0, 1]$$
Occlusion is determined if the ray intersects any obstacle triangle patch. Equation (17) parameterizes the continuous line segment from the target point to the sampling point as a ray, forming a computable mathematical path for collision detection algorithms.
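The sketch below follows Equations (16)–(17) under the assumption that an occupancy query is available for the obstacle map; dense sampling of the segment stands in for exact ray-triangle intersection tests, and the helper name is hypothetical.

```python
# Sketch of the line-of-sight test in Equations (16)-(17): sample the segment
# from the target to the candidate viewpoint and query an occupancy map.
# Assumptions: `occupied(point)` is a hypothetical wrapper around the obstacle
# map, and dense sampling stands in for exact ray-triangle intersection tests.
import numpy as np

def visible(p_target, s_i, occupied, step=0.1):
    """Return True if the segment p_target -> s_i crosses no occupied cell."""
    length = np.linalg.norm(s_i - p_target)
    n_steps = max(int(length / step), 1)
    for t in np.linspace(0.0, 1.0, n_steps + 1):
        r_t = p_target + t * (s_i - p_target)   # Equation (17), sampled at t
        if occupied(r_t):
            return False
    return True
```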

4.2.2. Visibility Volume Fusion

The visible sampling point set $\{s_i^{vis}\}$ is projected onto the tangent plane at the target point (with normal vector $n_t$ aligned with the gravity direction), generating a two-dimensional point set $\{v_i\}_{i=1}^{M}$:

$$v_i = s_i^{vis} - \left[\left(s_i^{vis} - p_{target}\right) \cdot n_t\right] n_t$$

The convex hull $H$ of the point set $\{v_i\}$ is computed, and its boundary vertex sequence $\{V_k\}_{k=1}^{K}$ defines the visible region boundary. These convex hull vertices are converted to polar coordinates $(\rho_k, \alpha_k)$, with visibility sectors generated by coalescing continuous angular intervals:

$$\Omega_j = \left\{\alpha \;\middle|\; \alpha_j^{start} \le \alpha \le \alpha_j^{end}\right\}, \quad j = 1, \ldots, L$$

The size of each region is quantified by its central angle $\Delta\alpha_j = \alpha_j^{end} - \alpha_j^{start}$.
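The sketch below strings these steps together under several assumptions: scipy's ConvexHull stands in for the hull computation, the tangent-plane normal is the world z-axis, and the angular gap used to merge hull-vertex angles into sectors is an illustrative threshold.

```python
# Sketch of the visibility-fusion step: project visible samples onto the
# horizontal tangent plane, take the convex hull, convert its vertices to polar
# angles, and merge nearly contiguous angles into sectors.
# Assumptions: scipy's ConvexHull stands in for the hull step, n_t is the world
# z-axis, and the 15-degree merge gap is an illustrative threshold.
import numpy as np
from scipy.spatial import ConvexHull

def visibility_sectors(visible_pts, p_target,
                       n_t=np.array([0.0, 0.0, 1.0]), gap=np.deg2rad(15.0)):
    rel = visible_pts - p_target
    proj = visible_pts - (rel @ n_t)[:, None] * n_t        # projection formula above
    xy = (proj - p_target)[:, :2]                          # 2D coordinates in the plane
    hull = ConvexHull(xy)
    alpha = np.sort(np.arctan2(xy[hull.vertices, 1], xy[hull.vertices, 0]))
    sectors, start = [], alpha[0]
    for a_prev, a_next in zip(alpha[:-1], alpha[1:]):
        if a_next - a_prev > gap:                          # break between sectors
            sectors.append((start, a_prev))
            start = a_next
    sectors.append((start, alpha[-1]))
    return sectors                                         # [(alpha_start, alpha_end), ...]
```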

4.2.3. Adaptive Sampling Optimization

To enhance computational efficiency, this paper introduces an importance sampling strategy. We first reduce sampling density in obstacle-dense directions by defining a directional weighting function:
$$w(\phi, \theta) = \exp\!\left(\lambda\, d_{obs}(\phi, \theta)\right)$$

where $d_{obs}(\phi, \theta)$ denotes the Euclidean distance to the nearest obstacle in direction $(\phi, \theta)$, and $\lambda$ is the attenuation coefficient (default $\lambda = 0.1$). The sampling probability is adjusted as follows:

$$P(\phi_i, \theta_i) = \frac{w(\phi_i, \theta_i)}{\sum_{j=1}^{N} w(\phi_j, \theta_j)}$$
Sampling points are selected via roulette wheel selection to avoid inefficient resource allocation caused by uniform sampling.
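The sketch below implements this weighting and roulette-wheel (fitness-proportionate) selection, assuming the nearest-obstacle distance has already been computed for each candidate direction; the function name and random-generator handling are illustrative.

```python
# Sketch of the importance-sampling step: directional weights from the nearest-
# obstacle distance, normalized into probabilities, then roulette-wheel selection.
# Assumptions: d_obs is precomputed per candidate direction; lam follows the
# default in the text, and the function name is illustrative.
import numpy as np

def select_directions(d_obs, n_select, lam=0.1, rng=np.random.default_rng()):
    """d_obs: (N,) nearest-obstacle distances, one per candidate direction."""
    w = np.exp(lam * d_obs)               # directional weights (open directions favored)
    p = w / w.sum()                       # normalized sampling probabilities
    cdf = np.cumsum(p)
    picks = np.searchsorted(cdf, rng.random(n_select))   # roulette wheel selection
    return picks                          # indices of the selected directions
```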
The spherical-sampling-based visible region generation and viewpoint optimization algorithm is implemented in Algorithm 1.
Algorithm 1: Spherical-sampling-based visible region generation and viewpoint optimization
Input: target position $p_{target}$, obstacle set $O_{obs}$, maximum observation distance $d_{max}$, UAV count $m$, base sampling count $N_{base}$, density sensitivity coefficient $\eta$, grid map resolution $\Delta A_{grid}$.
Output: observation point collection $p_{obv}$.
1: Initialize the visible point set $V = \emptyset$;
2: Compute the adaptive sample count $N_s$;
3: FOR $i = 1$ to $N_s$ DO
4:   Generate sampling point $s_i$ on the sphere centered at $p_{target}$ via spherical parameterization;
5:   IF the ray $\overline{p_{target}\, s_i}$ is unobstructed THEN
6:     Add $s_i$ to $V$;
7:   END IF
8: END FOR
9: Project $V$ onto the tangent plane at $p_{target}$ to obtain the 2D point set $\{v_i\}_{i=1}^{M}$;
10: Compute the convex hull of $\{v_i\}$;
11: Merge the angular intervals of the convex hull vertices to generate the visible regions $\Omega_j$;
12: IF $m \le$ number of visible regions THEN
13:   Generate observation points at the centroids of $\Omega_j$ and store them in $p_{obv}$;
14: ELSE
15:   Use Particle Swarm Optimization (PSO) to allocate $m$ observation points within $\Omega_j$ and store them in $p_{obv}$;
16: END IF
17: RETURN $p_{obv}$;

4.3. Multi-UAV Trajectory Planning

Based on the multiple observation points generated in Section 4.2, this paper employs linear programming to assign each observation point to a specific UAV. The optimization objectives include: (1) minimizing the total path cost of all UAVs; (2) ensuring flight altitude compliance with safety constraints to avoid collisions with terrain or obstacles; (3) maintaining minimum safety distances between UAVs. Formally, the multi-UAV waypoint assignment objective is expressed as

$$J(X_{path}) = \sum_{i=1}^{m} \sum_{k} x_{ik} \left(w_1 F_{path} + w_2 F_{height}\right) + \eta\, \operatorname{Violation}(P)$$

$$F_{path} = \sum_{k=1}^{N-1} D\!\left(u_{i,k+1}, u_{i,k}\right)$$

$$F_{height} = \sum_{k=1}^{N} \max\!\left(0,\; h_{\min} - z_{i,k},\; z_{i,k} - h_{\max}\right)^2$$

$$D(u_i, u_j) \ge d_{safe}, \quad i \ne j$$

$$\text{s.t.} \quad \sum_{k=1}^{n} x_{ik} = 1 \;\; \forall i, \qquad \sum_{i=1}^{n} x_{ik} = 1 \;\; \forall k, \qquad x_{ik} \in \{0, 1\} \;\; \forall i, k$$

where $X_{path}$ is the path parameter matrix, $\eta$ denotes the constraint penalty coefficient, $w_1$ and $w_2$ are weighting parameters, $F_{path}$ represents the path length cost (i.e., the Euclidean distance between two points), and $F_{height}$ is the flight altitude cost; $x_{ik}$ constitutes the decision variable, where $x_{ik} = 1$ indicates that start point $i$ is paired with target point $k$, and $x_{ik} = 0$ otherwise. Equations (26)–(28) ensure a one-to-one correspondence between UAVs and observation points.
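Because the constraints enforce a one-to-one pairing, the assignment reduces to a linear assignment problem; the sketch below solves it with scipy's Hungarian solver over a combined path/height cost matrix, where the weights, the altitude limits, and the simplified per-pair costs standing in for $F_{path}$ and $F_{height}$ are assumptions.

```python
# Sketch of the UAV-to-viewpoint assignment: the one-to-one constraints above
# reduce to a linear assignment problem, solved here with the Hungarian method.
# Assumptions: the weights, altitude limits, and the simplified per-pair costs
# standing in for F_path and F_height are illustrative.
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_viewpoints(uav_pos, viewpoints, w1=1.0, w2=0.5, h_min=2.0, h_max=30.0):
    """uav_pos, viewpoints: (m, 3) arrays; returns (uav_index, viewpoint_index) pairs."""
    # Path cost: straight-line distance from each UAV to each candidate viewpoint.
    f_path = np.linalg.norm(uav_pos[:, None, :] - viewpoints[None, :, :], axis=2)
    # Height cost: squared violation of the [h_min, h_max] altitude band.
    z = viewpoints[:, 2]
    f_height = np.maximum(0.0, np.maximum(h_min - z, z - h_max)) ** 2
    cost = w1 * f_path + w2 * f_height[None, :]          # (m, m) cost matrix
    rows, cols = linear_sum_assignment(cost)             # enforces one-to-one pairing
    return list(zip(rows, cols))
```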
Subsequently, this paper employs a multi-UAV path planning algorithm based on the CPO to assign each UAV to a corresponding observation point. The multi-UAV path planning algorithm using CPO is presented in Algorithm 2.
Algorithm 2: Multi-UAV path planning via the crested porcupine optimizer
Input: start point set $\{p_i^{start}\}$, observation point set $\{p_i^{obv}\}$, threat model $O_{obs}$, weight vector $w$
Output: optimal path set $p$
1: Initialize the crested porcupine path parameter population $P = \{X_s\}_{s=1}^{S}$;
2: FOR iter = 1 to MaxIter DO
3:   FOR each individual $X_s$ in $P$ DO
4:     Compute path $P_s$ via Dubins($X_s$);
5:     Calculate cost: $J_s = \sum_{i=1}^{m} J(p_s^i)$;
6:     Apply constraint penalty: $J_s = J_s + \eta\, \operatorname{Violation}(P_s)$ ($\eta$: penalty coefficient);
7:   END FOR
8:   Determine leader: $X_{leader} = \arg\min_s J_s$;
9:   FOR each individual $X_s$ in $P$ DO
10:     IF rand() < $P_{explore}$ THEN
11:       Random exploration: $X_s = X_s + \alpha\,(X_{rand} - X_s)$ ($\alpha$: step size, $X_{rand}$: random individual);
12:     ELSE
13:       Threat-driven adjustment: $X_s = X_s + \beta\,(X_{leader} - X_s) + \gamma\, \Delta X_{threat}$ ($\beta$ and $\gamma$ are weights; $\Delta X_{threat}$ is derived from threat intelligence);
14:     END IF
15:   END FOR
16:   Update population $P$: combine the incumbent population and newly generated individuals;
17: END FOR
18: RETURN the leader solution $X_{leader}$ (converted to the path set $p$);
B-spline curves are subsequently employed to fit the optimized waypoints, ensuring kinematic feasibility:
$$p(\mu) = \sum_{i=0}^{n} c_i\, B_{i,k}(\mu), \quad \mu \in [0, 1]$$

where $B_{i,k}(\mu)$ denotes the $k$-th order B-spline basis function, and $c_i$ represents the control points.
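As a hedged sketch of this smoothing step, the snippet below fits a cubic B-spline through the optimized waypoints and samples it densely over $\mu \in [0, 1]$; scipy's splprep/splev routines and the sample count stand in for the paper's own fitting procedure.

```python
# Sketch of the final smoothing step: fit a cubic B-spline through the optimized
# waypoints and sample it over mu in [0, 1].
# Assumptions: scipy's splprep/splev and the sample count stand in for the
# paper's own B-spline fitting procedure.
import numpy as np
from scipy.interpolate import splprep, splev

def smooth_trajectory(waypoints, n_samples=200, degree=3, smoothing=0.0):
    """waypoints: (N, 3) array of optimized path points; returns (n_samples, 3)."""
    tck, _ = splprep(waypoints.T, k=degree, s=smoothing)   # knots and control points c_i
    mu = np.linspace(0.0, 1.0, n_samples)                  # curve parameter mu
    return np.stack(splev(mu, tck), axis=1)
```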

5. Experiments

5.1. Evaluation of Occlusion-Robust Target Tracking Methodology

This paper conducts algorithm training and model testing on the MDMT dataset []. The dataset contains data collected by two UAVs at varying flight altitudes and viewpoints, focusing on vehicle targets. MDMT comprises 44 video sequence pairs totaling 39,678 frames, with 11,454 distinct IDs (pedestrians, bicycles, and cars) and 2,204,620 bounding boxes—543,444 of which contain occluded targets. The dataset partitions these sequences into 25 pairs for training, 14 for testing, and 5 for validation. Although designed for multi-object tracking, MDMT also supports single-object tracking research. Crucially, its multi-perspective data provides rich information that enhances model capability in recognizing occluded targets, thereby supporting multi-UAV collaborative tracking tasks.
Experiments were conducted on Rocky Linux 8.9 with two NVIDIA A100-SXM4 GPUs (40GB VRAM each), manufactured by NVIDIA Corporation, headquartered in Santa Clara, California, USA. The network model was built using the PyTorch framework version 2.4.0. The input size of target templates was resized to 128 × 128 pixels, while the search region was resized to 256 × 256 pixels. Given the absence of abrupt motion changes in targets, the search region for the next frame was constrained to an area five times the bounding box size centered at the previous target position, without exceeding image boundaries. Training parameters were configured as follows: batch size 16, total epochs 500, learning rate 0.0001, weight decay 0.0001, and Adam optimization strategy.
The AP is adopted as the key metric to comprehensively evaluate the localization accuracy of the proposed model. This metric is computed as the arithmetic mean of precision values obtained at 100 different Intersection over Union (IoU) thresholds, ranging from 0.00 to 1.00 with a step size of 0.01. This approach mitigates the arbitrariness of selecting a single threshold and provides a more robust assessment of model performance across various levels of localization strictness. The AP is calculated as follows:
$$AP = \frac{1}{100} \sum_{i=1}^{100} P_i$$

where $P_i$ represents the precision at the $i$-th IoU threshold.
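A minimal sketch of this metric follows, assuming `ious` holds the per-frame IoU between predicted and ground-truth boxes and that precision at a threshold is the fraction of frames whose IoU meets it; the 100 thresholds are taken as 0.00 through 0.99 in 0.01 steps.

```python
# Sketch of the AP metric defined above.
# Assumptions: `ious` holds per-frame IoU between predicted and ground-truth
# boxes, precision at a threshold is the fraction of frames meeting it, and the
# 100 thresholds are 0.00 through 0.99 in 0.01 steps.
import numpy as np

def average_precision(ious):
    thresholds = np.arange(0.00, 1.00, 0.01)                  # 100 IoU thresholds
    precisions = [(ious >= t).mean() for t in thresholds]     # P_i per threshold
    return float(np.mean(precisions))
```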
Using our proposed model as the baseline, we designed five ablation studies. We first replaced the backbone network with ResNet18 and ResNet50 for comparison (Table 1). The results show that the Mamba backbone adopted in this work significantly improves accuracy by 5.0 percentage points compared to ResNet18, while maintaining higher processing speed. When compared to ResNet50, it achieves a slight accuracy improvement of 0.3 percentage points, along with a remarkable 171.4% increase in FPS. These experimental results demonstrate the advantage of our Mamba backbone in modeling long-range spatial dependencies.
Table 1. Ablation results of backbone networks.
According to the ablation results shown in Table 2, replacing the proposed DWRFEM with standard DWConv in the VSS Block leads to a noticeable performance-efficiency trade-off. While DWConv achieves higher computational efficiency at 103 FPS, it results in a reduction of 3.0 percentage points in AP. These results demonstrate the effectiveness of the DWRFEM in enhancing feature representation capabilities. Although introducing additional computational overhead, the module improves tracking accuracy by capturing features at different spatial granularities and expanding the receptive field through wavelet transformation.
Table 2. Ablation results of the VSS block.
According to the ablation experimental results of the MixPooling SPPF shown in Table 3, the model achieved an average precision of 82.4% when the SPPF module was removed (None). After introducing the hybrid pooling SPPF module, the precision increased by 1.7 percentage points. The results indicate that the multi-scale feature extraction path constructed by SPPF enhances the model’s robustness in handling targets with scale variations, demonstrating the effectiveness of this module in improving feature representation capabilities.
Table 3. Ablation results of the mixpooling SPPF.
According to the ablation experimental results of activation functions shown in Table 4, adopting the DiTAC activation function led to a 1.2 percentage point increase in AP compared to using the Hardswish activation function. The results indicate that, compared to Hardswish, DiTAC more effectively models multi-scale feature dependencies, enhances the model’s adaptability to complex feature distributions, and ultimately improves both feature representation capability and generalization performance while maintaining computational efficiency.
Table 4. Ablation results of activation functions.
According to the ablation experimental results of the bottleneck layer architecture shown in Table 5, adopting the KAN module led to a 0.9 percentage point improvement compared to the traditional MLP structure. This enhancement can be attributed to KAN’s learnable activation function design, which enables adaptive fitting to data distributions and strengthens the model’s nonlinear representation capabilities.
Table 5. Ablation results of bottleneck architectures.
Deep learning-based object detection and tracking models often suffer from high parameter counts and insufficient real-time performance, particularly on mobile devices such as UAVs. For comparative experiments, this paper exclusively considers algorithms with compact model sizes and excellent real-time capability. Evaluation results on the MDMT validation set (Figure 9) demonstrate that our method achieves higher success rates than DiMP18 (DiMP with ResNet18 backbone), though marginally underperforming the E.T.Tracker, TransT-N2 and LoRAT-L-224.
Figure 9. Threshold-precision curve on MDMT test set.
Comparative experimental results are presented in Figure 10. Compared to current mainstream real-time tracking methods, the proposed approach does not achieve the best score on any single metric, but it exhibits an outstanding overall trade-off between accuracy and speed. Specifically, compared to the E.T.Tracker method, our approach achieves a 111% frame rate improvement with only a 0.7% loss in AP, increasing the processing speed significantly from 45 FPS to approximately 95 FPS. When compared to Transformer-based TransT methods, the Mamba architecture adopted in our method effectively avoids the quadratic computational complexity of Transformers, providing substantial advantages in computational efficiency. Although there is a slight deficiency in detection accuracy, the significantly improved inference speed makes it particularly suitable for deployment on resource-constrained edge devices. This balance between accuracy and efficiency holds important practical value for edge computing applications.
Figure 10. Comparative Experimental Results.
To evaluate the model’s robustness, we designed an occlusion experiment (Table 6). The results show that our model achieved an AP of 82.9%, which is 1.7 percentage points lower than the top-performing LoRAT-L-224 model but significantly outperforms traditional methods such as KCF and DiMP18. This outcome demonstrates that our model exhibits strong robustness in handling occlusion challenges.
Table 6. Model performance comparison under occlusion.
To validate the real-time performance and computational resource utilization of the proposed method on edge devices, a Sophgo SE5 edge computing box was deployed at a DJI Airport. The SE5 features an octa-core ARM A53 processor operating at 2.3 GHz, 12 GB RAM, 32 GB eMMC storage, and supports simultaneous 16-channel HD video decoding with intelligent analysis. Additional capabilities include hardware decoding for 38-channel 1080p HD video and 2-channel encoding. The model was first converted to the bmodel format required by Sophgo edge devices, then deployed for object detection and tracking. Resource utilization metrics (Table 7) show our method achieves 39 FPS with 32.8% TPU utilization, demonstrating real-time operation with low computational overhead on the SE5 platform.
Table 7. Resource Utilization.
To further validate the method’s effectiveness in real-world complex scenarios, field tests were conducted using cameras mounted on DJI Dock UAV platforms in low-altitude environments with dense vegetation. During missions, HD video footage captured by UAVs was transmitted in real-time to the dock base station. The dock system then transferred the video stream via internal networks to the deployed Sophgo SE5 edge computing box, which executed the proposed method for real-time processing. Experimental results (Figure 11) demonstrate that our algorithm maintains robust target tracking even in heavily occluded environments with dense foliage cover.
Figure 11. Flight experiments.
To further validate detection and tracking performance under partial occlusion, qualitative comparative experiments were conducted as shown in Figure 12. The KCF algorithm fails to track targets when occlusion exceeds 50%, whereas our method, E.T.Tracker, and DiMP18 maintain stable tracking under such conditions.
Figure 12. Qualitative comparative results. The red boxes in the figure indicate the tracking detection results.

5.2. Evaluation of Multi-UAV Collaborative Tracking Methodology

To validate the efficacy of the proposed multi-UAV collaborative target localization method, three UAVs were statically deployed within a two-dimensional plane spanning −6 m to 6 m along both X- and Y-axes. Each UAV is equipped with GPS modules providing μs-level timestamp synchronization through PPS signals, ensuring coordinated perception across the swarm. An unmanned ground vehicle (UGV) traversing linearly from start to endpoint served as the target. Experimental results (Figure 13) indicate high congruence between predicted and ground-truth trajectories despite minor deviations, confirming substantial agreement between estimated and actual target positions. Further localization error analysis (Figure 14) quantifies discrepancies between collaborative positioning results and ground-truth locations. The error distribution demonstrates that 94.7% of measurements remain below 0.08 m, with sporadic peaks not exceeding 0.10 m. Critically, over 98% of errors are constrained within 10 cm, validating the effectiveness of our multi-UAV collaborative localization framework.
Figure 13. Multi-UAV collaborative target localization results.
Figure 14. Multi-UAV collaborative target localization error.
To validate the performance of the proposed multi-UAV collaborative tracking method, experiments were conducted in complex obstacle-rich scenarios using a moving ground vehicle as the target. Three UAVs were deployed to collaboratively track the vehicle. Figure 15 illustrates positional relationships between the target vehicle and observation UAVs at multiple timestamps, clearly showing light-blue occlusion-free zones and multi-angle coverage formed around the target. Figure 16 presents complete trajectories of both target and UAVs during tracking, while Figure 17 displays actual scene imagery. Results demonstrate that our method effectively overcomes field-of-view limitations of single UAVs in obstructed environments, ensuring continuous target tracking. The obstacle-avoiding trajectories dynamically converging toward target areas confirm the effectiveness and robustness of the collaborative tracking framework in complex scenarios.
Figure 15. Multi-UAV collaborative tracking results.
Figure 16. Multi-UAV collaborative tracking trajectories.
Figure 17. Multi-UAV collaborative tracking flight experiments. The red boxes represent the positions of the UAVs.
Quantitative evaluation further validates these advantages. As shown in Table 8, our method achieves a tracking loss rate of only 3.7% and an average positioning error of 0.03 m in such challenging environments, demonstrating both high tracking reliability and exceptional positioning accuracy.
Table 8. Quantitative evaluation of Multi-UAV Collaborative Tracking.
We compared the optimization performance of CPO and PSO in a multi-UAV cooperative tracking task (Figure 18). The experimental results indicate that PSO exhibits a rapid decline in cost during the initial iterations but enters a plateau around the 400th iteration, eventually converging to a relatively high cost level. In contrast, although CPO shows slower convergence in the early stage, it continues to optimize under the guidance of safety constraints and surpasses PSO in the later iterations, achieving a lower cost value. These results demonstrate that CPO can effectively escape local minima by leveraging constraint mechanisms and find a solution closer to the global optimum, thereby validating its comprehensive advantage in balancing safety and performance in complex multi-UAV cooperative tracking tasks.
Figure 18. Comparison of optimization algorithms in Multi-UAV collaborative tracking.
To evaluate the real-time performance of the algorithms, we conducted a quantitative comparative experiment on running time. The experiment selected 10 different routes and performed tracking control at 100 ms intervals. The average running time of both CPO and PSO algorithms on each route was calculated (Table 9). The experimental results indicate that the average running time of CPO is slightly higher than that of PSO, yet both remain within 100 ms, demonstrating satisfactory real-time computational capability. In complex environments with obstacles, this trade-off of slightly longer computation for a lower-cost, higher-quality path is justified.
Table 9. Comparison of average running times.

6. Conclusions

In this work, we propose an occlusion-aware multi-UAV collaborative target tracking framework for low-altitude dense obstacle environments. First, an occlusion-robust tracking method based on a Mamba backbone network is developed, integrating the Dilated Wavelet Receptive Field Enhancement Module (DWRFEM) and a dual-branch feature refinement framework. This enhances local feature extraction and multi-scale contextual fusion capabilities, effectively addressing discontinuous contours and feature degradation under severe occlusion. Additionally, a multi-UAV collaborative system is constructed to achieve cooperative target localization through multi-UAV ray intersection, generate occlusion-free viewpoints via obstacle-density-adaptive spherical sampling, and optimize flight trajectories using the CPO. Experimental results on the MDMT dataset demonstrate that the proposed tracker achieves a high processing speed while maintaining competitive accuracy, attaining a superior balance between speed and precision compared to other models, which makes it well-suited for edge deployment. Field tests confirm that the multi-UAV strategy successfully mitigates detection failures and tracking losses in complex low-altitude environments, leading to enhanced overall system reliability.
Despite the promising results, this work has certain limitations that warrant discussion. Our multi-UAV cooperative framework currently relies on stable and low-latency communication links for effective coordination. While this assumption holds in our controlled experimental setup with proximal devices, practical deployments in vast or complex environments may suffer from packet loss, significant delays, or intermittent interruptions due to obstacles and electromagnetic interference. These real-world communication challenges could degrade system synchronization and coordination performance. Additionally, the trajectory planning module operates based on a pre-perceived or known static obstacle density map. Consequently, the algorithm’s adaptability in highly dynamic environments—where obstacles may move unpredictably (e.g., vehicles, birds, or other UAVs)—remains a challenge. The current approach may not react sufficiently fast to sudden environmental changes, potentially leading to suboptimal or unsafe paths.
Future research will focus on enhancing communication robustness through delay-tolerant networking protocols and developing real-time dynamic obstacle prediction mechanisms to improve planning reactivity and safety in fully unknown and dynamic scenarios.

Author Contributions

Conceptualization, Y.A., R.L., C.X. and X.L.; methodology, C.X.; software, R.L.; validation, Y.A., R.L. and C.X.; formal analysis, Y.A.; investigation, C.X.; resources, Y.A.; data curation, R.L.; writing—original draft preparation, R.L. and X.L.; writing—review and editing, Y.A., R.L. and C.X.; visualization, R.L.; supervision, Y.A.; project administration, Y.A.; funding acquisition, Y.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Science and Technology Program of Hunan Province, grant number 2024JK2083 and Xiangjiang Laboratory Major Projects, grant number 24XJ01002.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. San, C.T.; Kakani, V. Smart Precision Weeding in Agriculture Using 5IR Technologies. Electronics 2025, 14, 2517. [Google Scholar] [CrossRef]
  2. Li, J.; Hua, Y.; Xue, M. MSO-DETR: A Lightweight Detection Transformer Model for Small Object Detection in Maritime Search and Rescue. Electronics 2025, 14, 2327. [Google Scholar] [CrossRef]
  3. Ouyang, Y.; Liu, W.; Yang, Q.; Mao, X.; Li, F. Trust Based Task Offloading Scheme in UAV-Enhanced Edge Computing Network. Peer-to-Peer Netw. Appl. 2021, 14, 3268–3290. [Google Scholar] [CrossRef]
  4. Dong, L.; Liu, Z.; Jiang, F.; Wang, K. Joint Optimization of Deployment and Trajectory in UAV and IRS-Assisted IoT Data Collection System. IEEE Internet Things J. 2022, 9, 21583–21593. [Google Scholar] [CrossRef]
  5. Jiang, F.; Peng, Y.; Wang, K.; Dong, L.; Yang, K. MARS: A DRL-Based Multi-Task Resource Scheduling Framework for UAV with IRS-Assisted Mobile Edge Computing System. IEEE Trans. Cloud Comput. 2023, 11, 3700–3712. [Google Scholar] [CrossRef]
  6. Jiang, F.; Wang, K.; Dong, L.; Pan, C.; Xu, W.; Yang, K. AI Driven Heterogeneous MEC System with UAV Assistance for Dynamic Environment: Challenges and Solutions. IEEE Netw. 2020, 35, 400–408. [Google Scholar] [CrossRef]
  7. Chen, G.; Zhu, P.; Cao, B.; Wang, X.; Hu, Q. Cross-Drone Transformer Network for Robust Single Object Tracking. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 4552–4563. [Google Scholar] [CrossRef]
  8. Yeom, S. Thermal Image Tracking for Search and Rescue Missions with a Drone. Drones 2024, 8, 53. [Google Scholar] [CrossRef]
  9. Chinthi-Reddy, S.R.; Lim, S.; Choi, G.S.; Chae, J.; Pu, C. DarkSky: Privacy-Preserving Target Tracking Strategies Using a Flying Drone. Veh. Commun. 2022, 35, 100459. [Google Scholar] [CrossRef]
  10. Wang, K.; Yu, X.; Yu, W.; Li, G.; Lan, X.; Ye, Q.; Jiao, J.; Han, Z. ClickTrack: Towards Real-Time Interactive Single Object Tracking. Pattern Recogn. 2025, 161, 111211. [Google Scholar] [CrossRef]
  11. Lopez-Sanchez, I.; Moreno-Valenzuela, J. PID Control of Quadrotor UAVs: A Survey. Annu. Rev. Control. 2023, 56, 100900. [Google Scholar] [CrossRef]
  12. Song, Y.; Scaramuzza, D. Policy Search for Model Predictive Control with Application to Agile Drone Flight. IEEE Trans. Robot. 2022, 38, 2114–2130. [Google Scholar] [CrossRef]
  13. Saleem, O.; Kazim, M.; Iqbal, J. Robust Position Control of VTOL UAVs Using a Linear Quadratic Rate-Varying Integral Tracker: Design and Validation. Drones 2025, 9, 73. [Google Scholar] [CrossRef]
  14. Dang, Z.; Sun, X.; Sun, B.; Guo, R.; Li, C. OMCTrack: Integrating Occlusion Perception and Motion Compensation for UAV Multi-Object Tracking. Drones 2024, 8, 480. [Google Scholar] [CrossRef]
  15. Chang, Y.; Zhou, H.; Wang, X.; Shen, L.; Hu, T. Cross-Drone Binocular Coordination for Ground Moving Target Tracking in Occlusion-Rich Scenarios. IEEE Robot. Autom. Lett. 2020, 5, 3161–3168. [Google Scholar] [CrossRef]
  16. Hansen, J.G.; de Figueiredo, R.P. Active Object Detection and Tracking Using Gimbal Mechanisms for Autonomous Drone Applications. Drones 2024, 8, 55. [Google Scholar] [CrossRef]
  17. Meibodi, F.A.; Alijani, S.; Najjaran, H. A Deep Dive into Generic Object Tracking: A Survey. arXiv 2025, arXiv:2507.23251. [Google Scholar] [CrossRef]
  18. Bolme, D.S.; Beveridge, J.R.; Draper, B.A.; Lui, Y.M. Visual Object Tracking Using Adaptive Correlation Filters. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 2544–2550. [Google Scholar]
  19. Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. Exploiting the Circulant Structure of Tracking-by-Detection with Kernels. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 702–715. [Google Scholar]
  20. Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-Speed Tracking with Kernelized Correlation Filters. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 583–596. [Google Scholar] [CrossRef]
  21. Yang, S. A Novel Study on Deep Learning Framework to Predict and Analyze the Financial Time Series Information. Future Gener. Comput. Syst. 2021, 125, 812–819. [Google Scholar] [CrossRef]
  22. Zhang, P.; Liu, X.; Li, W.; Yu, X. Pharmaceutical Cold Chain Management Based on Blockchain and Deep Learning. J. Internet Technol. 2021, 22, 1531–1542. [Google Scholar] [CrossRef]
  23. Shi, D.; Zheng, H. A Mortality Risk Assessment Approach on ICU Patients Clinical Medication Events Using Deep Learning. Comput. Model. Eng. Sci. 2021, 128, 161–181. [Google Scholar] [CrossRef]
  24. Tong, Y.; Sun, W. The Role of Film and Television Big Data in Real-Time Image Detection and Processing in the Internet of Things Era. J. Real-Time Image Process. 2021, 18, 1115–1127. [Google Scholar] [CrossRef]
  25. Zhou, W.; Zhao, Y.; Chen, W.; Liu, Y.; Yang, R.; Liu, Z. Research on Investment Portfolio Model Based on Neural Network and Genetic Algorithm in Big Data Era. EURASIP J. Wirel. Commun. Netw. 2020, 2020, 228. [Google Scholar] [CrossRef]
  26. Zeng, Y.; Ouyang, S.; Zhu, T.; Li, C. E-Commerce Network Security Based on Big Data in Cloud Computing Environment. Mob. Inf. Syst. 2022, 2022, 9935244. [Google Scholar] [CrossRef]
  27. Tao, R.; Gavves, E.; Smeulders, A.W. Siamese Instance Search for Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1420–1429. [Google Scholar]
  28. Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High Performance Visual Tracking with Siamese Region Proposal Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8971–8980. [Google Scholar]
  29. Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 8126–8135. [Google Scholar]
  30. Lin, L.; Fan, H.; Zhang, Z.; Wang, Y.; Xu, Y.; Ling, H. Tracking Meets Lora: Faster Training, Larger Model, Stronger Performance. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 300–318. [Google Scholar]
  31. Kopyt, A.; Narkiewicz, J.; Radziszewski, P. An Unmanned Aerial Vehicle Optimal Selection Methodology for Object Tracking. Adv. Mech. Eng. 2018, 10, 1–12. [Google Scholar] [CrossRef]
  32. Gong, K.; Cao, Z.; Xiao, Y.; Fang, Z. Abrupt-Motion-Aware Lightweight Visual Tracking for Unmanned Aerial Vehicles. Vis. Comput. 2021, 37, 371–383. [Google Scholar] [CrossRef]
  33. Lin, C.; Zhang, W.; Shi, J. Tracking Strategy of Unmanned Aerial Vehicle for Tracking Moving Target. Int. J. Control Autom. Syst. 2021, 19, 2183–2194. [Google Scholar] [CrossRef]
  34. Lee, K.; Chang, H.J.; Choi, J.; Heo, B.; Leonardis, A.; Choi, J.Y. Motion-Aware Ensemble of Three-Mode Trackers for Unmanned Aerial Vehicles. Mach. Vis. Appl. 2021, 32, 54. [Google Scholar] [CrossRef]
  35. Campos-Martínez, S.-N.; Hernández-González, O.; Guerrero-Sánchez, M.-E.; Valencia-Palomo, G.; Targui, B.; López-Estrada, F.-R. Consensus Tracking Control of Multiple Unmanned Aerial Vehicles Subject to Distinct Unknown Delays. Machines 2024, 12, 337. [Google Scholar] [CrossRef]
  36. Zhou, Z.; Hu, J.; Chen, B.; Shen, X.; Meng, B. Target Tracking and Circumnavigation Control for Multi-Unmanned Aerial Vehicle Systems Using Bearing Measurements. Actuators 2024, 13, 323. [Google Scholar] [CrossRef]
  37. Zhang, C.; Wang, Y.; Zheng, W. Multi-UAVs Tracking Non-Cooperative Target Using Constrained Iterative Linear Quadratic Gaussian. Drones 2024, 8, 326. [Google Scholar] [CrossRef]
  38. Upadhyay, J.; Rawat, A.; Deb, D. Multiple Drone Navigation and Formation Using Selective Target Tracking-Based Computer Vision. Electronics 2021, 10, 2125. [Google Scholar] [CrossRef]
  39. Liu, Y.; Li, X.; Wang, J.; Wei, F.; Yang, J. Reinforcement-Learning-Based Multi-Uav Cooperative Search for Moving Targets in 3D Scenarios. Drones 2024, 8, 378. [Google Scholar] [CrossRef]
  40. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. Vmamba: Visual State Space Model. Adv. Neural Inf. Process. Syst. 2024, 37, 103031–103063. [Google Scholar]
  41. Finder, S.E.; Amoyal, R.; Treister, E.; Freifeld, O. Wavelet Convolutions for Large Receptive Fields. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 363–380. [Google Scholar]
  42. Zhong, S.; Wen, W.; Qin, J. Mix-Pooling Strategy for Attention Mechanism. arXiv 2022, arXiv:2208.10322. [Google Scholar] [CrossRef]
  43. Kang, B.; Chen, X.; Wang, D.; Peng, H.; Lu, H. Exploring Lightweight Hierarchical Vision Transformers for Efficient Visual Tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 9612–9621. [Google Scholar]
  44. Chelly, I.; Finder, S.E.; Ifergane, S.; Freifeld, O. Trainable Highly-Expressive Activation Functions. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 200–217. [Google Scholar]
  45. Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Hou, T.Y.; Tegmark, M. Kan: Kolmogorov-Arnold Networks. arXiv 2024, arXiv:2404.19756. [Google Scholar]
  46. Liu, Z.; Shang, Y.; Li, T.; Chen, G.; Wang, Y.; Hu, Q.; Zhu, P. Robust Multi-Drone Multi-Target Tracking to Resolve Target Occlusion: A Benchmark. IEEE Trans. Multimed. 2023, 25, 1462–1476. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
