Electronics
  • Article
  • Open Access

14 October 2025

Real-Time Occluded Target Detection and Collaborative Tracking Method for UAVs

1 School of Frontier Interdisciplinary Studies, Hunan University of Technology and Business, Changsha 410205, China
2 School of Intelligent Engineering and Intelligent Manufacturing, Hunan University of Technology and Business, Changsha 410205, China
3 Xiang Jiang Laboratory, Changsha 410205, China
4 School of Artificial Intelligence and Advanced Computing, Hunan University of Technology and Business, Changsha 410205, China
This article belongs to the Special Issue Digital Intelligence Technology and Applications, 2nd Edition

Abstract

To address the failure of unmanned aerial vehicle (UAV) target tracking caused by occlusion and limited field of view in dense low-altitude obstacle environments, this paper proposes a novel framework integrating occlusion-aware modeling and multi-UAV collaboration. A lightweight tracking model based on the Mamba backbone is developed, incorporating a Dilated Wavelet Receptive Field Enhancement Module (DWRFEM) to fuse multi-scale contextual features, significantly mitigating contour fragmentation and feature degradation under severe occlusion. A dual-branch feature optimization architecture is designed, combining the Distilled Tanh Activation with Context (DiTAC) activation function and Kolmogorov–Arnold Network (KAN) bottleneck layers to enhance discriminative feature representation. To overcome the limitations of single-UAV perception, a multi-UAV cooperative system is established. Ray intersection is employed to reduce localization uncertainty, while spherical sampling viewpoints are dynamically generated based on obstacle density. Safe trajectory planning is achieved using a Crested Porcupine Optimizer (CPO). Experiments on the Multi-Drone Multi-Target Tracking (MDMT) dataset demonstrate that the model achieves 84.1% average precision (AP) at 95 Frames Per Second (FPS), striking a favorable balance between speed and accuracy, making it suitable for edge deployment. Field tests with three collaborative UAVs show sustained target coverage in complex environments, outperforming traditional single-UAV approaches. This study provides a systematic solution for robust tracking in challenging low-altitude scenarios.

1. Introduction

Drones are widely employed in various domains, such as power systems, agriculture, disaster relief, logistics, intelligent transportation, and environmental protection [,,]. Their adoption is driven by superior maneuverability, flexible deployment capabilities, and expansive aerial perspectives [,,]. However, during low-altitude flight operations, complex obstacle environments frequently obstruct the drone’s field of view, not only degrading target tracking performance but also posing significant threats to flight safety.
The target tracking mission for drones comprises two distinct phases: target perception and tracking control. During the target perception phase, sensors including visible-light cameras, infrared cameras, and LiDAR collect environmental data [,,]. Leveraging computer vision techniques, this phase accomplishes target detection, recognition, and localization, thereby achieving dynamic perception and state estimation of the target. Single-Object Tracking (SOT), as the core technology of the perception phase, enables continuous localization of specific targets by dynamically modeling appearance variations and motion trajectories [].
During the tracking control phase, flight trajectories and gimbal attitudes are dynamically adjusted based on real-time perception results. Control algorithms including Proportional-Integral-Derivative (PID), Model Predictive Control (MPC), and Linear Quadratic Regulator (LQR) regulate flight velocity and pose to maintain optimal proximity and observational perspective toward the target, ensuring persistent tracking stability [,,]. When executing target tracking missions in low-altitude, obstacle-dense environments, drones frequently lose tracked objects due to targets becoming partially obscured. Although existing single-drone tracking methods can mitigate local occlusion challenges, they remain constrained by the limited perceptual coverage of individual platforms, hindering real-time observational perspective adjustments for highly maneuverable targets [,,]. Particularly during abrupt target maneuvers or within densely cluttered regions, the restricted sensing footprint of single drones inevitably causes targets to persistently fall out of the field of view.
Compared to a single UAV, multi-UAV systems significantly enhance the robustness and continuity of target tracking in complex obstacle-laden environments by leveraging multi-view synchronous perception and collaborative tracking mechanisms.

Key Points of the Article

To address the persistent challenges of target occlusion and limited field-of-view encountered by single drones in low-altitude, obstacle-dense environments, this paper proposes a multi-UAV Collaborative Occlusion-Aware Tracking Method:
  • To address the challenges of contour discontinuity and feature discriminability degradation in heavily occluded environments, an occlusion-robust tracking methodology is proposed. This approach employs a Mamba backbone architecture integrated with a DWRFEM, which amplifies local feature extraction capabilities. Multi-scale contextual information is fused via Wavelet Depthwise Separable Dilated Convolutions (WDSDConv), preserving holistic target contours while enhancing obscured edge detail resolution. Simultaneously, a dual-branch feature refinement framework incorporates DiTAC to elevate nonlinear feature representation, complemented by integrated KAN bottleneck layers for target saliency amplification. This methodology substantially mitigates occlusion-induced feature degradation, enabling sustained tracking robustness under severe occlusion scenarios.
  • To overcome tracking interruptions caused by individual drones’ field-of-view limitations and complete occlusions, a multi-UAV collaborative tracking framework is introduced. First, cooperative localization is established via multi-UAV ray intersection to reduce positioning uncertainty. Subsequently, an adaptive spherical sampling algorithm dynamically generates occlusion-free viewpoint distributions based on obstacle density, ensuring continuous target presence within the collective visual coverage. Finally, flight path optimization integrates the CPO for smooth trajectory generation, while real-time yaw angle adjustments maintain target-centering within the visual field.

3. Occlusion-Robust Target Tracking Methodology

Target occlusion in low-altitude complex obstacle environments frequently induces detection failures and target loss. To address this, we propose an occlusion-robust target tracking methodology based on a Mamba Backbone []. The framework comprises three core components (Figure 1): Mamba Backbone, Bottleneck Layer, and Detection Head. The Mamba Backbone integrates Visual State-Space (VSS) Block modules and Spatial Pyramid Pooling Fast (SPPF) modules. Within each VSS Block, the DWRFEM augments contextual semantic representation through multi-scale feature extraction, strengthening pixel-level feature discrimination. Template and search region features undergo fusion within the Bottleneck Layer, which employs a multi-level pyramid architecture grounded in self-attention mechanisms. This design facilitates hierarchical extraction and integration of global contextual features across spatial resolutions.
Figure 1. Framework of the proposed occlusion-robust target tracking methodology. The red boxes in the figure indicate the tracking detection results.

3.1. Mamba Backbone

The Mamba Backbone constitutes the core architecture of our tracking methodology. It captures long-range spatial dependencies across video frames—including positional and postural variations—enabling state prediction via target templates. This capability enhances tracking stability and temporal coherence while reducing target loss likelihood during transient occlusions. Furthermore, the network’s hierarchical feature extraction integrates local-to-global spatial semantics, improving adaptation to pose deformation, scale variation, and other complex scenarios. These attributes collectively enhance tracking precision and robustness. Architecturally, the Mamba Backbone stacks multiple VSS Blocks with Downsampling modules, culminating in an SPPF module for multi-scale feature fusion (Figure 2).
Figure 2. Mamba backbone network architecture.
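As a structural illustration of this stacking pattern, the following PyTorch sketch alternates placeholder VSS Blocks with downsampling stages and closes with an SPPF-style module; the block counts, channel widths, and the placeholder constructors are assumptions for illustration only, not the paper's exact configuration.

```python
# Minimal structural sketch of the backbone stacking described above.
# Assumptions: block counts, channel widths, and the placeholder block/SPPF
# constructors are illustrative; the real VSS Block and SPPF are defined later.
import torch
import torch.nn as nn

class Downsample(nn.Module):
    """Halve the spatial resolution and change channels with a strided conv."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return self.conv(x)

class MambaBackboneSketch(nn.Module):
    def __init__(self, make_vss_block, make_sppf, widths=(64, 128, 256)):
        super().__init__()
        stages = []
        for c_in, c_out in zip(widths[:-1], widths[1:]):
            stages += [make_vss_block(c_in), Downsample(c_in, c_out)]
        stages += [make_vss_block(widths[-1]), make_sppf(widths[-1])]
        self.stages = nn.Sequential(*stages)

    def forward(self, x):
        return self.stages(x)

# Identity placeholders keep the sketch runnable without the full VSS/SPPF code.
backbone = MambaBackboneSketch(lambda c: nn.Identity(), lambda c: nn.Identity())
features = backbone(torch.randn(1, 64, 256, 256))
```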

3.1.1. VSS Block

The VSS Block serves as the fundamental component of the Mamba backbone network, primarily responsible for visual feature extraction and information propagation. It employs a 2D Selective Scan (SS2D) module to perform multi-directional scanning of input images, capturing contextual information and spatial features (Figure 3). To further enhance the model’s capacity for local feature extraction and contextual integration in visual data, we introduce a DWRFEM based on wavelet convolution []. This module augments depth-wise receptive fields through wavelet transformations, thereby improving feature representation capabilities.
Figure 3. VSS block networks architecture.
The DWRFEM incorporates dual principal pathways (Figure 4). The primary branch performs conventional feature extraction via convolutional operations. The secondary branch decomposes into three parallel sub-branches, each employing WDSDConv. By utilizing distinct dilation rates in the WDSDConv layers, these sub-branches extract features at different spatial granularities, thereby facilitating multi-scale feature fusion.
Figure 4. DWRFEM networks architecture.
The WDSDConv module (Figure 5) substitutes the standard 3 × 3 convolution with a depthwise dilated convolution (DDConv) and a 1 × 1 convolution [41]. DDConv applies a single convolutional filter per input channel with a specified dilation rate $d$. The output for channel $c$ is computed as

$$\hat{Y}_c(x, y) = \sum_{k=1}^{K} \sum_{l=1}^{K} W_c^{(d)}[k, l]\, X_c[x + d \cdot k,\; y + d \cdot l]$$

where $W^{(d)} \in \mathbb{R}^{C_{in} \times K \times K}$ ($C_{in}$ is the number of input channels, and $C_{in} = 1$ in DDConv) is the depthwise kernel, $K$ is the kernel size (default $K = 3$), and $d$ is the dilation rate.
Figure 5. WDSDConv architecture. WT stands for wavelet transform and IWT stands for inverse wavelet transform.
WDSDConv delivers dual improvements: dilated convolution strategically expands the receptive field through kernel sparsity, while depthwise separable convolution reduces computational demands by factorizing operations into depthwise and pointwise components. WDSDConv preserves wavelet transformations’ multi-resolution analysis capabilities while simultaneously enhancing global contextual modeling efficiency and minimizing computational overhead.
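As a hedged sketch of this factorization, the snippet below implements the depthwise dilated convolution from the equation above followed by a 1 × 1 pointwise convolution, with three parallel branches at different dilation rates as in the DWRFEM secondary path; the channel count, the dilation rates (1, 3, 5), and the additive fusion are assumptions, and the wavelet/inverse-wavelet stages are omitted.

```python
# Sketch of the depthwise-separable dilated convolution core of WDSDConv.
# Assumptions: dilation rates (1, 3, 5), 64 channels, and additive fusion are
# illustrative; the wavelet transform (WT/IWT) stages are not shown here.
import torch
import torch.nn as nn

class DepthwiseSeparableDilatedConv(nn.Module):
    def __init__(self, channels, dilation, kernel_size=3):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2          # keep spatial size
        # DDConv: one filter per input channel (groups=channels) with dilation d.
        self.ddconv = nn.Conv2d(channels, channels, kernel_size,
                                padding=pad, dilation=dilation, groups=channels)
        # 1x1 pointwise convolution mixes information across channels.
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.ddconv(x))

# Three parallel sub-branches with distinct dilation rates, fused additively.
branches = nn.ModuleList(DepthwiseSeparableDilatedConv(64, d) for d in (1, 3, 5))
x = torch.randn(1, 64, 32, 32)
fused = sum(branch(x) for branch in branches)
```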

3.1.2. MixPooling SPPF

The SPPF module captures multi-scale features through hierarchical pooling operations, thereby accommodating targets of varying sizes. To enhance generalization performance, we substitute the standard MaxPooling in SPPF with MixPooling [] (Figure 6). This strategic modification significantly improves model robustness when processing diverse input data distributions.
Figure 6. SPPF networks architecture.
MixPooling stochastically selects pooling operators during training, introducing adaptive uncertainty that compels the model to diversify feature representations. This mechanism mitigates over-reliance on specific features by dynamically alternating between maximum and average pooling. The operation is formally defined as
$$P_{ij} = \begin{cases} \max(R_{ij}), & \text{if } \delta = 0 \\ \operatorname{avg}(R_{ij}), & \text{if } \delta = 1 \end{cases}$$

where $R_{ij}$ denotes the elements in the pooling region, and $\delta$ is a stochastic selector determining the pooling modality.
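A minimal sketch of this stochastic selection follows, assuming a Bernoulli selector with probability 0.5 during training and deterministic average pooling at inference; the kernel size, stride, and inference-time choice are assumptions rather than the paper's exact settings.

```python
# Sketch of MixPooling: a stochastic selector delta chooses max- or average-
# pooling per forward pass during training.
# Assumptions: 0.5 selection probability, 5x5 kernel, and average pooling at
# inference time are illustrative defaults.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixPool2d(nn.Module):
    def __init__(self, kernel_size=5, stride=1, padding=2, p_max=0.5):
        super().__init__()
        self.k, self.s, self.pad, self.p_max = kernel_size, stride, padding, p_max

    def forward(self, x):
        if self.training:
            use_max = torch.rand(()) < self.p_max      # stochastic selector delta
            pool = F.max_pool2d if use_max else F.avg_pool2d
        else:
            pool = F.avg_pool2d                        # deterministic at inference
        return pool(x, self.k, self.s, self.pad)
```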

3.2. Bottleneck Layer

The Bottleneck Layer employs a multi-layer pyramidal architecture based on self-attention mechanisms (Figure 7) []. Its primary function involves performing deep global feature extraction and fusion between template image features and search region features, thereby generating more discriminative fused representations.
Figure 7. Bottleneck layer networks architecture.

3.2.1. Feature Extraction with MHA

Self-attention mechanisms have gained widespread adoption in computer vision for their capacity to capture long-range dependencies. By computing cross-position feature correlations, they enhance feature representational power through contextual integration. Within the bottleneck layer architecture, this capability is primarily implemented via two core modules: the Multi-Head Attention (MHA) and Scaled Attention (SA) mechanisms. The MHA operation is formally defined as
$$\operatorname{Attn}(Q, K, V, B) = \operatorname{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}} + B\right) V$$

$$H_i = \operatorname{Hardswish}\!\left(\operatorname{Attn}\!\left(X_{input} W_i^{Q},\, X_{input} W_i^{K},\, X_{input} W_i^{V},\, B_i\right)\right)$$

$$\operatorname{MHA}(X_{input}) = \operatorname{Concat}(H_1, H_2, \ldots, H_N)\, W^{O}$$

where $X_{input}$ denotes the input features, $B_i$ represents the positional bias, $N$ specifies the number of attention heads, and $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, $W^{O}$ correspond to learnable weight matrices.
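The following PyTorch sketch shows bias-augmented multi-head attention in the form of the equations above, with the activation applied to each head's output; the head count, sequence length, shape of the learnable bias table, and the Hardswish placement are assumptions for illustration (the DiTAC substitution described next is not shown).

```python
# Sketch of multi-head attention with an additive positional bias B_i and a
# Hardswish activation on each head's output, mirroring the equations above.
# Assumptions: head count, sequence length, and the full (heads, seq, seq) bias
# table are illustrative choices, not the paper's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiasedMHA(nn.Module):
    def __init__(self, dim, num_heads, seq_len):
        super().__init__()
        self.h, self.dk = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)                 # W^Q, W^K, W^V stacked
        self.proj = nn.Linear(dim, dim)                    # W^O
        self.bias = nn.Parameter(torch.zeros(num_heads, seq_len, seq_len))  # B_i
        self.act = nn.Hardswish()

    def forward(self, x):                                  # x: (batch, seq, dim)
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.h, self.dk).transpose(1, 2) for t in (q, k, v))
        attn = F.softmax(q @ k.transpose(-2, -1) / self.dk ** 0.5 + self.bias, dim=-1)
        heads = self.act(attn @ v)                         # H_i = Hardswish(Attn(...))
        return self.proj(heads.transpose(1, 2).reshape(b, n, -1))
```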
To enhance model performance, we refine both SA and MHA modules by substituting Hardswish activations with the DiTAC function []:
$$\operatorname{DiTAC}(x) = \tilde{x} \cdot F(x)$$

$$\tilde{x} = \begin{cases} T_{\theta}(x), & \text{if } a \le x \le b \\ x, & \text{otherwise} \end{cases}$$

where $x$ denotes the input, $F(x)$ represents the cumulative distribution function of the standard normal distribution, and $T_{\theta}(x)$ constitutes a learnable diffeomorphic transformation defined on the domain $[a, b]$. The DiTAC activation function, as a trainable activation mechanism based on diffeomorphic transformations, provides richer nonlinear characteristics compared to Hardswish. By applying diffeomorphic transformations across varying domains, DiTAC adaptively reconfigures its activation profile to better accommodate diverse feature distributions. This modification not only enhances the model’s feature representational capacities but also effectively mitigates overfitting while improving generalization performance.

3.2.2. Feature Enhancement and Fusion with KAN

To strengthen nonlinear fitting capabilities, we implement KAN [] within the Bottleneck Layer as a substitute for conventional Multi-Layer Perceptron (MLP) modules.
MLP modules utilize fixed activation functions to perform nonlinear transformations at each node. Their output is obtained by applying activation functions to the product of inputs and weight matrices (Figure 8a):
$$\operatorname{MLP}(x) = W_2\, \sigma(W_1 x)$$

where $W_1$ and $W_2$ are linear weight matrices, and $\sigma$ denotes a fixed activation function. In contrast, KAN places learnable activation functions on the edges (weights), enabling dynamic adaptation to data distributions (Figure 8b):

$$\operatorname{KAN}(x) = \psi_2(\psi_1(x))$$

where $\psi_1$ and $\psi_2$ are function matrices comprising learnable activation functions, each parameterized by spline curves. We adopt cubic B-splines (degree $= 3$) and incorporate a base SiLU activation function for stable training:

$$\psi(x) = \operatorname{SiLU}(x) + \sum_{g=1}^{G} \varepsilon_g B_g(x)$$

where $B_g$ denotes the $g$-th B-spline basis function of a fixed order, $\varepsilon_g$ are the trainable coefficients that define the shape of the activation function, and $G$ is the number of basis functions (we set $G = 5$). By incorporating the KAN module, the model enhances its nonlinear fitting capabilities while maintaining computational efficiency.
Figure 8. KAN architecture.
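A hedged sketch of one such KAN layer follows: each edge activation takes the form $\psi(x) = \operatorname{SiLU}(x) + \sum_g \varepsilon_g B_g(x)$ with cubic B-splines and $G = 5$, computed via the Cox–de Boor recursion; the knot range, initialization scale, and per-edge base weights are assumptions rather than the paper's exact implementation.

```python
# Sketch of a KAN layer whose edge activations follow
# psi(x) = SiLU(x) + sum_g eps_g * B_g(x), with cubic B-splines and G = 5.
# Assumptions: the knot range, initialization scale, and per-edge base weights
# are illustrative; this is not the paper's exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def bspline_basis(x, grid, k=3):
    """Cox-de Boor recursion. x: (N, in_dim), grid: (G + k + 1,) knots.
    Returns basis values of shape (N, in_dim, G)."""
    x = x.unsqueeze(-1)
    b = ((x >= grid[:-1]) & (x < grid[1:])).to(x.dtype)
    for d in range(1, k + 1):
        b = ((x - grid[:-(d + 1)]) / (grid[d:-1] - grid[:-(d + 1)]) * b[..., :-1]
             + (grid[d + 1:] - x) / (grid[d + 1:] - grid[1:-d]) * b[..., 1:])
    return b

class KANLinear(nn.Module):
    def __init__(self, in_dim, out_dim, G=5, k=3):
        super().__init__()
        # G basis functions of degree k require G + k + 1 knots.
        self.register_buffer("grid", torch.linspace(-1.5, 1.5, G + k + 1))
        self.base_weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.1)
        self.spline_coef = nn.Parameter(torch.randn(out_dim, in_dim, G) * 0.1)  # eps_g
        self.k = k

    def forward(self, x):                                   # x: (N, in_dim)
        base = F.silu(x) @ self.base_weight.T               # SiLU base term per edge
        basis = bspline_basis(x, self.grid, self.k)         # (N, in_dim, G)
        spline = torch.einsum("nig,oig->no", basis, self.spline_coef)
        return base + spline
```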

4. Multi-Perspective Collaborative Tracking with Multi-UAV

4.1. Collaborative Target Localization with Multi-UAV

This paper employs multi-UAV collaborative localization to determine target positions. Given known UAV positions, the relative angle between the target and each UAV is determined from the bounding box centroid in the first-view perspective and the camera model. Combining UAV position, orientation, and this relative angle yields the ray $l_k$ from the target to UAV $k$ in the world coordinate system. For any two UAVs $k_1$ and $k_2$, the point $p_{k_1 k_2}$ minimizes the summed distances from $p_{k_1 k_2}$ to rays $l_{k_1}$ and $l_{k_2}$:

$$\min_{p_{k_1 k_2}} \; D(l_{k_1}, p_{k_1 k_2}) + D(l_{k_2}, p_{k_1 k_2})$$

where $D(l, p)$ represents the distance from observation position $p$ to ray $l$. This is geometrically equivalent to finding the ‘midpoint’ of the shortest segment connecting the two non-coplanar rays, providing a best-fit intersection point for this UAV pair. The final target position is then determined by fusing all such pairwise estimates:

$$\min_{p_{target}} \sum_{a, b \in \{1, 2, \ldots, m\},\; a < b} D(p_{target}, p_{k_a k_b})$$

where $D(p_1, p_2)$ denotes the distance between two points, and $m$ represents the number of UAVs. This method, known as finding the geometric median, enhances robustness against outliers and measurement noise from any single UAV. For target trajectory prediction, B-spline curves are employed to fit historical observations of target positions, and a Kalman filtering algorithm is utilized to obtain future waypoints.
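The following NumPy sketch illustrates both steps under the assumption that each ray is given by an origin and a direction in world coordinates: the closest-approach midpoint between two rays, and a simple Weiszfeld iteration for the geometric median of the pairwise estimates; the function names and iteration count are illustrative.

```python
# Sketch of the pairwise ray-intersection step and the geometric-median fusion.
# Assumptions: rays are given as (origin, direction) in world coordinates; the
# Weiszfeld iteration count and epsilon guard are illustrative choices.
import numpy as np

def ray_midpoint(o1, d1, o2, d2):
    """Midpoint of the shortest segment between rays o1 + t*d1 and o2 + s*d2."""
    d1, d2 = d1 / np.linalg.norm(d1), d2 / np.linalg.norm(d2)
    w = o1 - o2
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ w, d2 @ w
    denom = a * c - b * b
    if abs(denom) < 1e-9:                      # nearly parallel rays
        t, s = 0.0, e / c
    else:
        t = (b * e - c * d) / denom
        s = (a * e - b * d) / denom
    return 0.5 * ((o1 + t * d1) + (o2 + s * d2))

def geometric_median(points, iters=50):
    """Weiszfeld iteration: the point minimizing the summed distance to all estimates."""
    p = points.mean(axis=0)
    for _ in range(iters):
        dist = np.linalg.norm(points - p, axis=1) + 1e-9
        p = (points / dist[:, None]).sum(axis=0) / (1.0 / dist).sum()
    return p
```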

4.2. Spherical Visibility Sampling with Viewpoint Optimization

In dynamic environments, target motion exhibits high randomness and unpredictability. Abrupt directional changes may cause traditional tracking methods to lose targets. To address this challenge, the multi-UAV system must acquire all observable regions surrounding the target in real-time, enabling dynamic perception and comprehensive coverage. This paper presents an obstacle-aware visible region generation method based on spherical sampling to determine multi-UAV observation points.
Given known map and obstacle data, a sampling sphere $S$ centered at the target position $p_{target} \in \mathbb{R}^3$ is constructed. Its radius $r$ is determined by the UAV’s maximum observation range and an environmental safety margin:

$$r = \min\!\left(d_{\max},\; \min_{o \in O_{obs}} \left\lVert p_{target} - o \right\rVert_2 - \delta_{safe}\right)$$

where $d_{\max}$ denotes the predefined maximum observation distance, $O_{obs}$ represents the obstacle set, and $\delta_{safe}$ is the safety buffer distance (defaulting to $\delta_{safe} = 0.5$ m). Equation (13) dynamically defines a safety-aware sphere that ensures both observation validity and flight safety. The spherical sampling point set $\{s_i\}_{i=1}^{N_s}$ is generated via spherical coordinate parameterization:

$$s_i = p_{target} + r \begin{bmatrix} \sin\theta_i \cos\phi_i \\ \sin\theta_i \sin\phi_i \\ \cos\theta_i \end{bmatrix}, \quad \phi_i \in [0, 2\pi],\; \theta_i \in [0, \pi]$$

where $\phi_i$ is the azimuth angle and $\theta_i$ denotes the zenith angle. Equation (14) generates a uniform set of candidate viewpoints on the surface of the aforementioned sphere. The number of sampling points $N_s$ is adaptively adjusted based on environmental complexity, which is quantitatively measured by the density of obstacles within the sphere:

$$N_s = N_{base}\left(1 + \eta\, \frac{\left|O_{obs} \cap B(p_{target}, r)\right|}{4\pi r^2 / \Delta A_{grid}}\right)$$

where $N_{base}$ is the base sampling count (defaulting to $N_{base} = 100$), $\left|O_{obs} \cap B(p_{target}, r)\right|$ denotes the number of grid cells occupied by obstacles within the sphere, $\Delta A_{grid}$ represents the grid map resolution, and $\eta$ is the density sensitivity coefficient ($\eta = 0.5$). Equation (15) enables the algorithm to dynamically adapt the sampling density, allocating more computational resources for fine-grained sensing in complex environments.
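The sketch below follows Equations (13)–(15) under the assumption that obstacles are represented as an array of occupied grid-cell centers; the uniform random parameterization of the sphere and the default parameter values mirror the text, while the helper names are illustrative.

```python
# Sketch of Equations (13)-(15): safety-aware radius, adaptive sample count,
# and spherical viewpoint sampling.
# Assumptions: obstacles are an (M, 3) array of occupied grid-cell centers and
# the helper names are illustrative; defaults follow the text.
import numpy as np

def sampling_radius(p_target, obstacles, d_max, delta_safe=0.5):
    """Equation (13): clip the sphere radius by the nearest obstacle minus a margin."""
    if len(obstacles) == 0:
        return d_max
    nearest = np.min(np.linalg.norm(obstacles - p_target, axis=1))
    return min(d_max, nearest - delta_safe)

def adaptive_sample_count(p_target, obstacles, r, grid_res, n_base=100, eta=0.5):
    """Equation (15): scale the base count by the obstacle density inside the sphere."""
    if len(obstacles) == 0:
        return n_base
    occupied = np.linalg.norm(obstacles - p_target, axis=1) <= r
    density = occupied.sum() / (4.0 * np.pi * r ** 2 / grid_res)
    return int(n_base * (1.0 + eta * density))

def sample_sphere(p_target, r, n, rng=np.random.default_rng()):
    """Equation (14): candidate viewpoints on the sphere surface (area-uniform)."""
    phi = rng.uniform(0.0, 2.0 * np.pi, n)          # azimuth
    theta = np.arccos(rng.uniform(-1.0, 1.0, n))    # zenith
    dirs = np.stack([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)], axis=1)
    return p_target + r * dirs
```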

4.2.1. Line-of-Sight Reachability Detection

For each sampling point $s_i$, collision detection is performed between the line segment $\overline{p_{target}\, s_i}$ and the obstacles:

$$\operatorname{Visible}(s_i) = \begin{cases} 1, & \text{if } \forall r \in \overline{p_{target}\, s_i},\; r \notin O_{obs} \\ 0, & \text{otherwise} \end{cases}$$

where $O_{obs}$ represents the obstacle set. Equation (16) checks if the entire segment between the target and the candidate viewpoint lies in free space, ensuring an unobstructed view for observation. Line-of-sight reachability is determined via ray-triangle intersection detection:

$$r(t) = p_{target} + t\,(s_i - p_{target}), \quad t \in [0, 1]$$
Occlusion is determined if the ray intersects any obstacle triangle patch. Equation (17) parameterizes the continuous line segment from the target point to the sampling point as a ray, forming a computable mathematical path for collision detection algorithms.
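The sketch below follows Equations (16)–(17) under the assumption that an occupancy query is available for the obstacle map; dense sampling of the segment stands in for exact ray-triangle intersection tests, and the helper name is hypothetical.

```python
# Sketch of the line-of-sight test in Equations (16)-(17): sample the segment
# from the target to the candidate viewpoint and query an occupancy map.
# Assumptions: `occupied(point)` is a hypothetical wrapper around the obstacle
# map, and dense sampling stands in for exact ray-triangle intersection tests.
import numpy as np

def visible(p_target, s_i, occupied, step=0.1):
    """Return True if the segment p_target -> s_i crosses no occupied cell."""
    length = np.linalg.norm(s_i - p_target)
    n_steps = max(int(length / step), 1)
    for t in np.linspace(0.0, 1.0, n_steps + 1):
        r_t = p_target + t * (s_i - p_target)   # Equation (17), sampled at t
        if occupied(r_t):
            return False
    return True
```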

4.2.2. Visibility Volume Fusion

The visible sampling point set $\{s_i^{vis}\}$ is projected onto the tangent plane at the target point (with normal vector $n_t$ aligned with the gravity direction), generating a two-dimensional point set $\{v_i\}_{i=1}^{M}$:

$$v_i = s_i^{vis} - \left[\left(s_i^{vis} - p_{target}\right) \cdot n_t\right] n_t$$

The convex hull $H$ of the point set $\{v_i\}$ is computed, and its boundary vertex sequence $\{V_k\}_{k=1}^{K}$ defines the visible region boundary. These convex hull vertices are converted to polar coordinates $(\rho_k, \alpha_k)$, with visibility sectors generated by coalescing continuous angular intervals:

$$\Omega_j = \left\{\alpha \;\middle|\; \alpha_j^{start} \le \alpha \le \alpha_j^{end}\right\}, \quad j = 1, \ldots, L$$

The size of each region is quantified by its central angle $\Delta\alpha_j = \alpha_j^{end} - \alpha_j^{start}$.
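The sketch below strings these steps together under several assumptions: scipy's ConvexHull stands in for the hull computation, the tangent-plane normal is the world z-axis, and the angular gap used to merge hull-vertex angles into sectors is an illustrative threshold.

```python
# Sketch of the visibility-fusion step: project visible samples onto the
# horizontal tangent plane, take the convex hull, convert its vertices to polar
# angles, and merge nearly contiguous angles into sectors.
# Assumptions: scipy's ConvexHull stands in for the hull step, n_t is the world
# z-axis, and the 15-degree merge gap is an illustrative threshold.
import numpy as np
from scipy.spatial import ConvexHull

def visibility_sectors(visible_pts, p_target,
                       n_t=np.array([0.0, 0.0, 1.0]), gap=np.deg2rad(15.0)):
    rel = visible_pts - p_target
    proj = visible_pts - (rel @ n_t)[:, None] * n_t        # projection formula above
    xy = (proj - p_target)[:, :2]                          # 2D coordinates in the plane
    hull = ConvexHull(xy)
    alpha = np.sort(np.arctan2(xy[hull.vertices, 1], xy[hull.vertices, 0]))
    sectors, start = [], alpha[0]
    for a_prev, a_next in zip(alpha[:-1], alpha[1:]):
        if a_next - a_prev > gap:                          # break between sectors
            sectors.append((start, a_prev))
            start = a_next
    sectors.append((start, alpha[-1]))
    return sectors                                         # [(alpha_start, alpha_end), ...]
```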

4.2.3. Adaptive Sampling Optimization

To enhance computational efficiency, this paper introduces an importance sampling strategy. We first reduce sampling density in obstacle-dense directions by defining a directional weighting function:
$$w(\phi, \theta) = \exp\!\left(\lambda\, d_{obs}(\phi, \theta)\right)$$

where $d_{obs}(\phi, \theta)$ denotes the Euclidean distance to the nearest obstacle in direction $(\phi, \theta)$, and $\lambda$ is the attenuation coefficient (default $\lambda = 0.1$). The sampling probability is adjusted as follows:

$$P(\phi_i, \theta_i) = \frac{w(\phi_i, \theta_i)}{\sum_{j=1}^{N} w(\phi_j, \theta_j)}$$
Sampling points are selected via roulette wheel selection to avoid inefficient resource allocation caused by uniform sampling.
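The sketch below implements this weighting and roulette-wheel (fitness-proportionate) selection, assuming the nearest-obstacle distance has already been computed for each candidate direction; the function name and random-generator handling are illustrative.

```python
# Sketch of the importance-sampling step: directional weights from the nearest-
# obstacle distance, normalized into probabilities, then roulette-wheel selection.
# Assumptions: d_obs is precomputed per candidate direction; lam follows the
# default in the text, and the function name is illustrative.
import numpy as np

def select_directions(d_obs, n_select, lam=0.1, rng=np.random.default_rng()):
    """d_obs: (N,) nearest-obstacle distances, one per candidate direction."""
    w = np.exp(lam * d_obs)               # directional weights (open directions favored)
    p = w / w.sum()                       # normalized sampling probabilities
    cdf = np.cumsum(p)
    picks = np.searchsorted(cdf, rng.random(n_select))   # roulette wheel selection
    return picks                          # indices of the selected directions
```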
The spherical-sampling-based visible region generation and viewpoint optimization algorithm is implemented in Algorithm 1.
Algorithm 1: Spherical-sampling-based visible region generation and viewpoint optimization
Input: target position $p_{target}$, obstacle set $O_{obs}$, maximum observation distance $d_{max}$, UAV count $m$, base sampling count $N_{base}$, density sensitivity coefficient $\eta$, grid map resolution $\Delta A_{grid}$.
Output: observation point collection $p_{obv}$.
1: Initialize the visible point set $V = \emptyset$;
2: Compute the adaptive sample count $N_s$;
3: FOR $i = 1$ to $N_s$ DO
4:   Generate sampling point $s_i$ on the sphere centered at $p_{target}$ via spherical parameterization;
5:   IF the ray $\overline{p_{target}\, s_i}$ is unobstructed THEN
6:     Add $s_i$ to $V$;
7:   END IF
8: END FOR
9: Project $V$ onto the tangent plane at $p_{target}$ to obtain the 2D point set $\{v_i\}_{i=1}^{M}$;
10: Compute the convex hull of $\{v_i\}$;
11: Merge the angular intervals of the convex hull vertices to generate the visible regions $\Omega_j$;
12: IF $m \le$ number of visible regions THEN
13:   Generate observation points at the centroids of $\Omega_j$ and store them in $p_{obv}$;
14: ELSE
15:   Use Particle Swarm Optimization (PSO) to allocate $m$ observation points within $\Omega_j$ and store them in $p_{obv}$;
16: END IF
17: RETURN $p_{obv}$;

4.3. Multi-UAV Trajectory Planning

Based on the multiple observation points generated in Section 4.2, this paper employs linear programming to assign each observation point to a specific UAV. The optimization objectives include: (1) minimizing the total path cost of all UAVs; (2) ensuring flight altitude compliance with safety constraints to avoid collisions with terrain or obstacles; (3) maintaining minimum safety distances between UAVs. Formally, the multi-UAV waypoint assignment objective is expressed as

$$J(X_{path}) = \sum_{i=1}^{m} \sum_{k} x_{ik} \left(w_1 F_{path} + w_2 F_{height}\right) + \eta\, \operatorname{Violation}(P)$$

$$F_{path} = \sum_{k=1}^{N-1} D\!\left(u_{i,k+1}, u_{i,k}\right)$$

$$F_{height} = \sum_{k=1}^{N} \max\!\left(0,\; h_{\min} - z_{i,k},\; z_{i,k} - h_{\max}\right)^2$$

$$D(u_i, u_j) \ge d_{safe}, \quad i \ne j$$

$$\text{s.t.} \quad \sum_{k=1}^{n} x_{ik} = 1 \;\; \forall i, \qquad \sum_{i=1}^{n} x_{ik} = 1 \;\; \forall k, \qquad x_{ik} \in \{0, 1\} \;\; \forall i, k$$

where $X_{path}$ is the path parameter matrix, $\eta$ denotes the constraint penalty coefficient, $w_1$ and $w_2$ are weighting parameters, $F_{path}$ represents the path length cost (i.e., the Euclidean distance between two points), and $F_{height}$ is the flight altitude cost; $x_{ik}$ constitutes the decision variable, where $x_{ik} = 1$ indicates that start point $i$ is paired with target point $k$, and $x_{ik} = 0$ otherwise. Equations (26)–(28) ensure a one-to-one correspondence between UAVs and observation points.
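Because the constraints enforce a one-to-one pairing, the assignment reduces to a linear assignment problem; the sketch below solves it with scipy's Hungarian solver over a combined path/height cost matrix, where the weights, the altitude limits, and the simplified per-pair costs standing in for $F_{path}$ and $F_{height}$ are assumptions.

```python
# Sketch of the UAV-to-viewpoint assignment: the one-to-one constraints above
# reduce to a linear assignment problem, solved here with the Hungarian method.
# Assumptions: the weights, altitude limits, and the simplified per-pair costs
# standing in for F_path and F_height are illustrative.
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_viewpoints(uav_pos, viewpoints, w1=1.0, w2=0.5, h_min=2.0, h_max=30.0):
    """uav_pos, viewpoints: (m, 3) arrays; returns (uav_index, viewpoint_index) pairs."""
    # Path cost: straight-line distance from each UAV to each candidate viewpoint.
    f_path = np.linalg.norm(uav_pos[:, None, :] - viewpoints[None, :, :], axis=2)
    # Height cost: squared violation of the [h_min, h_max] altitude band.
    z = viewpoints[:, 2]
    f_height = np.maximum(0.0, np.maximum(h_min - z, z - h_max)) ** 2
    cost = w1 * f_path + w2 * f_height[None, :]          # (m, m) cost matrix
    rows, cols = linear_sum_assignment(cost)             # enforces one-to-one pairing
    return list(zip(rows, cols))
```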
Subsequently, this paper employs a multi-UAV path planning algorithm based on the CPO to assign each UAV to a corresponding observation point. The multi-UAV path planning algorithm using CPO is presented in Algorithm 2.
Algorithm 2: Multi-UAV path planning via the crested porcupine optimizer
Input: start point set $\{p_i^{start}\}$, observation point set $\{p_i^{obv}\}$, threat model $O_{obs}$, weight vector $w$
Output: optimal path set $p$
1: Initialize the crested porcupine path parameter population $P = \{X_s\}_{s=1}^{S}$;
2: FOR iter = 1 to MaxIter DO
3:   FOR each individual $X_s$ in $P$ DO
4:     Compute path $P_s$ via Dubins($X_s$);
5:     Calculate cost: $J_s = \sum_{i=1}^{m} J(p_s^i)$;
6:     Apply constraint penalty: $J_s = J_s + \eta\, \operatorname{Violation}(P_s)$ ($\eta$: penalty coefficient);
7:   END FOR
8:   Determine leader: $X_{leader} = \arg\min_s J_s$;
9:   FOR each individual $X_s$ in $P$ DO
10:     IF rand() < $P_{explore}$ THEN
11:       Random exploration: $X_s = X_s + \alpha\,(X_{rand} - X_s)$ ($\alpha$: step size, $X_{rand}$: random individual);
12:     ELSE
13:       Threat-driven adjustment: $X_s = X_s + \beta\,(X_{leader} - X_s) + \gamma\, \Delta X_{threat}$ ($\beta$ and $\gamma$ are weights; $\Delta X_{threat}$ is derived from threat intelligence);
14:     END IF
15:   END FOR
16:   Update population $P$: combine the incumbent population and newly generated individuals;
17: END FOR
18: RETURN the leader solution $X_{leader}$ (converted to the path set $p$);
B-spline curves are subsequently employed to fit the optimized waypoints, ensuring kinematic feasibility:
$$p(\mu) = \sum_{i=0}^{n} c_i\, B_{i,k}(\mu), \quad \mu \in [0, 1]$$

where $B_{i,k}(\mu)$ denotes the $k$-th order B-spline basis function, and $c_i$ represents the control points.
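As a hedged sketch of this smoothing step, the snippet below fits a cubic B-spline through the optimized waypoints and samples it densely over $\mu \in [0, 1]$; scipy's splprep/splev routines and the sample count stand in for the paper's own fitting procedure.

```python
# Sketch of the final smoothing step: fit a cubic B-spline through the optimized
# waypoints and sample it over mu in [0, 1].
# Assumptions: scipy's splprep/splev and the sample count stand in for the
# paper's own B-spline fitting procedure.
import numpy as np
from scipy.interpolate import splprep, splev

def smooth_trajectory(waypoints, n_samples=200, degree=3, smoothing=0.0):
    """waypoints: (N, 3) array of optimized path points; returns (n_samples, 3)."""
    tck, _ = splprep(waypoints.T, k=degree, s=smoothing)   # knots and control points c_i
    mu = np.linspace(0.0, 1.0, n_samples)                  # curve parameter mu
    return np.stack(splev(mu, tck), axis=1)
```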

5. Experiments

5.1. Evaluation of Occlusion-Robust Target Tracking Methodology

This paper conducts algorithm training and model testing on the MDMT dataset []. The dataset contains data collected by two UAVs at varying flight altitudes and viewpoints, focusing on vehicle targets. MDMT comprises 44 video sequence pairs totaling 39,678 frames, with 11,454 distinct IDs (pedestrians, bicycles, and cars) and 2,204,620 bounding boxes—543,444 of which contain occluded targets. The dataset partitions these sequences into 25 pairs for training, 14 for testing, and 5 for validation. Although designed for multi-object tracking, MDMT also supports single-object tracking research. Crucially, its multi-perspective data provides rich information that enhances model capability in recognizing occluded targets, thereby supporting multi-UAV collaborative tracking tasks.
Experiments were conducted on Rocky Linux 8.9 with two NVIDIA A100-SXM4 GPUs (40GB VRAM each), manufactured by NVIDIA Corporation, headquartered in Santa Clara, California, USA. The network model was built using the PyTorch framework version 2.4.0. The input size of target templates was resized to 128 × 128 pixels, while the search region was resized to 256 × 256 pixels. Given the absence of abrupt motion changes in targets, the search region for the next frame was constrained to an area five times the bounding box size centered at the previous target position, without exceeding image boundaries. Training parameters were configured as follows: batch size 16, total epochs 500, learning rate 0.0001, weight decay 0.0001, and Adam optimization strategy.
The AP is adopted as the key metric to comprehensively evaluate the localization accuracy of the proposed model. This metric is computed as the arithmetic mean of precision values obtained at 100 different Intersection over Union (IoU) thresholds, ranging from 0.00 to 1.00 with a step size of 0.01. This approach mitigates the arbitrariness of selecting a single threshold and provides a more robust assessment of model performance across various levels of localization strictness. The AP is calculated as follows:
$$AP = \frac{1}{100} \sum_{i=1}^{100} P_i$$

where $P_i$ represents the precision at the $i$-th IoU threshold.
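A minimal sketch of this metric follows, assuming `ious` holds the per-frame IoU between predicted and ground-truth boxes and that precision at a threshold is the fraction of frames whose IoU meets it; the 100 thresholds are taken as 0.00 through 0.99 in 0.01 steps.

```python
# Sketch of the AP metric defined above.
# Assumptions: `ious` holds per-frame IoU between predicted and ground-truth
# boxes, precision at a threshold is the fraction of frames meeting it, and the
# 100 thresholds are 0.00 through 0.99 in 0.01 steps.
import numpy as np

def average_precision(ious):
    thresholds = np.arange(0.00, 1.00, 0.01)                  # 100 IoU thresholds
    precisions = [(ious >= t).mean() for t in thresholds]     # P_i per threshold
    return float(np.mean(precisions))
```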
Using our proposed model as the baseline, we designed five ablation studies. We first replaced the backbone network with ResNet18 and ResNet50 for comparison (Table 1). The results show that the Mamba backbone adopted in this work significantly improves accuracy by 5.0 percentage points compared to ResNet18, while maintaining higher processing speed. When compared to ResNet50, it achieves a slight accuracy improvement of 0.3 percentage points, along with a remarkable 171.4% increase in FPS. These experimental results demonstrate the advantage of our Mamba backbone in modeling long-range spatial dependencies.
Table 1. Ablation results of backbone networks.
According to the ablation results shown in Table 2, replacing the proposed DWRFEM with standard DWConv in the VSS Block leads to a noticeable performance-efficiency trade-off. While DWConv achieves higher computational efficiency at 103 FPS, it results in a reduction of 3.0 percentage points in AP. These results demonstrate the effectiveness of the DWRFEM in enhancing feature representation capabilities. Although introducing additional computational overhead, the module improves tracking accuracy by capturing features at different spatial granularities and expanding the receptive field through wavelet transformation.
Table 2. Ablation results of the VSS block.
According to the ablation experimental results of the MixPooling SPPF shown in Table 3, the model achieved an average precision of 82.4% when the SPPF module was removed (None). After introducing the hybrid pooling SPPF module, the precision increased by 1.7 percentage points. The results indicate that the multi-scale feature extraction path constructed by SPPF enhances the model’s robustness in handling targets with scale variations, demonstrating the effectiveness of this module in improving feature representation capabilities.
Table 3. Ablation results of the mixpooling SPPF.
According to the ablation experimental results of activation functions shown in Table 4, adopting the DiTAC activation function led to a 1.2 percentage point increase in AP compared to using the Hardswish activation function. The results indicate that, compared to Hardswish, DiTAC more effectively models multi-scale feature dependencies, enhances the model’s adaptability to complex feature distributions, and ultimately improves both feature representation capability and generalization performance while maintaining computational efficiency.
Table 4. Ablation results of activation functions.
According to the ablation experimental results of the bottleneck layer architecture shown in Table 5, adopting the KAN module led to a 0.9 percentage point improvement compared to the traditional MLP structure. This enhancement can be attributed to KAN’s learnable activation function design, which enables adaptive fitting to data distributions and strengthens the model’s nonlinear representation capabilities.
Table 5. Ablation results of bottleneck architectures.
Deep learning-based object detection and tracking models often suffer from high parameter counts and insufficient real-time performance, particularly on mobile devices such as UAVs. For comparative experiments, this paper exclusively considers algorithms with compact model sizes and excellent real-time capability. Evaluation results on the MDMT validation set (Figure 9) demonstrate that our method achieves higher success rates than DiMP18 (DiMP with ResNet18 backbone), though marginally underperforming the E.T.Tracker, TransT-N2 and LoRAT-L-224.
Figure 9. Threshold-precision curve on MDMT test set.
Comparative experimental results are presented in Figure 10. Compared to current mainstream real-time tracking methods, the proposed approach does not achieve the best score on any single metric, but it exhibits an outstanding overall trade-off between accuracy and speed. Specifically, compared to the E.T.Tracker method, our approach achieves a 111% frame rate improvement with only a 0.7% loss in AP, increasing the processing speed significantly from 45 FPS to approximately 95 FPS. When compared to Transformer-based TransT methods, the Mamba architecture adopted in our method effectively avoids the quadratic computational complexity of Transformers, providing substantial advantages in computational efficiency. Although there is a slight deficiency in detection accuracy, the significantly improved inference speed makes it particularly suitable for deployment on resource-constrained edge devices. This balance between accuracy and efficiency holds important practical value for edge computing applications.
Figure 10. Comparative Experimental Results.
To evaluate the model’s robustness, we designed an occlusion experiment (Table 6). The results show that our model achieved an AP of 82.9%, which is 1.7 percentage points lower than the top-performing LoRAT-L-224 model but significantly outperforms traditional methods such as KCF and DiMP18. This outcome demonstrates that our model exhibits strong robustness in handling occlusion challenges.
Table 6. Model performance comparison under occlusion.
To validate the real-time performance and computational resource utilization of the proposed method on edge devices, a Sophgo SE5 edge computing box was deployed at a DJI Airport. The SE5 features an octa-core ARM A53 processor operating at 2.3 GHz, 12 GB RAM, 32 GB eMMC storage, and supports simultaneous 16-channel HD video decoding with intelligent analysis. Additional capabilities include hardware decoding for 38-channel 1080p HD video and 2-channel encoding. The model was first converted to the bmodel format required by Sophgo edge devices, then deployed for object detection and tracking. Resource utilization metrics (Table 7) show our method achieves 39 FPS with 32.8% TPU utilization, demonstrating real-time operation with low computational overhead on the SE5 platform.
Table 7. Resource Utilization.
To further validate the method’s effectiveness in real-world complex scenarios, field tests were conducted using cameras mounted on DJI Dock UAV platforms in low-altitude environments with dense vegetation. During missions, HD video footage captured by UAVs was transmitted in real-time to the dock base station. The dock system then transferred the video stream via internal networks to the deployed Sophgo SE5 edge computing box, which executed the proposed method for real-time processing. Experimental results (Figure 11) demonstrate that our algorithm maintains robust target tracking even in heavily occluded environments with dense foliage cover.
Figure 11. Flight experiments.
To further validate detection and tracking performance under partial occlusion, qualitative comparative experiments were conducted as shown in Figure 12. The KCF algorithm fails to track targets when occlusion exceeds 50%, whereas our method, E.T.Tracker, and DiMP18 maintain stable tracking under such conditions.
Figure 12. Qualitative comparative results. The red boxes in the figure indicate the tracking detection results.

5.2. Evaluation of Multi-UAV Collaborative Tracking Methodology

To validate the efficacy of the proposed multi-UAV collaborative target localization method, three UAVs were statically deployed within a two-dimensional plane spanning −6 m to 6 m along both X- and Y-axes. Each UAV is equipped with GPS modules providing μs-level timestamp synchronization through PPS signals, ensuring coordinated perception across the swarm. An unmanned ground vehicle (UGV) traversing linearly from start to endpoint served as the target. Experimental results (Figure 13) indicate high congruence between predicted and ground-truth trajectories despite minor deviations, confirming substantial agreement between estimated and actual target positions. Further localization error analysis (Figure 14) quantifies discrepancies between collaborative positioning results and ground-truth locations. The error distribution demonstrates that 94.7% of measurements remain below 0.08 m, with sporadic peaks not exceeding 0.10 m. Critically, over 98% of errors are constrained within 10 cm, validating the effectiveness of our multi-UAV collaborative localization framework.
Figure 13. Multi-UAV collaborative target localization results.
Figure 14. Multi-UAV collaborative target localization error.
To validate the performance of the proposed multi-UAV collaborative tracking method, experiments were conducted in complex obstacle-rich scenarios using a moving ground vehicle as the target. Three UAVs were deployed to collaboratively track the vehicle. Figure 15 illustrates positional relationships between the target vehicle and observation UAVs at multiple timestamps, clearly showing light-blue occlusion-free zones and multi-angle coverage formed around the target. Figure 16 presents complete trajectories of both target and UAVs during tracking, while Figure 17 displays actual scene imagery. Results demonstrate that our method effectively overcomes field-of-view limitations of single UAVs in obstructed environments, ensuring continuous target tracking. The obstacle-avoiding trajectories dynamically converging toward target areas confirm the effectiveness and robustness of the collaborative tracking framework in complex scenarios.
Figure 15. Multi-UAV collaborative tracking results.
Figure 16. Multi-UAV collaborative tracking trajectories.
Figure 17. Multi-UAV collaborative tracking flight experiments. The red boxes represent the positions of the UAVs.
Quantitative evaluation further validates these advantages. As shown in Table 8, our method achieves a tracking loss rate of only 3.7% and an average positioning error of 0.03 m in such challenging environments, demonstrating both high tracking reliability and exceptional positioning accuracy.
Table 8. Quantitative evaluation of Multi-UAV Collaborative Tracking.
We compared the optimization performance of CPO and PSO in a multi-UAV cooperative tracking task (Figure 18). The experimental results indicate that PSO exhibits a rapid decline in cost during the initial iterations but enters a plateau around the 400th iteration, eventually converging to a relatively high cost level. In contrast, although CPO shows slower convergence in the early stage, it continues to optimize under the guidance of safety constraints and surpasses PSO in the later iterations, achieving a lower cost value. These results demonstrate that CPO can effectively escape local minima by leveraging constraint mechanisms and find a solution closer to the global optimum, thereby validating its comprehensive advantage in balancing safety and performance in complex multi-UAV cooperative tracking tasks.
Figure 18. Comparison of optimization algorithms in Multi-UAV collaborative tracking.
To evaluate the real-time performance of the algorithms, we conducted a quantitative comparative experiment on running time. The experiment selected 10 different routes and performed tracking control at 100 ms intervals. The average running time of both CPO and PSO algorithms on each route was calculated (Table 9). The experimental results indicate that the average running time of CPO is slightly higher than that of PSO, yet both remain within 100 ms, demonstrating satisfactory real-time computational capability. In complex environments with obstacles, this trade-off of slightly longer computation for a lower-cost, higher-quality path is justified.
Table 9. Comparison of average running times.

6. Conclusions

In this work, we propose an occlusion-aware multi-UAV collaborative target tracking framework for low-altitude dense obstacle environments. First, an occlusion-robust tracking method based on a Mamba backbone network is developed, integrating the Dilated Wavelet Receptive Field Enhancement Module (DWRFEM) and a dual-branch feature refinement framework. This enhances local feature extraction and multi-scale contextual fusion capabilities, effectively addressing discontinuous contours and feature degradation under severe occlusion. Additionally, a multi-UAV collaborative system is constructed to achieve cooperative target localization through multi-UAV ray intersection, generate occlusion-free viewpoints via obstacle-density-adaptive spherical sampling, and optimize flight trajectories using the CPO. Experimental results on the MDMT dataset demonstrate that the proposed tracker achieves a high processing speed while maintaining competitive accuracy, attaining a superior balance between speed and precision compared to other models, which makes it well-suited for edge deployment. Field tests confirm that the multi-UAV strategy successfully mitigates detection failures and tracking losses in complex low-altitude environments, leading to enhanced overall system reliability.
Despite the promising results, this work has certain limitations that warrant discussion. Our multi-UAV cooperative framework currently relies on stable and low-latency communication links for effective coordination. While this assumption holds in our controlled experimental setup with proximal devices, practical deployments in vast or complex environments may suffer from packet loss, significant delays, or intermittent interruptions due to obstacles and electromagnetic interference. These real-world communication challenges could degrade system synchronization and coordination performance. Additionally, the trajectory planning module operates based on a pre-perceived or known static obstacle density map. Consequently, the algorithm’s adaptability in highly dynamic environments—where obstacles may move unpredictably (e.g., vehicles, birds, or other UAVs)—remains a challenge. The current approach may not react sufficiently fast to sudden environmental changes, potentially leading to suboptimal or unsafe paths.
Future research will focus on enhancing communication robustness through delay-tolerant networking protocols and developing real-time dynamic obstacle prediction mechanisms to improve planning reactivity and safety in fully unknown and dynamic scenarios.

Author Contributions

Conceptualization, Y.A., R.L., C.X. and X.L.; methodology, C.X.; software, R.L.; validation, Y.A., R.L. and C.X.; formal analysis, Y.A.; investigation, C.X.; resources, Y.A.; data curation, R.L.; writing—original draft preparation, R.L. and X.L.; writing—review and editing, Y.A., R.L. and C.X.; visualization, R.L.; supervision, Y.A.; project administration, Y.A.; funding acquisition, Y.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Science and Technology Program of Hunan Province, grant number 2024JK2083 and Xiangjiang Laboratory Major Projects, grant number 24XJ01002.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. San, C.T.; Kakani, V. Smart Precision Weeding in Agriculture Using 5IR Technologies. Electronics 2025, 14, 2517. [Google Scholar] [CrossRef]
  2. Li, J.; Hua, Y.; Xue, M. MSO-DETR: A Lightweight Detection Transformer Model for Small Object Detection in Maritime Search and Rescue. Electronics 2025, 14, 2327. [Google Scholar] [CrossRef]
  3. Ouyang, Y.; Liu, W.; Yang, Q.; Mao, X.; Li, F. Trust Based Task Offloading Scheme in UAV-Enhanced Edge Computing Network. Peer-to-Peer Netw. Appl. 2021, 14, 3268–3290. [Google Scholar] [CrossRef]
  4. Dong, L.; Liu, Z.; Jiang, F.; Wang, K. Joint Optimization of Deployment and Trajectory in UAV and IRS-Assisted IoT Data Collection System. IEEE Internet Things J. 2022, 9, 21583–21593. [Google Scholar] [CrossRef]
  5. Jiang, F.; Peng, Y.; Wang, K.; Dong, L.; Yang, K. MARS: A DRL-Based Multi-Task Resource Scheduling Framework for UAV with IRS-Assisted Mobile Edge Computing System. IEEE Trans. Cloud Comput. 2023, 11, 3700–3712. [Google Scholar] [CrossRef]
  6. Jiang, F.; Wang, K.; Dong, L.; Pan, C.; Xu, W.; Yang, K. AI Driven Heterogeneous MEC System with UAV Assistance for Dynamic Environment: Challenges and Solutions. IEEE Netw. 2020, 35, 400–408. [Google Scholar] [CrossRef]
  7. Chen, G.; Zhu, P.; Cao, B.; Wang, X.; Hu, Q. Cross-Drone Transformer Network for Robust Single Object Tracking. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 4552–4563. [Google Scholar] [CrossRef]
  8. Yeom, S. Thermal Image Tracking for Search and Rescue Missions with a Drone. Drones 2024, 8, 53. [Google Scholar] [CrossRef]
  9. Chinthi-Reddy, S.R.; Lim, S.; Choi, G.S.; Chae, J.; Pu, C. DarkSky: Privacy-Preserving Target Tracking Strategies Using a Flying Drone. Veh. Commun. 2022, 35, 100459. [Google Scholar] [CrossRef]
  10. Wang, K.; Yu, X.; Yu, W.; Li, G.; Lan, X.; Ye, Q.; Jiao, J.; Han, Z. ClickTrack: Towards Real-Time Interactive Single Object Tracking. Pattern Recogn. 2025, 161, 111211. [Google Scholar] [CrossRef]
  11. Lopez-Sanchez, I.; Moreno-Valenzuela, J. PID Control of Quadrotor UAVs: A Survey. Annu. Rev. Control. 2023, 56, 100900. [Google Scholar] [CrossRef]
  12. Song, Y.; Scaramuzza, D. Policy Search for Model Predictive Control with Application to Agile Drone Flight. IEEE Trans. Robot. 2022, 38, 2114–2130. [Google Scholar] [CrossRef]
  13. Saleem, O.; Kazim, M.; Iqbal, J. Robust Position Control of VTOL UAVs Using a Linear Quadratic Rate-Varying Integral Tracker: Design and Validation. Drones 2025, 9, 73. [Google Scholar] [CrossRef]
  14. Dang, Z.; Sun, X.; Sun, B.; Guo, R.; Li, C. OMCTrack: Integrating Occlusion Perception and Motion Compensation for UAV Multi-Object Tracking. Drones 2024, 8, 480. [Google Scholar] [CrossRef]
  15. Chang, Y.; Zhou, H.; Wang, X.; Shen, L.; Hu, T. Cross-Drone Binocular Coordination for Ground Moving Target Tracking in Occlusion-Rich Scenarios. IEEE Robot. Autom. Lett. 2020, 5, 3161–3168. [Google Scholar] [CrossRef]
  16. Hansen, J.G.; de Figueiredo, R.P. Active Object Detection and Tracking Using Gimbal Mechanisms for Autonomous Drone Applications. Drones 2024, 8, 55. [Google Scholar] [CrossRef]
  17. Meibodi, F.A.; Alijani, S.; Najjaran, H. A Deep Dive into Generic Object Tracking: A Survey. arXiv 2025, arXiv:2507.23251. [Google Scholar] [CrossRef]
  18. Bolme, D.S.; Beveridge, J.R.; Draper, B.A.; Lui, Y.M. Visual Object Tracking Using Adaptive Correlation Filters. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 2544–2550. [Google Scholar]
  19. Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. Exploiting the Circulant Structure of Tracking-by-Detection with Kernels. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 702–715. [Google Scholar]
  20. Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-Speed Tracking with Kernelized Correlation Filters. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 583–596. [Google Scholar] [CrossRef]
  21. Yang, S. A Novel Study on Deep Learning Framework to Predict and Analyze the Financial Time Series Information. Future Gener. Comput. Syst. 2021, 125, 812–819. [Google Scholar] [CrossRef]
  22. Zhang, P.; Liu, X.; Li, W.; Yu, X. Pharmaceutical Cold Chain Management Based on Blockchain and Deep Learning. J. Internet Technol. 2021, 22, 1531–1542. [Google Scholar] [CrossRef]
  23. Shi, D.; Zheng, H. A Mortality Risk Assessment Approach on ICU Patients Clinical Medication Events Using Deep Learning. Comput. Model. Eng. Sci. 2021, 128, 161–181. [Google Scholar] [CrossRef]
  24. Tong, Y.; Sun, W. The Role of Film and Television Big Data in Real-Time Image Detection and Processing in the Internet of Things Era. J. Real-Time Image Process. 2021, 18, 1115–1127. [Google Scholar] [CrossRef]
  25. Zhou, W.; Zhao, Y.; Chen, W.; Liu, Y.; Yang, R.; Liu, Z. Research on Investment Portfolio Model Based on Neural Network and Genetic Algorithm in Big Data Era. EURASIP J. Wirel. Commun. Netw. 2020, 2020, 228. [Google Scholar] [CrossRef]
  26. Zeng, Y.; Ouyang, S.; Zhu, T.; Li, C. E-Commerce Network Security Based on Big Data in Cloud Computing Environment. Mob. Inf. Syst. 2022, 2022, 9935244. [Google Scholar] [CrossRef]
  27. Tao, R.; Gavves, E.; Smeulders, A.W. Siamese Instance Search for Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1420–1429. [Google Scholar]
  28. Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High Performance Visual Tracking with Siamese Region Proposal Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8971–8980. [Google Scholar]
  29. Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 8126–8135. [Google Scholar]
  30. Lin, L.; Fan, H.; Zhang, Z.; Wang, Y.; Xu, Y.; Ling, H. Tracking Meets Lora: Faster Training, Larger Model, Stronger Performance. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 300–318. [Google Scholar]
  31. Kopyt, A.; Narkiewicz, J.; Radziszewski, P. An Unmanned Aerial Vehicle Optimal Selection Methodology for Object Tracking. Adv. Mech. Eng. 2018, 10, 1–12. [Google Scholar] [CrossRef]
  32. Gong, K.; Cao, Z.; Xiao, Y.; Fang, Z. Abrupt-Motion-Aware Lightweight Visual Tracking for Unmanned Aerial Vehicles. Vis. Comput. 2021, 37, 371–383. [Google Scholar] [CrossRef]
  33. Lin, C.; Zhang, W.; Shi, J. Tracking Strategy of Unmanned Aerial Vehicle for Tracking Moving Target. Int. J. Control Autom. Syst. 2021, 19, 2183–2194. [Google Scholar] [CrossRef]
  34. Lee, K.; Chang, H.J.; Choi, J.; Heo, B.; Leonardis, A.; Choi, J.Y. Motion-Aware Ensemble of Three-Mode Trackers for Unmanned Aerial Vehicles. Mach. Vis. Appl. 2021, 32, 54. [Google Scholar] [CrossRef]
  35. Campos-Martínez, S.-N.; Hernández-González, O.; Guerrero-Sánchez, M.-E.; Valencia-Palomo, G.; Targui, B.; López-Estrada, F.-R. Consensus Tracking Control of Multiple Unmanned Aerial Vehicles Subject to Distinct Unknown Delays. Machines 2024, 12, 337. [Google Scholar] [CrossRef]
  36. Zhou, Z.; Hu, J.; Chen, B.; Shen, X.; Meng, B. Target Tracking and Circumnavigation Control for Multi-Unmanned Aerial Vehicle Systems Using Bearing Measurements. Actuators 2024, 13, 323. [Google Scholar] [CrossRef]
  37. Zhang, C.; Wang, Y.; Zheng, W. Multi-UAVs Tracking Non-Cooperative Target Using Constrained Iterative Linear Quadratic Gaussian. Drones 2024, 8, 326. [Google Scholar] [CrossRef]
  38. Upadhyay, J.; Rawat, A.; Deb, D. Multiple Drone Navigation and Formation Using Selective Target Tracking-Based Computer Vision. Electronics 2021, 10, 2125. [Google Scholar] [CrossRef]
  39. Liu, Y.; Li, X.; Wang, J.; Wei, F.; Yang, J. Reinforcement-Learning-Based Multi-Uav Cooperative Search for Moving Targets in 3D Scenarios. Drones 2024, 8, 378. [Google Scholar] [CrossRef]
  40. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. Vmamba: Visual State Space Model. Adv. Neural Inf. Process. Syst. 2024, 37, 103031–103063. [Google Scholar]
  41. Finder, S.E.; Amoyal, R.; Treister, E.; Freifeld, O. Wavelet Convolutions for Large Receptive Fields. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 363–380. [Google Scholar]
  42. Zhong, S.; Wen, W.; Qin, J. Mix-Pooling Strategy for Attention Mechanism. arXiv 2022, arXiv:2208.10322. [Google Scholar] [CrossRef]
  43. Kang, B.; Chen, X.; Wang, D.; Peng, H.; Lu, H. Exploring Lightweight Hierarchical Vision Transformers for Efficient Visual Tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 9612–9621. [Google Scholar]
  44. Chelly, I.; Finder, S.E.; Ifergane, S.; Freifeld, O. Trainable Highly-Expressive Activation Functions. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 200–217. [Google Scholar]
  45. Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Hou, T.Y.; Tegmark, M. Kan: Kolmogorov-Arnold Networks. arXiv 2024, arXiv:2404.19756. [Google Scholar]
  46. Liu, Z.; Shang, Y.; Li, T.; Chen, G.; Wang, Y.; Hu, Q.; Zhu, P. Robust Multi-Drone Multi-Target Tracking to Resolve Target Occlusion: A Benchmark. IEEE Trans. Multimed. 2023, 25, 1462–1476. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
