Article

Identification and Association of Multiple Visually Identical Targets for Air–Ground Cooperative Systems

College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Drones 2025, 9(9), 612; https://doi.org/10.3390/drones9090612
Submission received: 20 June 2025 / Revised: 18 August 2025 / Accepted: 26 August 2025 / Published: 30 August 2025

Abstract

In air–ground cooperative systems, identifying the identities of unmanned ground vehicles (UGVs) from an unmanned aerial vehicle (UAV) perspective is a critical step for downstream tasks. Traditional approaches that attach markers such as AprilTags to UGVs fail under low-resolution or occlusion conditions, and visually identical UGVs are hard to distinguish by their similar visual features. This paper proposes a markerless method that associates UGV onboard sensor data with UAV visual detections to achieve identification. Our approach employs a Dempster–Shafer fusion methodology integrating two proposed complementary association techniques: a projection-based method exploiting sequential motion patterns through reprojection error validation, and a topology-based method constructing distinctive topologies from positional and orientation data. The association process is further integrated into a multi-object tracking framework to reduce ID switches during occlusions. Experiments demonstrate that under low-noise conditions, the projection-based and topology-based methods achieve association precisions of 89.5% and 87.6%, respectively, outperforming previous methods. The fused approach enables robust association at 79.9% precision under high-noise conditions, nearly 10% higher than either method alone. Under false detection scenarios, our method effectively excludes false positives, and the integrated tracking process effectively mitigates occlusion-induced ID switches.

1. Introduction

With the rapid development of autonomous perception and planning technologies for unmanned platforms, these systems are playing increasingly vital roles in military and civilian domains [1,2,3]. Specifically, through collaborative modes involving multiple UGVs, multiple UAVs, and air–ground cooperation, complex tasks traditionally executed by single unmanned systems can be decoupled into more modular and unitized subtasks [4,5,6]. This approach not only significantly reduces task execution complexity but also enhances system flexibility and robustness.
Current mainstream air–ground collaboration includes various forms such as single UAV cooperating with a UGV swarm [6,7], as well as multi-UAV systems performing collaborative tasks with ground vehicles [8], leveraging the UAVs’ wide field of view to assist ground vehicles in downstream tasks. Due to advancements in modern target detection algorithms, real-time detection of UGV targets within the UAV’s field of view has become feasible and efficient.
However, beyond detection, this configuration faces another challenge: identifying the identities of the UGVs within the UAV’s visual field when they share identical or highly similar physical appearances [9]. While visual homogeneity among UGVs facilitates manufacturing and operational standardization, it complicates individual distinction from aerial perspectives. Most critically, this identification ambiguity can lead to downstream task failures, where incorrect target identification may result in unexpected incidents.
To address this challenge and establish correspondence between UGV targets in the UAV’s view and their identity numbers, a straightforward solution involves attaching AprilTags to UGVs for direct identification by number. For instance, installing AprilTags [10,11] on vehicle roofs enables the UAV to decode vehicle numbers through polygon analysis of edge features. However, this method faces limitations during high-altitude UAV operations due to insufficient resolution for AprilTag recognition and occlusion-induced identification failures during UGV movement [12], resulting in unstable vehicle identification. Overall, explicit markers like AprilTags impose constraints on swarm control systems. Additionally, conventional multi-target tracking algorithms often rely on inter-frame appearance similarity for association [13], achieving stable tracking performance; however, such approaches yield minimal improvement when tracking visually homogeneous UGVs.
To address the limitations of traditional methods and the challenges posed by visually identical UGVs, we reformulate the identification problem in UAV imagery as a cross-coordinate association task, as illustrated in Figure 1. In an air–ground cooperative system, onboard sensor data (position, orientation) from the UGV swarm are accessible. By correlating these transmitted sensor data with the visual cues of UGVs detected in the UAV’s image coordinate system, we resolve the identification of appearance-identical vehicles despite their visual homogeneity.
On the one hand, we propose a method to construct topologies in both world and image coordinate systems using target positions and angles, employing the Hungarian algorithm [14] for node matching to obtain single-frame association results. On the other hand, inspired by UAV landmark localization methods [7], we introduce a projection-based method that constructs world-image coordinate pairs from positional data, computes projection matrices for all possible associations, and evaluates re-projection errors in subsequent frames to determine sequential associations.
In the end, we achieve enhanced robustness by calculating confidence matrices for both methods and applying Dempster–Shafer theory for decision-level fusion [15]. Furthermore, we integrate our association algorithm into existing two-stage tracking frameworks, enabling robust tracking without relying on appearance features, and reducing ID switch occurrences during occlusions.
Overall, the main contributions of this paper are as follows:
  • A markerless identification framework that correlates UGV sensor data with UAV visual detection results, enabling reliable distinction of visually identical UGVs without physical modifications.
  • A decision-level method to integrate the association results and achieve improved accuracy and noise robustness, with evaluation under comprehensive simulations with diverse motion patterns and noise scenarios.
  • An enhanced multi-object tracking architecture that reduces ID-switch rates utilizing the above sensor-visual association results, which remains effective in occlusion scenarios.
The structure of this paper is organized as follows: Section 2 reviews related work, analyzing the research focus and limitations of existing methods to highlight the necessity and advantages of our approach. Section 3 provides a detailed exposition of the proposed association methods. Section 4 evaluates the proposed method through comprehensive simulations under varying conditions, with comparative analysis against existing approaches, followed by validation in physical experiments. Section 5 concludes the paper and discusses potential future directions.

2. Related Work

2.1. Target Identification

In the context of UAV-based target identification, conventional approaches predominantly rely on visual discriminability. Foundational works in object detection, including the YOLO series [16] and Mask R-CNN [17], established CNN-based frameworks for appearance feature extraction, while subsequent advances in classification networks [18,19,20] enhanced discriminative capabilities through hierarchical feature learning. Further developments in fine-grained classification networks [21] made target discrimination based on subtle visual distinctions feasible. However, these methodologies encounter inherent limitations when processing visually identical targets, a fundamental challenge that our air–ground cooperative system specifically addresses.
For marker-based solutions, fiducial marker systems like AprilTag [10] have been widely adopted for robotic identification. AprilTag 2 [11] significantly advanced the paradigm by optimizing the detector’s computational efficiency while achieving higher detection rates through adaptive thresholding and improved quad decoding. Ref. [12] proposed ArUco, which specifically addressed occlusion challenges by introducing error-correcting codes and partial marker recognition, enabling moderate occlusion robustness. Ref. [22] proposed a nested AprilTag configuration with different sizes to maintain visibility during UAV descent, along with an HSV-based image preprocessing method to mitigate shadow occlusion effects. However, these improvements remain ineffective under low-resolution conditions, a fundamental limitation for aerial observation scenarios. This resolution dependency further motivates our markerless approach leveraging onboard sensor information.
In summary, conventional approaches relying on visual discriminability and marker-based systems have advanced target identification, but both paradigms face inherent limitations when applied to visually identical UGVs in aerial scenarios.

2.2. Multi-Target Association

Multi-target association across different frames within the same image coordinate system has been a key problem in Multi-Object Tracking (MOT) research. SORT [23] integrated Kalman filtering with the Hungarian algorithm, achieving real-time tracking and establishing the Detection-Based Tracking (DBT) paradigm as a dominant approach. DeepSORT [13] introduced an appearance-feature cosine distance to mitigate missed detections by leveraging appearance cues within bounding boxes. Subsequent studies, such as BoT-SORT [24], StrongSORT [25], and BYTETrack [26], further refined the association pipeline within the DeepSORT framework. Recent breakthroughs in end-to-end association methods include FairMOT [27], which employs an anchor-free architecture to jointly learn object detection and re-identification features, achieving superior inter-frame multi-target association. TransTrack [28] adopts the DETR [29] paradigm, maintaining a set of trajectory queries to encode historical trajectory information while training learnable query vectors; through cross-attention with current-frame features, it achieves target association. Fundamentally, however, these MOT approaches depend heavily on appearance distinctiveness, rendering them unsuitable for visually identical targets, a challenge emphasized by the DanceTrack dataset [30], which focuses on tracking uniformly appearing objects with diverse motions. To address this, the C-BIoU tracker [31] uses a buffered IoU to expand the matching space for non-overlapping detections caused by irregular motions and employs cascaded matching to avoid over-expansion, significantly improving the association of such targets. However, the C-BIoU tracker may not be robust to noisy detections.
Beyond single-camera target association between frames, multi-target association across different image coordinate systems has also garnered significant attention. As in single-camera association, many studies leverage appearance features for cross-camera target matching. Ref. [32] employed a Siamese network to process vehicle license plates and body features, deriving similarity metrics for cross-camera vehicle association. Ref. [33] proposed a distributed multi-camera association method, utilizing the Hungarian algorithm for cross-camera association while incorporating appearance constraints and spatial overlap for data association.
In addition to appearance features, topological features have been widely explored. Ref. [34] pioneered data association based on target topology, eliminating reliance on a unified coordinate system. Ref. [35] addressed trajectory association by constructing triangular topological relationships. Building on this, ref. [36] introduced Delaunay triangulation to derive non-empirical features, combining a triangular topology similarity (TTS) method with a globally consistent two-step association strategy to resolve multi-target association in multi-UAV scenarios. Furthermore, ref. [37] proposed a topological sequence association method based on the one-dimensional position distribution of multiple targets, characterizing visual sensors in air–ground systems and applying it to such systems for the first time. However, existing topology-based solutions fail to utilize potential orientation and temporal information.
Further advancements combine topological and appearance features. In ref. [9], vehicle association was treated as a graph matching problem without prior knowledge of UAV positions. High-confidence targets were selected as graph nodes, with spatial topological relationships serving as edges; a graph feature network fused appearance and spatial features and performed local subgraph matching before extending local matches to all targets. Ref. [38] introduced a multi-pedestrian association method for air–ground cooperative cameras. It simultaneously leverages appearance features for temporal tracking and spatial topology for cross-view matching. The core innovation lies in geometry-based spatial distribution matching, representing pedestrians as normalized position-depth vectors and integrating features through mixed-integer programming.
In summary, prior research on cross-coordinate association has predominantly focused on the extraction and matching of appearance features, which become ineffective in scenarios involving visually identical targets. Furthermore, existing topology-based association methods primarily rely on positional information while neglecting orientation cues. Additionally, most conventional matching approaches operate on a single-frame basis, failing to leverage sequential information. These critical gaps, specifically the lack of integration between orientation awareness and sequential motion patterns, and the failure to bridge UGV sensor data with UAV visual detections, have prevented the development of a unified framework for identifying visually identical targets in air–ground cooperative systems. To address these limitations, this paper proposes a series of novel methods that incorporate sequential information and orientation awareness, representing the first attempt to integrate these complementary cues for robust association in air–ground scenarios.

3. Methods

3.1. Overall Framework

This paper proposes a unified association framework that fuses projection-based and topology-based methods to generate robust association results. As illustrated in Figure 2, the projection-based approach primarily leverages sequential motion information from UGV sensor data, while the topology-based method emphasizes relative positioning and orientation relationships among targets within a single frame. A Dempster–Shafer fusion module integrates both association confidences at the decision level. The framework further enhances the original MOT algorithm based on our proposed projection-based method, effectively reducing ID switches in occlusion scenarios.

3.2. Projection-Based Association Method

In landmark-based UAV localization research, projection matrices between the UAV’s image coordinate system and the world coordinate system are typically computed from multiple world-image coordinate pairs of ground targets to achieve localization. Inspired by this approach, this study employs advanced object detection and tracking algorithms to continuously acquire sequential position coordinates of UGVs in images. These image coordinates are paired with the relative coordinates fed back by the positioning sensors. It should be noted that coordinate pairs are formed under every possible association hypothesis.
Adopting the least-squares method, a projection matrix is computed for each possible set of coordinate pairs; these matrices are then used to reproject future sensor-returned relative coordinates into the image coordinate system. By comparing the reprojected image coordinates produced by the different projection matrices against the actual image coordinates in future frames, the association result with the minimum reprojection error is taken as the correct output.
Figure 3 shows the overall workflow of this projection-based method. Assume that in time frame $t$, the image coordinates of the $i$-th UGV are $y_i^t = (u_i, v_i)$. Let the set $\{y_i^t\}_{i=1}^{n_{im}}$ represent the collection of image coordinates of all $n_{im}$ UGVs in time frame $t$. Similarly, $\{x_i^t\}_{i=1}^{n_{rl}}$ represents the set of relative coordinates of all $n_{rl}$ UGVs at time $t$. The relative coordinates of the $i$-th unmanned vehicle, $x_i^t = (a_i, b_i, c_i)$, are obtained by subtracting the world coordinates of the UAV from the world coordinates of the unmanned vehicle.
Subsequently, for each frame, we generate all $A_{n_{rl}}^{n_{im}}$ possible association results as sets of relative-image coordinate pairs $\{X_i^t, Y_i^t\}_{i=1}^{A_{n_{rl}}^{n_{im}}}$. For example, when $n_{rl} = 3$ and $n_{im} = 2$, there are $A_3^2 = 6$ possible association results, as presented in Equation (1).

$$\begin{aligned}
(X_1^t, Y_1^t) &= ((x_1^t, x_2^t),\ (y_1^t, y_2^t)) \\
(X_2^t, Y_2^t) &= ((x_2^t, x_1^t),\ (y_1^t, y_2^t)) \\
(X_3^t, Y_3^t) &= ((x_2^t, x_3^t),\ (y_1^t, y_2^t)) \\
(X_4^t, Y_4^t) &= ((x_3^t, x_2^t),\ (y_1^t, y_2^t)) \\
(X_5^t, Y_5^t) &= ((x_1^t, x_3^t),\ (y_1^t, y_2^t)) \\
(X_6^t, Y_6^t) &= ((x_3^t, x_1^t),\ (y_1^t, y_2^t))
\end{aligned} \tag{1}$$
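To make the enumeration concrete, the following Python sketch lists the candidate associations using itertools; the function name is ours, and this is an illustrative reconstruction rather than the paper’s code.

```python
from itertools import permutations

def enumerate_associations(n_rl, n_im):
    """All A(n_rl, n_im) candidate associations: each candidate assigns
    a distinct vehicle index to every image detection."""
    # Tuple p means: image detection j is hypothesized to be vehicle p[j].
    return list(permutations(range(n_rl), n_im))

# For n_rl = 3 vehicles and n_im = 2 detections there are 3 * 2 = 6
# candidates, matching the six (X_i, Y_i) pairs of Equation (1):
print(enumerate_associations(3, 2))
# [(0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1)]
```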
Merging the coordinate pair sets from historical frames $1$ to $t$ yields $\{X_i, Y_i\}_{i=1}^{A_{n_{rl}}^{n_{im}}}$. Using the least-squares method, we solve for the set of projection matrices $\{\theta_i\}_{i=1}^{A_{n_{rl}}^{n_{im}}}$ corresponding to each possible association result, where $Y_i = X_i \theta_i$. To illustrate, taking $(X_2, Y_2)$ from the previous example, the solution for $\theta_2$ is shown in Equation (2).

$$X_2 = \begin{bmatrix} x_2^1 & 1 \\ x_1^1 & 1 \\ \vdots & \vdots \\ x_2^t & 1 \\ x_1^t & 1 \end{bmatrix}, \quad
Y_2 = \begin{bmatrix} y_1^1 & 1 \\ y_2^1 & 1 \\ \vdots & \vdots \\ y_1^t & 1 \\ y_2^t & 1 \end{bmatrix}, \quad
\theta_2 = \left(X_2^{\mathsf{T}} X_2\right)^{-1} X_2^{\mathsf{T}} Y_2 \tag{2}$$
Finally, we use the solved matrix set $\{\theta_i\}_{i=1}^{A_{n_{rl}}^{n_{im}}}$ to re-project the set of relative coordinates $\{x_i^{t+1}\}_{i=1}^{n_{rl}}$ fed back by the UGVs at time $t+1$. This yields $A_{n_{rl}}^{n_{im}}$ sets of candidate target image coordinates, denoted $\{\hat{y}_i^{t+1}\}_{i=1}^{n_{im}}$. By comparing these with the actual image coordinates $\{y_i^{t+1}\}_{i=1}^{n_{im}}$, we take the association result corresponding to the projection matrix with the minimum re-projection error as the association output.
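The least-squares fit and reprojection test described above can be condensed into a short NumPy sketch. The helper names and array layouts are our assumptions (relative coordinates stacked as rows with a homogeneous 1, per Equation (2)); it illustrates the workflow rather than the authors’ implementation.

```python
import numpy as np

def fit_projection(X_rel, Y_img):
    """Least-squares solve of Y = [X | 1] @ theta (cf. Equation (2)).

    X_rel: (m, 3) relative coordinates stacked over frames 1..t
    Y_img: (m, 2) corresponding image coordinates
    """
    A = np.hstack([X_rel, np.ones((len(X_rel), 1))])   # homogeneous column
    theta, *_ = np.linalg.lstsq(A, Y_img, rcond=None)  # (4, 2) matrix
    return theta

def reprojection_error(theta, x_next, y_next):
    """Mean distance between reprojected and observed image coordinates."""
    A = np.hstack([x_next, np.ones((len(x_next), 1))])
    return np.linalg.norm(A @ theta - y_next, axis=1).mean()

def select_association(candidates, history_pairs, x_next, y_next):
    """Pick the candidate whose matrix minimizes the error at frame t+1."""
    errors = []
    for cand in candidates:
        X_rel, Y_img = history_pairs[cand]             # stacked pairs for cand
        theta = fit_projection(X_rel, Y_img)
        # Under cand, image detection j corresponds to vehicle cand[j].
        errors.append(reprojection_error(theta, x_next[list(cand)], y_next))
    return candidates[int(np.argmin(errors))]
```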
Furthermore, we can extend the re-projection process to additional future frames. By counting the association results output for each future frame and selecting the most frequent one as the final output, we achieve a higher degree of association confidence.
It is noteworthy that our projection-based method remains effective for scenarios where $n_{im} > n_{rl}$. When false detections exist in the UAV imagery, resulting in $n_{im} > n_{rl}$ (i.e., there exist targets that should not be associated), the number of possible projection matrices becomes $A_{n_{im}}^{n_{rl}}$ rather than $A_{n_{rl}}^{n_{im}}$. Nevertheless, the subsequent processing still follows the original workflow illustrated in Figure 3. In Section 4, we will further analyze the performance of our projection-based algorithm under false detection conditions.

3.3. Topology-Based Association Method

The projection-based method focuses mainly on sequential positional information of UGVs, which reflects their temporal motion patterns to establish associations. We further propose a topology-based method to specifically address the relative spatial relationships among UGV targets. In previous studies on cross-view and cross-platform target association based on topological structures, the positions of multiple objects are utilized to construct topological structures, such as triangular topologies. The association results between object nodes across different topological structures are determined by comparing the similarities between these topologies. However, many prior studies overlooked the orientation information of multiple objects, which plays a significant role in the construction and correlation of topological structures.
Due to the high similarity in appearance among UGVs, it is impractical to distinguish UGV targets solely from appearance characteristics. Instead, we exploit this very similarity: a single ResNet classification model suffices to classify the front direction of the oriented targets returned by the rotated object detection model. Our method then enhances association precision by constructing topological structures that incorporate not only the positional information of targets but also their orientation.
Figure 4 illustrates the process employed in our approach to determine the orientation of objects within the image and the direction of the UGVs’ front. During the training process, after collecting the dataset to train the rotated object detection model, the training images for the object direction classification network can be obtained directly by cropping the bounding boxes from the rotated object detection annotations. An unsupervised clustering method is applied to partition the high-level features extracted by a pretrained ResNet-18 network, resulting in an auto-annotated object direction classification dataset. This dataset is then used to train the ResNet-18 classification network after some manual corrections. In the inference process, UAV perspective images are processed through the rotated object detection model, the oriented bounding boxes are then cropped and fed into the object direction classification model, outputting the coordinates and orientation information of the UGVs.
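As a sketch of this auto-annotation step (assuming torchvision and scikit-learn, and assuming two direction classes that resolve the 180° front/back ambiguity of an oriented bounding box), the clustering could look like the following; all names are illustrative rather than the authors’ code.

```python
import torch
from torchvision import models, transforms
from sklearn.cluster import KMeans

# Pretrained ResNet-18 with its classifier head removed acts as a
# high-level feature extractor for the cropped UGV chips.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

@torch.no_grad()
def auto_annotate(crops, n_directions=2):
    """Cluster crop features into direction pseudo-labels, which are
    manually corrected before training the direction classifier."""
    feats = torch.stack([backbone(preprocess(c).unsqueeze(0)).squeeze(0)
                         for c in crops])
    return KMeans(n_clusters=n_directions, n_init=10).fit_predict(feats.numpy())
```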
Figure 5 illustrates the angular topology constructed in our topology-based association method. In the world coordinate system, let $v_0^{wd}$ denote the orientation vector of the UAV, $v_i^{wd}$ the orientation vector of the $i$-th unmanned vehicle, and $v_{0i}^{wd}$ the vector extending from the UAV’s position to the location of the $i$-th unmanned vehicle. Similarly, after obtaining visual cues for the orientation and front direction of the unmanned vehicles in the image, we define $v_0^{im}$ as the UAV’s orientation vector, assumed to point upwards from the image center. The orientation vector of the $j$-th unmanned vehicle, as determined by the rotated object detection model and the vehicle front direction classification model, is denoted $v_j^{im}$. Finally, $v_{0j}^{im}$ represents the vector extending from the image center to the image location of the $j$-th unmanned vehicle.
Subsequently, in the world coordinate system, we utilize the law of cosines to compute the angles between each vector $v_{0i}^{wd}$ and the vectors $v_0^{wd}$ and $v_i^{wd}$, denoted $\alpha_i^{wd}$ and $\beta_i^{wd}$, respectively. Analogously, in the image coordinate system, we derive $\alpha_j^{im}$ and $\beta_j^{im}$ using similar computations. We consider the pair $(\alpha_i, \beta_i)$ as the state variables of each target in its respective coordinate system.
To evaluate the topology association between targets in the world and image coordinate systems, we construct a cost matrix $C$ in which each element $c_{ij}$ represents the dissimilarity between the state of target $i$ in the world coordinate system and target $j$ in the image coordinate system. This can be formulated as:

$$c_{ij} = \left\| \left(\alpha_i^{wd}, \beta_i^{wd}\right) - \left(\alpha_j^{im}, \beta_j^{im}\right) \right\|_2$$

where $0 \le i \le n_{rl}$, $0 \le j \le n_{im}$, and $\|\cdot\|_2$ denotes the Euclidean norm. Given this cost matrix, we employ the Hungarian algorithm to determine the optimal assignment that minimizes the total cost. The final assignment reveals the correspondence between the imaged targets and the actual UGV identities.
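A compact sketch of the angular-state construction and the Hungarian matching follows, using SciPy’s linear_sum_assignment; the helper names are ours and the sketch is illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def angle(u, w):
    """Angle between two vectors via the law of cosines."""
    cos = np.dot(u, w) / (np.linalg.norm(u) * np.linalg.norm(w))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def angular_states(v0, v_targets, v_rays):
    """State (alpha_i, beta_i): angles between each observer-to-target ray
    v_rays[i] and (a) the observer orientation v0, (b) the target orientation."""
    alpha = np.array([angle(v0, r) for r in v_rays])
    beta = np.array([angle(v, r) for v, r in zip(v_targets, v_rays)])
    return np.stack([alpha, beta], axis=1)            # shape (n, 2)

def topology_associate(states_world, states_image):
    """Cost c_ij = ||(alpha, beta)_i^wd - (alpha, beta)_j^im||_2,
    minimized globally by the Hungarian algorithm."""
    diff = states_world[:, None, :] - states_image[None, :, :]
    C = np.linalg.norm(diff, axis=2)                  # (n_rl, n_im) cost matrix
    rows, cols = linear_sum_assignment(C)
    return list(zip(rows, cols)), C                   # (vehicle i, detection j)
```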

3.4. Decision-Level Fusion Based on Dempster–Shafer Method

Based on the projection-based and topology-based association methods, we further propose a confidence calculation for each, and utilize the Dempster–Shafer method to fuse their results at the decision level. In the projection-based association method, we assume that the projection matrix corresponding to the final association result is $\theta_m$. Our hypothesis posits that when the credibility of the association result is sufficiently high, its corresponding projection matrix will yield smaller errors during reprojection. Specifically, the association confidence is higher when the reprojected image coordinates lie closer to the actual image coordinates of their associated targets while remaining farther from the coordinates of other targets. Let the reprojection error matrix be denoted $E_{n_{im} \times n_{rl}} = [e_{ij}]$, where:
$$e_{ij} = \left\| \hat{y}_i^{t+1} - y_j^{t+1} \right\|$$
Based on this, we define the confidence matrix $M^P_{n_{im} \times n_{rl}} = [m_{ij}^P]$ for the projection-based association results as follows:

$$M^P_{n_{im} \times n_{rl}} = \mathrm{RowNorm}\left(1 - \mathrm{MinMaxNorm}\left(E_{n_{im} \times n_{rl}}\right)\right)$$
where MinMaxNorm represents the min-max normalization operation, defined as:

$$\mathrm{MinMaxNorm}(x) = \frac{x - \min(x)}{\max(x) - \min(x)}$$
This operation maps the elements of the error matrix $E$ to the interval $[0, 1]$. RowNorm denotes the row normalization operation, defined as:

$$\mathrm{RowNorm}(X_{ij}) = \frac{X_{ij}}{\sum_k X_{ik}}$$
This operation ensures that each row sums to 1, thus constraining the confidence distribution for each target to the interval $[0, 1]$ with unit sum. Through this methodology, we can quantify the credibility of association results. For the topology-based association method, we directly utilize the cost matrix $C$. The confidence matrix $M^T_{n_{im} \times n_{rl}} = [m_{ij}^T]$ is computed as follows:

$$M^T_{n_{im} \times n_{rl}} = \mathrm{RowNorm}\left(1 - \mathrm{MinMaxNorm}(C)\right)$$
After obtaining the confidence matrices from both the projection-based and topology-based methods, we employ a fusion strategy based on Dempster–Shafer theory to combine this evidence. The fusion process is designed to leverage the strengths of both association methods, potentially leading to more robust and accurate results, and can be described as follows. For each image target $i$, we compute the combined confidence $M^C_{n_{im} \times n_{rl}} = [m_{ij}^C]$ for association with vehicle $j$ as:

$$m_{ij}^C = \frac{m_{ij}^P \cdot m_{ij}^T}{1 - K_i}$$
where $K_i$ represents the degree of conflict between the two evidence sources for target $i$, calculated as:

$$K_i = \sum_{k \neq j} m_{ik}^P \cdot m_{ij}^T$$
The combined confidence values are then normalized to ensure they sum to unity for each target:

$$m_{ij}^C = \frac{m_{ij}^C}{\sum_k m_{ik}^C}$$
This fusion approach allows for the integration of evidence from both association methods, potentially mitigating individual weaknesses and enhancing the overall reliability of the association results.
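The confidence construction and Dempster–Shafer combination of this subsection reduce to a few lines of NumPy, sketched below; the small epsilon guards against division by zero and is our addition, not part of the paper’s formulation.

```python
import numpy as np

EPS = 1e-12

def min_max_norm(x):
    return (x - x.min()) / (x.max() - x.min() + EPS)

def row_norm(x):
    return x / (x.sum(axis=1, keepdims=True) + EPS)

def confidence_from_errors(cost_like):
    """Shared recipe for M^P and M^T: small error/cost -> high confidence."""
    return row_norm(1.0 - min_max_norm(cost_like))

def dempster_shafer_fuse(m_p, m_t):
    """Combine the two confidence matrices per image target i."""
    n_im, n_rl = m_p.shape
    fused = np.zeros_like(m_p)
    for i in range(n_im):
        for j in range(n_rl):
            # Conflict mass: projection evidence for some other vehicle k
            # while topology supports vehicle j.
            k_i = sum(m_p[i, k] * m_t[i, j] for k in range(n_rl) if k != j)
            fused[i, j] = m_p[i, j] * m_t[i, j] / (1.0 - k_i + EPS)
        fused[i] /= fused[i].sum() + EPS              # renormalize the row
    return fused
```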

3.5. Enhanced Multi-Object Tracking Process

Original MOT algorithms apply appearance-based association to distinguish inter-frame targets. However, when faced with multiple targets of similar appearance, these methods can actually confuse the re-association process between objects. This limitation becomes particularly problematic in our air–ground cooperative scenarios.
Since previous tracking algorithms cannot utilize appearance features for inter-frame matching when the tracked targets have similar appearances, we add a new step, built on our proposed association method, to the classical two-stage tracking algorithm BYTETrack; the new step utilizes sensor information to process mismatched detection boxes.
In Algorithm 1, we use the oriented object detector and the orientation classifier to obtain the rotated bounding boxes and orientations of the UGVs in the image. We adopt a straightforward approach to make BYTETrack compatible with our rotated detection boxes, using the axis-aligned bounding boxes enclosing the rotated boxes as input to the original tracker. Additionally, we use a sensor buffer $S$ to store the sensor data $\{x_i^t, \alpha_i^t\}_{i=1}^{n_{rl}}$ transmitted from the onboard sensors. Similarly, the original BYTETrack tracker maintains the image coordinates and orientation information of all historical trajectories $T$. We process the trajectories $T$ and the sensor buffer $S$ using the association and fusion methods described in Section 3.2, Section 3.3 and Section 3.4, obtaining the identity $t.ID$ for each UGV trajectory $t \in T$ and the projection matrix $\theta_m$.
Algorithm 1: Enhanced multi-object tracking process based on BYTETrack. (Lines 7–12 and 15–22 are the newly proposed processes added to BYTETrack.)
Input: video sequence $V$; oriented object detector $Det_{rot}$; orientation classifier $Cls$; sensor data $\{x_i^t, \alpha_i^t\}_{i=1}^{n_{rl}}$
Output: video tracks $T$ with associated IDs
 1. Initialize tracks $T$ and sensor buffer $S$
 2. for each frame $f_k \in V$ do
 3.    Oriented object detection: $D_k^{rot} \leftarrow Det_{rot}(f_k)$
 4.    Orientation classification: $D_k^{rot}.orientation \leftarrow Cls(\{crop(D_k^{rot})\})$
 5.    Store sequential sensor data: $S \leftarrow \{x_i^t, \alpha_i^t\}_{i=1}^{n_{rl}}$
 6.    Split $D_k^{rot}$ into $D_{high}$, $D_{low}$ by threshold $\tau$
 7.    $\theta_m, M^P \leftarrow ProjectionBasedAssoc(S, T)$
 8.    $M^T \leftarrow TopologyBasedAssoc(\{x_i^t, \alpha_i^t\}_{i=1}^{n_{rl}}, T)$
 9.    $M^C \leftarrow DempsterShaferFusion(M^P, M^T)$
10.    for each track $t \in T$ do
11.        $t.ID \leftarrow \arg\max_j M^C(t, j)$
12.    end for
13.    $M, T_{remain}, D_{remain} \leftarrow BYTETrackAssoc(T, D_{high} \cup D_{low})$
14.    Update matched tracks: $T \leftarrow T \cup M$
15.    for each $d \in D_{remain}$ do
16.        $\{\hat{y}_i^t\}_{i=1}^{n_{rl}} \leftarrow \theta_m \cdot \{x_i^t\}_{i=1}^{n_{rl}}$
17.        $d.ID \leftarrow \arg\min_i \| d.pos - \hat{y}_i^t \|$
18.        if $\exists\, t \in T_{remain}$ with $t.ID == d.ID$ then
19.            Update unmatched track $t$ with unmatched detection $d$
20.            $T \leftarrow T \cup t$, $T_{remain} \leftarrow T_{remain} \setminus t$
21.        end if
22.    end for
23.    Prune lost tracks ($T \leftarrow T \setminus T_{remain}$)
24.    Init new tracks for $d \in D_{remain}$ with $d.score > \tau$
25. end for
26. return $T$ with ID association results
In our processing pipeline for unmatched detection boxes, the projection matrix $\theta_m$ plays a crucial role. Specifically, after the original BYTETrack association process, we obtain matched tracklets $M$, unmatched detection boxes $D_{remain}$, and unmatched tracklets $T_{remain}$. The unmatched detection boxes would normally be initialized as new trajectories, leading to ID switches. In our approach, we instead multiply the relative coordinates $\{x_i^t\}_{i=1}^{n_{rl}}$ of all UGVs in the current frame $t$ by the projection matrix $\theta_m$, projecting them to image coordinates $\{\hat{y}_i^t\}_{i=1}^{n_{rl}}$. For each unmatched detection box $d \in D_{remain}$, we find its nearest neighbor in $\{\hat{y}_i^t\}_{i=1}^{n_{rl}}$ to determine its identity $d.ID$. We then check whether an unmatched trajectory with the same ID exists. If so, the unmatched detection box is reassociated with this trajectory and updates it, thereby reducing ID switches.
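A sketch of this reassociation step is shown below; the track and detection attributes (t.ID, t.update, d.pos) are hypothetical stand-ins for the tracker’s internal structures, not BYTETrack’s actual API.

```python
import numpy as np

def reassociate(theta_m, x_rel, d_remain, t_remain):
    """Re-associate unmatched detections with unmatched tracks through the
    projection matrix theta_m instead of spawning new track IDs."""
    A = np.hstack([x_rel, np.ones((len(x_rel), 1))])
    y_hat = A @ theta_m                       # projected image coords of all UGVs
    still_unmatched = []
    for d in d_remain:
        ugv_id = int(np.argmin(np.linalg.norm(y_hat - d.pos, axis=1)))
        track = next((t for t in t_remain if t.ID == ugv_id), None)
        if track is not None:
            track.update(d)                   # revive the lost trajectory
            t_remain.remove(track)
        else:
            still_unmatched.append(d)         # may initialize a new track later
    return still_unmatched, t_remain
```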

4. Experiments

4.1. Simulation Experiment

We employ the CARLA simulator to reproduce the association and identification problem in air–ground collaborative scenarios proposed in this paper. Within the Town 10 map, aerial cameras are generated to simulate the perspective of a UAV, and multiple identical autonomous vehicles are spawned to represent the visually similar cooperative UGVs. In the CARLA environment, these vehicles feed back their position and orientation information in the world coordinate system.
Additionally, our experimental setup involved six moving unmanned vehicles within the field of view and one vehicle outside it. This configuration allows a comprehensive assessment of the proposed methods under varying topographical conditions and partial observability. During the simulation, each vehicle maintained a speed exceeding 1.0 m/s, with motion planning handled by CARLA’s built-in autonomous driving functionality, incorporating dynamic speed adjustments and lane-changing maneuvers rather than simple uniform linear motion. For each experiment type in this section, our data collection exceeded 1000 frames per road scenario to ensure statistically robust conclusions and reliable metric evaluation.
To extract the visual cues required by both the projection-based and topology-based methods, we employ the YOLOv8-OBB model for rotated detection of UGVs, a ResNet-18 classification network for determining the vehicle heading direction, and BYTETrack for tracking the UGV targets.
As illustrated in Figure 6, we collect the visual cues in CARLA in two road scenarios. To be more specific, the green rotated bounding boxes represent the output of the rotated object detection model. The red horizontal bounding boxes are the axis-aligned rectangles enclosing the rotated boxes, which are input into the BYTETrack algorithm. Blue dots indicate the vehicle heading direction determined by the ResNet-18 classification model.
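The conversion from a rotated detection box to its enclosing axis-aligned rectangle is a one-liner with OpenCV; a sketch follows, assuming the (center, size, angle) convention of cv2’s RotatedRect.

```python
import cv2

def obb_to_aabb(cx, cy, w, h, angle_deg):
    """Axis-aligned box enclosing a rotated box, as fed to BYTETrack."""
    pts = cv2.boxPoints(((cx, cy), (w, h), angle_deg))  # 4 corner points
    x, y, bw, bh = cv2.boundingRect(pts)
    return x, y, x + bw, y + bh                         # (x1, y1, x2, y2)
```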
In the experiment section, this study employs association precision as a metric to evaluate various association methods under different conditions. The precision is defined as:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
where $TP$ represents the number of correctly associated UGV targets in the UAV perspective imagery, and $FP$ denotes the number of incorrectly associated ones. In the following multi-object tracking experiments, we use the number of ID switches as the metric to indicate whether our enhanced multi-object tracking process makes a difference.

4.1.1. Association Experiments Under Different Conditions

Firstly, we conducted tests on the projection-based association method in both straight and curved road scenarios. Specifically, we utilized relative-image coordinate pairs from 10 historical frames to compute the projection matrix. Subsequently, we performed reprojection on the relative world coordinates of the 11th and 12th frames, deriving the association result whose corresponding matrix minimizes the reprojection error.
To simulate real-world scenarios where localization devices on UGVs are often subject to noise, we artificially introduced Gaussian noise to the relative coordinates reported by CARLA vehicles. This noise was characterized by a mean of 0 and a standard deviation incrementing from 0 to 4 in steps of 0.25. We also rotated the UAV’s perspective at different angles, including 90 degrees and 180 degrees, and obtained the association precision under different perspective changes. Additionally, each experiment incorporated over 1000 frames.
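The noise-injection protocol can be reproduced with a few lines of NumPy, as sketched below; the array shapes are placeholders for the six simulated vehicles.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def add_sensor_noise(x_rel, sigma):
    """Zero-mean Gaussian noise on the relative coordinates from CARLA."""
    return x_rel + rng.normal(0.0, sigma, size=x_rel.shape)

# Sweep the standard deviation from 0 to 4 in steps of 0.25.
for sigma in np.arange(0.0, 4.0 + 1e-9, 0.25):
    noisy = add_sensor_noise(np.zeros((6, 3)), sigma)  # 6 UGVs, (a, b, c)
    # ...run the association pipeline and record precision at this level...
```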
Figure 7 illustrates that varying motion conditions exert a discernible influence on the projection-based method. In more complex motion scenarios, such as when multiple vehicles are on the curved road, the association precision decreases by approximately 10% to 20%. Furthermore, sensor noise demonstrates a significant impact on association precision.
Under conditions of mild sensor localization noise with no perspective change, the association precision in straight road scenarios remains higher than 90%, while in curved road scenarios, it approximates 80%. For each 0.25 increment in the standard deviation of sensor noise, the average decrease in association precision is 1.98% for straight road scenarios and 1.67% for curved road scenarios. When the standard deviation of sensor noise reaches a substantial level of 4, the association precision in straight road scenarios declines to approximately 60%, while in curved road scenarios, it diminishes to merely 50%.
Figure 7 also shows the impact of different perspective transformations on the projection-based association method. It can be observed that the pattern of how association precision is affected by noise is consistent across all different observation angles, and under the same noise level, there is no significant difference in association precision between different perspectives.
Similarly, we conducted analogous experiments on the topology-based association method, introducing Gaussian noise to the angle and position information reported from CARLA vehicles. The standard deviation of this noise was incrementally increased from 0 to 10 in steps of 1. As illustrated in Figure 8, the precision of the topology-based association method exhibits patterns similar to those observed for the projection-based method. Gaussian noise measurably degraded the association results of the topology-based method, as did the more complicated motion patterns.
Despite their similarities, the projection-based and topology-based association methods exhibit distinct differences in certain aspects. Specifically, for each 0.25 increment in the standard deviation of sensor noise, the average precision of the topology-based method decreases by 1.06% in straight road scenarios and 0.96% in curved road scenarios. This comparative analysis reveals that the topology-based association method is more robust to noise interference than the projection-based method.
However, it is noteworthy that under low-noise conditions (when the standard deviation of noise is less than 1), the projection-based method achieves higher precision than its topology-based counterpart. This discrepancy can be attributed to the fact that, even in the absence of noise interference, the topology-based association method’s performance is contingent upon the accuracy of the ResNet-18 front direction classification model.
In addition, Figure 8 shows the impact of perspective transformation on the precision of the topology-based association method. It can be seen that under the same noise level, different perspective transformations have a more significant impact on the precision compared to the projection-based association method. This is because after the perspective changes, the changes in the topological structure will be significantly greater than the changes in the target motion patterns in the image.
To leverage the distinctive characteristics and respective advantages of both methods, we employed the decision-level fusion approach proposed in this study under identical experimental scenarios. We applied noise with standard deviations of 0.25 and 0.75 to the projection-based and topology-based methods, respectively, incrementing by 0.25. Figure 9 illustrates the variation in association accuracy for both straight and curved road scenarios. The results demonstrate that the fused approach yields a marked improvement of 10–20% in association precision across nearly all noise conditions in both motion patterns. Additionally, Figure 9 also indicates that under different perspective transformations, the decision-level fusion method can reduce the precision fluctuations caused by perspective changes.
The enhanced performance post-fusion stems from three key factors. First, regarding noise resilience, the integration of methods with different noise sensitivities enables superior robustness across diverse noise levels. Second, in terms of information synergy, the combination of spatial (projection-based) and relational (topology-based) information provides a more comprehensive scene representation that facilitates more accurate associations. Third, concerning error compensation, inaccuracies from one method can be effectively counterbalanced by accurate predictions from the other method, thereby enhancing overall accuracy.
Overall, this fusion method not only elevates overall precision but also demonstrates enhanced resilience to noise and adaptability to diverse road conditions.

4.1.2. Association Comparison with TTS Method

This experiment compares the TTS association method [36] with the proposed topology-based approach in this paper. The TTS method, characterized by its scale-invariant property, establishes associations through angular relationships in Delaunay triangulation-based topological structures, making it applicable to cross-coordinate system association tasks similar to our problem and outperforming association methods, including RTF [39] and TTF [35]. Therefore, we use it as our benchmark for comparison. Distinct from TTS, our proposed topology-based method incorporates target orientation information during topological construction and introduces a projection-based approach with decision-level fusion.
As illustrated in Figure 10, while both methods perform association by constructing and comparing topological similarities, they exhibit fundamental differences. The TTS method solely relies on positional information to establish triangular topologies. Although achieving acceptable association accuracy in low-noise environments, its performance degrades significantly under noise interference. In contrast, our topology-based method achieves superior accuracy across most noise conditions.
In addition to precision, we further evaluated the real-time performance of the methods using a hardware configuration consisting of a 13th Gen Intel(R) Core(TM) i9-13900HX CPU and an NVIDIA GeForce RTX 3090 GPU. The specific latency results are presented in Table 1.
The TTS method exhibits the shortest latency, but its precision is unsatisfactory in noisy environments. The topology-based method has slightly higher latency than TTS but maintains better performance. The projection-based method, due to the need for multiple matrix operations and least-squares computations, results in a latency exceeding 100 ms.
However, this is acceptable in practical applications. Once an association is established via the projection-based method, the tracking algorithm can continuously track targets with confirmed identities without repeated associations. Furthermore, the projection matrix generated by the projection-based method aids re-association in occlusion scenarios, eliminating the need for the full re-association after occlusion that the topology-based and TTS methods require. Finally, the proposed decision fusion process adds negligible computational overhead, completing fusion in less than 1 ms.

4.1.3. Association Experiments Under False Detection Condition

Current association algorithms, including ours, often rely on the target detection module. When an error occurs in the target detection module, such as a false detection (false positive, FP), the subsequent association module tends to propagate the error, because the association algorithm will correlate the extra false positive targets and produce association results that should not exist.
To evaluate the association performance of our proposed algorithm under false detection conditions, we collected and compared more than 1800 frames of association results for the projection-based and topology-based association methods and the TTS method in two false detection scenarios, as shown in Figure 11. In each scenario, a total of six vehicles in the swarm fed back their sensor data, while the detection algorithm detected seven UGVs in the UAV view, including one extra false positive target.
After the association algorithm has finished, one detected target remains unassociated. We quantified the algorithm’s capability to exclude the FP target by verifying whether this unassociated target corresponded to the FP instance. Therefore, in addition to the original association precision metric, we introduce the False Positive Exclusion Rate (FPER) to quantify the algorithm’s capability to correctly exclude FP detections. The FPER is defined as:
$$FPER = \frac{N_c}{N_t}$$

where $N_c$ represents the number of frames in which the FP target is correctly excluded, and $N_t$ is the total number of frames containing FP detections. FPER provides a quantitative measure of the algorithm’s robustness against false positive inputs from the detection module.
Table 2 presents the association precision and FPER metrics for the three association methods under false detection scenarios. Our projection-based method slightly outperforms the topology-based method, primarily due to its utilization of sequential information rather than relying solely on single-frame data. Furthermore, both proposed methods demonstrate significantly superior FPER metrics compared to the TTS method. This disparity can be attributed to the TTS method’s use of Delaunay triangulation in constructing topological structures: the presence of false positive targets interferes with the overall topology construction, thereby substantially disrupting the subsequent calculation of topological similarity. Figure 12 provides a qualitative comparison, in which the TTS method fails to exclude the falsely detected target.

4.1.4. Enhanced Tracking Experiments Under Occlusion Conditions

Since existing object tracking datasets lack onboard sensor information for tracked targets, to validate the effectiveness of our proposed enhanced multi-object tracking (MOT) process, we artificially place blue-dot occlusions in bird’s-eye-view images of curved road scenarios, simulating real-world occlusions and missed detection phenomena.
Figure 13 demonstrates the performance of our enhanced BYTETrack algorithm and the original BYTETrack during the simulation. The blue dots represent synthetic occlusions, strategically placed at vehicle turning points in the image. During occlusion, due to the complex motion patterns of turning vehicles, the Kalman filter fails to accurately predict the vehicle’s trajectory. Consequently, when the target reappears, its detection bounding box cannot be correctly matched with the predicted state, leading to mismatched detections and ID-switch errors.
Our enhanced MOT process addresses this issue in the two cases by reprojecting the mismatched bounding boxes and re-associating them with previously mismatched trajectories, thereby reducing the occurrence of ID-switch errors.

4.2. Physical Experiments

To validate the practical applicability of our approach, we conducted comprehensive physical experiments. The experimental setup comprised three unmanned ground vehicles (UGVs) equipped with onboard GNSS/INS systems transmitting their location, along with a drone capturing aerial imagery at 1080p resolution from an operational altitude of 39 m. The drone was sourced from DJI Technology Co., Ltd., Shenzhen, China. The system photos of the UAV and UGVs used in the physical experiment are shown in Figure 14.
Figure 15 illustrates the physical environment where multiple visually identical UGVs were deployed in an open-field setting with natural terrain variations. To evaluate the robustness of our association framework against false detections, we intentionally included one human operator within the drone’s field of view to simulate false positive detection scenarios. All experimental trials exceeded 1000 consecutive frames to ensure statistical significance, with each UGV operated independently by different handlers executing randomized motion patterns including non-linear trajectories, sudden direction changes, and variable speeds.
Table 3 presents the quantitative comparison between our projection-based method and the TTS baseline under real-world conditions. While both methods achieved comparable False Positive Exclusion Rates (FPER) of 75.00% and 82.00%, respectively, our projection-based approach demonstrated significantly superior association precision. This performance disparity stems from fundamental methodological differences: The TTS method relies exclusively on static spatial relationships through Delaunay triangulation, which becomes ineffective when UGVs form near-equilateral formations where inter-target distances lack discriminative power. In contrast, our projection-based method leverages temporal motion patterns and sequential positional information, enabling it to maintain robust performance.

5. Conclusions

This paper focuses on the challenge of distinguishing visually identical UGVs from a UAV’s perspective. Deep learning-based techniques for multi-object detection and association are widely used, but most of these methods either focus on tasks within multi-UAV image coordinate systems or rely heavily on visual appearance features. As a result, there is a research gap when it comes to associating UAV visual detections (in image coordinates) with UGV onboard sensor data (latitude, longitude, orientation) for such visually identical targets.
We reformulate the identification problem as an association task between onboard sensor data and UAV-observed positions and directions of UGVs. To resolve this association issue, we propose a unified association framework that fuses projection-based and topology-based methods using Dempster–Shafer theory to generate robust association results. Experimental results demonstrate that both of our proposed methods outperform previous association techniques under various noise conditions. Moreover, the fusion method enables our approach to maintain robust performance even in high-noise environments. Additionally, our methodology effectively handles false detections from the object detection module and provides auxiliary support for target tracking algorithms, mitigating ID-switching due to occlusions.
The work presented in this paper is applicable to various tasks involving air–ground collaborative systems. It enables these systems to overcome the application constraints imposed by markers such as AprilTag, requiring only inexpensive onboard sensor information to accomplish identity recognition of visually identical targets in images. This advancement potentially broadens the air–ground system applications and flexibility in diverse operational scenarios.
The proposed methods have certain limitations. When UGVs exhibit highly similar or identical motion patterns, the projection-based approach may struggle to differentiate targets based on their past movement characteristics. Furthermore, the topology-based method not only relies on the object detection module but also depends on a vehicle heading orientation classification module to establish the topological structure in the image.

Author Contributions

Methodology, investigation, validation, visualization and writing—original draft, Y.C.; visualization and writing—review, B.D.; supervision and resources, T.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

Thanks to the Unmanned Systems Research Group at the College of Intelligence Science, National University of Defense Technology, China.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Elmokadem, T.; Savkin, A.V. Towards Fully Autonomous UAVs: A Survey. Sensors 2021, 21, 6223. [Google Scholar] [CrossRef]
  2. Criollo, L.; Mena-Arciniega, C.; Xing, S. Classification, Military Applications, and Opportunities of Unmanned Aerial Vehicles. Aviation 2024, 28, 115–127. [Google Scholar] [CrossRef]
  3. Du, S.; Zhong, G.; Wang, F.; Pang, B.; Zhang, H.; Jiao, Q. Safety Risk Modelling and Assessment of Civil Unmanned Aircraft System Operations: A Comprehensive Review. Drones 2024, 8, 354. [Google Scholar] [CrossRef]
  4. Tang, J.; Duan, H.; Lao, S. Swarm Intelligence Algorithms for Multiple Unmanned Aerial Vehicles Collaboration: A Comprehensive Review. Artif. Intell. Rev. 2023, 56, 4295–4327. [Google Scholar] [CrossRef]
  5. Zhou, C.; Li, J.; Shi, M.; Wu, T. Multi-robot path planning algorithm for collaborative mapping under communication constraints. Drones 2024, 8, 493. [Google Scholar] [CrossRef]
  6. Ma, T.; Lu, P.; Deng, F.; Geng, K. Air–Ground Collaborative Multi-Target Detection Task Assignment and Path Planning Optimization. Drones 2024, 8, 110. [Google Scholar] [CrossRef]
  7. Minaeian, S.; Liu, J.; Son, Y.-J. Vision-based target detection and localization via a team of cooperative UAV and UGVs. IEEE Trans. Syst. Man Cybern. Syst. 2016, 46, 1005–1016. [Google Scholar] [CrossRef]
  8. Wang, M.; Li, R.; Jing, F.; Gao, M. Multi-UAV Assisted Air–Ground Collaborative MEC System: DRL-Based Joint Task Offloading and Resource Allocation and 3D UAV Trajectory Optimization. Drones 2024, 8, 510. [Google Scholar] [CrossRef]
  9. Tan, Q.; Yang, X.; Qiu, C.; Liu, W.; Li, Y.; Zou, Z.; Huang, J. Graph-based target association for multi-drone collaborative perception under imperfect detection conditions. Drones 2025, 9, 300. [Google Scholar] [CrossRef]
  10. Olson, E. AprilTag: A Robust and Flexible Visual Fiducial System. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; pp. 3400–3407. [Google Scholar] [CrossRef]
  11. Wang, J.; Olson, E. AprilTag 2: Efficient and Robust Fiducial Detection. In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Republic of Korea, 9–14 October 2016; pp. 4193–4198. [Google Scholar] [CrossRef]
  12. Garrido-Jurado, S.; Muñoz-Salinas, R.; Madrid-Cuevas, F.J.; Marín-Jiménez, M.J. Automatic Generation and Detection of Highly Reliable Fiducial Markers Under Occlusion. Pattern Recognit. 2014, 47, 2280–2292. [Google Scholar] [CrossRef]
  13. Wojke, N.; Bewley, A.; Paulus, D. Simple Online and Realtime Tracking with a Deep Association Metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar] [CrossRef]
  14. Kuhn, H.W. The Hungarian Method for the Assignment Problem. Nav. Res. Logist. Q. 1955, 2, 83–97. [Google Scholar] [CrossRef]
  15. Yager, R.R. On the Dempster-Shafer Framework and New Combination Rules. Inf. Sci. 1987, 41, 93–137. [Google Scholar] [CrossRef]
  16. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  17. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar] [CrossRef]
  18. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  19. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar] [CrossRef]
  20. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 9650–9660. [Google Scholar] [CrossRef]
  21. Liu, D.; Zhao, L.; Wang, Y.; Kato, J. Learn from Each Other to Classify Better: Cross-Layer Mutual Attention Learning for Fine-Grained Visual Classification. Pattern Recognit. 2023, 140, 109550. [Google Scholar] [CrossRef]
  22. Yang, J.; He, K.; Zhang, J.; Li, J.; Chen, Q.; Wei, X.; Sheng, H. A Binocular Vision-Assisted Method for the Accurate Positioning and Landing of Quadrotor UAVs. Drones 2025, 9, 35. [Google Scholar] [CrossRef]
  23. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple Online and Realtime Tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468. [Google Scholar] [CrossRef]
  24. Aharon, N.; Orfaig, R.; Bobrovsky, B.-Z. BoT-SORT: Robust Associations for Multi-Pedestrian Tracking. arXiv 2022, arXiv:2206.14651. [Google Scholar] [CrossRef]
  25. Du, Y.; Zhao, Z.; Song, Y.; Zhao, Y.; Su, F.; Gong, T.; Meng, H. StrongSORT: Make DeepSORT Great Again. IEEE Trans. Multimed. 2023, 25, 8725–8737. [Google Scholar] [CrossRef]
  26. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. ByteTrack: Multi-Object Tracking by Associating Every Detection Box. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 1–21. [Google Scholar] [CrossRef]
  27. Zhang, Y.; Wang, C.; Wang, X.; Zeng, W.; Liu, W. FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 5689–5698. [Google Scholar] [CrossRef]
  28. Sun, P.; Cao, J.; Jiang, Y.; Zhang, R.; Xie, E.; Yuan, Z.; Wang, C.; Luo, P. TransTrack: Multiple Object Tracking with Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 8122–8130. [Google Scholar] [CrossRef]
  29. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar] [CrossRef]
  30. Sun, P.; Cao, J.; Jiang, Y.; Yuan, Z.; Bai, S.; Kitani, K.; Luo, P. DanceTrack: Multi-Object Tracking in Uniform Appearance and Diverse Motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 20993–21002. [Google Scholar]
  31. Yang, F.; Odashima, S.; Masui, S.; Jiang, S. Hard to Track Objects with Irregular Motions and Similar Appearances? Make It Easier by Buffering the Matching Space. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 4799–4808. [Google Scholar]
  32. de Oliveira, I.O.; Fonseca, K.V.O.; Minetto, R. A Two-Stream Siamese Neural Network for Vehicle Re-Identification by Using Non-Overlapping Cameras. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 669–673. [Google Scholar] [CrossRef]
  33. Yang, S.; Ding, F.; Li, P.; Hu, S. Distributed Multi-Camera Multi-Target Association for Real-Time Tracking. Sci. Rep. 2022, 12, 11052. [Google Scholar] [CrossRef]
  34. Yue, S.; Yue, W.; Shu, W.; Xiu, S. Fuzzy Data Association Based on Target Topology of Reference. J. Natl. Univ. Def. Technol. 2006, 28, 105–109. [Google Scholar]
  35. Hao, Z.; Chula, S. Algorithm of Multi-Feature Track Association Based on Topology. Command. Inf. Syst. Technol. 2020, 11, 83–88. [Google Scholar]
  36. Li, X.; Wu, L.; Niu, Y.; Ma, A. Multi-Target Association for UAVs Based on Triangular Topological Sequence. Drones 2022, 6, 119. [Google Scholar] [CrossRef]
  37. Li, X.; Wu, L.; Niu, Y.; Jia, S.; Lin, B. Topological Similarity-Based Multi-Target Correlation Localization for Aerial-Ground Systems. Guid. Navig. Control 2021, 1, 2150016. [Google Scholar] [CrossRef]
  38. Han, R.; Feng, W.; Zhang, Y.; Zhao, J.; Wang, S. Multiple Human Association and Tracking from Egocentric and Complementary Top Views. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 5225–5242. [Google Scholar] [CrossRef] [PubMed]
  39. Tian, W.; Wang, Y.; Shan, X.; Yang, J. Track-to-Track Association for Biased Data Based on the Reference Topology Feature. IEEE Signal Process. Lett. 2014, 21, 449–453. [Google Scholar] [CrossRef]
Figure 1. The association problem addressed in this paper.
Figure 2. System architecture of the association framework.
Figure 3. Workflow of the projection-based association method.
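As context for the workflow in Figure 3, the sketch below illustrates the general idea of reprojection-error validation: each UGV's self-reported position is projected into the UAV image through a camera model, and detections are matched by minimizing total pixel error with the Hungarian algorithm [14]. This is a minimal illustration under an assumed pinhole model with hypothetical names (project_points, associate_by_reprojection), not the authors' implementation.

```python
# Illustrative sketch only: associating UGV odometry positions with UAV
# image detections by reprojection error (all names are hypothetical).
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian method [14]

def project_points(world_pts, K, R, t):
    """Pinhole projection of Nx3 world points into pixel coordinates."""
    cam = (R @ world_pts.T + t.reshape(3, 1)).T   # world -> camera frame
    uv = (K @ cam.T).T                            # camera -> image plane
    return uv[:, :2] / uv[:, 2:3]                 # perspective divide

def associate_by_reprojection(ugv_positions, detections, K, R, t):
    """Assign each detection to the UGV whose projected odometry
    position minimizes the total pixel reprojection error."""
    proj = project_points(ugv_positions, K, R, t)           # M x 2 pixels
    cost = np.linalg.norm(proj[:, None, :] - detections[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)                # optimal 1-to-1
    return list(zip(rows, cols)), cost[rows, cols]          # pairs, errors
```

Per-pair errors returned by the sketch could then be thresholded to reject assignments whose reprojection error is implausibly large, which is the validation role the figure attributes to this step.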
Figure 4. The training and inference process of our rotated object detection model and object direction classification model.
Figure 5. The angular topology we construct in our topology-based association method.
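The descriptor behind Figure 5 is built from positional and orientation data. One plausible construction, sketched below as a guess at the general idea rather than the paper's exact formulation, measures for each target the bearings to all other targets relative to that target's own heading, yielding a signature that can be compared between the UAV's detections and the UGVs' reported states.

```python
import numpy as np

def angular_topology(positions, headings):
    """For each target, the sorted angles to all other targets measured
    relative to that target's own heading (hypothetical descriptor)."""
    positions = np.asarray(positions, dtype=float)
    descriptors = []
    for i, (p, h) in enumerate(zip(positions, headings)):
        rel = np.delete(positions, i, axis=0) - p        # vectors to neighbors
        bearings = np.arctan2(rel[:, 1], rel[:, 0]) - h  # heading-relative
        bearings = np.mod(bearings, 2 * np.pi)           # wrap to [0, 2*pi)
        descriptors.append(np.sort(bearings))            # order-invariant
    return descriptors
```

Matching the UAV-side and UGV-side descriptors could then reduce to comparing the sorted angle vectors, e.g., by L1 distance.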
Figure 6. UAV imagery of straight and curved road scenarios in air–ground collaborative association experiments. (a) Straight road. (b) Curved road.
Figure 7. Association results of the projection-based method under different Gaussian noise conditions, road scenarios, and perspective transformations.
Figure 8. Association results of the topology-based method under different Gaussian noise conditions, road scenarios, and perspective transformations.
Figure 9. Association precision under different Gaussian noise conditions and perspective transformations. (a) Straight road. (b) Curved road.
Figure 10. Comparison of our topology-based method with the TTS method under different Gaussian noise conditions and road scenarios.
Figure 11. Two false detections that occurred during the association experiments. (a) A false positive detection of a black vehicle of a similar model. (b) A false positive detection of a white flag on a building.
Figure 12. Qualitative visualization of our methods and TTS under the same false detection scenario.
Figure 13. Typical multi-object tracking cases under occlusion. Since the CARLA environment lacks occluding elements, the blue dots represent artificial occlusion markers.
Figure 14. The UGV and UAV employed in our physical experimental setup.
Figure 15. The physical air–ground association scenario. The four bounding boxes represent the four detected targets: one falsely detected operator and three UGVs.
Table 1. Latency comparison of different association methods.

Method                    Latency (ms)
Projection-based          148.49
Topology-based            34.48
TTS                       22.52
Dempster–Shafer Fusion    0.27
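The sub-millisecond fusion entry in Table 1 is plausible given how lightweight Dempster's rule of combination is once each association method has expressed its hypotheses as mass functions. The sketch below shows the generic rule from the Dempster–Shafer framework [15] over frozenset focal elements; the mass values are illustrative, and this is not the authors' code.

```python
def dempster_combine(m1, m2):
    """Dempster's rule of combination for two mass functions defined over
    frozenset focal elements of the same frame of discernment [15]."""
    combined, conflict = {}, 0.0
    for a, w1 in m1.items():
        for b, w2 in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + w1 * w2
            else:
                conflict += w1 * w2          # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("Total conflict: sources cannot be combined")
    norm = 1.0 - conflict                    # renormalize remaining mass
    return {k: v / norm for k, v in combined.items()}

# Illustrative: projection-based and topology-based evidence for one detection.
m_proj = {frozenset({"ugv1"}): 0.7, frozenset({"ugv1", "ugv2"}): 0.3}
m_topo = {frozenset({"ugv1"}): 0.6, frozenset({"ugv2"}): 0.2,
          frozenset({"ugv1", "ugv2"}): 0.2}
print(dempster_combine(m_proj, m_topo))  # mass concentrates on {"ugv1"}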
Table 2. Performance comparison of different association methods under false detection conditions.

Method              Precision    FPER
Projection-based    99.07%       100.00%
Topology-based      96.53%       98.53%
TTS                 88.43%       45.71%
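Assuming FPER abbreviates the false-positive exclusion rate (the fraction of false-positive detections a method successfully leaves unassociated, consistent with the false-detection setting of Table 2) and Precision is the share of correct associations, the two columns could be computed as follows; the function name and the example counts are purely illustrative, not figures from the experiments.

```python
def association_metrics(num_correct, num_associations, num_excluded_fp, num_fp):
    """Precision = correct associations / attempted associations.
    FPER (assumed: false-positive exclusion rate) = excluded FPs / all FPs."""
    precision = num_correct / num_associations if num_associations else 0.0
    fper = num_excluded_fp / num_fp if num_fp else 0.0
    return precision, fper

# Illustrative counts: 96 of 100 associations correct, 2 of 2 FPs excluded.
print(association_metrics(96, 100, 2, 2))  # -> (0.96, 1.0)
```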
Table 3. Quantitative comparison in physical experiments.

Method              Precision    FPER
Projection-based    74.20%       75.00%
TTS                 19.50%       82.00%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
