1. Introduction
Vehicle tracking has a wide range of applications in fields such as intelligent transportation and environmental monitoring [1,2], as well as emergency response and disaster management [3]. Owing to the adaptability and safety of UAVs, vehicle tracking from a drone’s viewpoint has garnered significant research interest in recent years [4,5,6]. When multiple vehicles appear in the UAV view, the vehicle-tracking task becomes a multi-object tracking (MOT) task. During tracking, simultaneous motion of the UAV and the vehicles causes variations in the scale of vehicle targets, which in turn leads to missed detections and identity switches, greatly increasing the difficulty of MOT [7,8,9].
In the field of computer vision, MOT has shifted from traditional techniques to approaches centered on deep learning. Traditional methods depend on manual feature design, which is cumbersome and inefficient. In contrast, MOT approaches leveraging deep neural networks have become the dominant paradigm, offering enhanced adaptability and robustness across various scenarios. These techniques are primarily classified into Tracking-By-Detection (TBD) and Joint Detection and Tracking (JDT). TBD algorithms [10,11,12] employ a two-stage design that separates the detection and tracking modules, allowing the two modules to be optimized independently; however, this design may result in suboptimal solutions. Conversely, JDT algorithms [13,14,15] combine detection and tracking in one framework by adding prediction modules or embedding branches to detectors. While this integration improves inference speed and surpasses TBD algorithms in straightforward scenarios, it is less effective in complex environments. Therefore, in the domain of UAV vision, MOT primarily adopts the TBD paradigm, and we likewise employ a detection-based MOT framework for UAV vision. TBD-based trackers link detected targets across frames into complete trajectories using appearance, motion, and other characteristics.
The TBD framework primarily depends on object detection, with most tracking systems employing deep neural networks for re-identification (ReID) to capture distinctive visual characteristics that enable target differentiation. In practice, variations in the distance between the vehicle and the UAV, as well as in the camera angle, can lead to significant differences in the size, aspect ratio, and texture details of the same target, which affects both detection and appearance-based re-identification. Many researchers [16,17] have used multi-scale feature-fusion methods to combine information from different feature levels and make full use of multi-scale outputs. However, the fusion process often relies on simple concatenation, which does not reflect the correlation between different feature levels and lacks information interaction. Efficient feature fusion therefore remains an open research direction.
Vehicle re-identification methods commonly utilize convolutional neural networks (CNNs) for vehicle feature extraction, followed by a distance metric loss to optimize the distances between these features. For UAV-based vehicle re-identification, vehicles under different viewpoints often exhibit fundamentally different visual appearances, leading to different feature distributions at the feature level. As discussed in existing work [18], the feature distance for the same vehicle viewed from different angles can be larger than that for different vehicles viewed from the same angle. Refs. [18,19] incorporated viewpoint features to enhance the robustness of features against viewpoint variations. Additionally, information at different scales exhibits distinct distributions in the feature space [20], highlighting the critical role of hierarchical feature learning in re-identification tasks. Designing multi-scale feature-fusion strategies is therefore essential to make features robust to scale differences.
To overcome the missed detections and false positives arising from scale variations on UAV platforms, we propose an efficient multi-object tracking algorithm with multi-scale feature fusion for scale-variant vehicle targets in drone view, built on BoT-SORT [21]. In the detection phase, unlike detection methods that rely on simple concatenation for feature fusion, we propose a multi-scale feature alignment and aggregation object detection approach based on YOLOv8 [22]. We introduce a Feature Alignment Aggregation Module (FAAM) in the neck of the network to address feature misalignment and propose a Bidirectional Path Aggregation Network (BPAN) to enhance the multi-scale feature-fusion capability. For extracting appearance features during vehicle re-identification, we introduce a Feature Pyramid Network (FPN) [23] on top of the OSNet architecture [24] to capture pixel dependencies across multiple feature maps. The FPN then aggregates features from different levels, yielding features that combine low-level detail with high-level semantics. Additionally, we incorporate the Convolutional Block Attention Module (CBAM) [25] attention mechanism to help the model concentrate on informative image regions, thereby boosting the accuracy of the re-identification model.
By optimizing detection and appearance feature extraction as two separate tasks, the FB-YOLOv8 method enhances target localization accuracy, and the improved re-identification approach extracts richer appearance features from the provided target locations, offering reliable distance metrics for trajectory association. Experimental results show that the proposed modules achieve strong tracking and association performance when targets undergo scale variations, demonstrating the effectiveness of the proposed approach on the UAVDT dataset [26]. The key contributions are outlined below:
(1) We propose an FB-YOLOv8 network to achieve higher detection precision of multi-scale objects in intricate scenarios. This framework addresses the challenge of scale variation in vehicle tracking by incorporating a Feature Alignment Aggregation Module (FAAM) and a Bidirectional Path Aggregation Network (BPAN), which significantly improves the detection capabilities in environments featuring various object dimensions and viewpoints.
(2) We propose a multi-scale feature-fusion network based on the OSNet backbone (MSFF-OSNet) for vehicle re-identification. By integrating OSNet with a Feature Pyramid Network (FPN) and Convolutional Block Attention Module (CBAM) attention mechanism, our approach not only captures pixel dependencies across multiple feature maps but also focuses on salient image regions.
(3) We propose a MOT algorithm with multi-scale feature fusion; experimental results on the UAVDT dataset demonstrate the superiority of the proposed method. The fusion of the FB-YOLOv8 detection network with the MSFF-OSNet re-identification network reduces missed detections and identity switches.
The subsequent sections are organized as follows: Section 2 offers an overview of prior research. Section 3 introduces the detection and re-identification methods that constitute our tracking framework. Section 4 presents the experimental results on the UAVDT dataset, illustrating the efficacy of the proposed tracking approach in comparison with state-of-the-art methods. Finally, Section 5 concludes the paper.
3. Methodology
3.1. Overall Framework
In our multi-object tracking method, two innovative modules are introduced to enhance the accuracy and stability of tracking targets with varying scales. The framework of our proposed tracking system is shown in Figure 2. Firstly, each frame from the video sequence is sequentially fed into the tracker, where the FB-YOLOv8 detector is utilized to obtain more accurate target bounding boxes and their categories within the sequence. Subsequently, the detection results are fed into a multi-scale feature-fusion appearance feature-extraction network (MSFF-OSNet) to acquire the appearance features. Concurrently, the Kalman filter algorithm predicts the bounding boxes of targets in the following frames by utilizing the detection outcomes of the current frame. FB-YOLOv8 aims to enhance detection performance and mitigate missed detections, while MSFF-OSNet is designed to extract more discriminative identity features, reducing identity switches. The outputs of these two modules are ultimately quantified as distances between feature vectors and used to formulate the association process as a global assignment problem. Finally, a matching algorithm is employed to associate the detected targets with existing trajectories.
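To make this data flow concrete, the sketch below outlines one tracking iteration in Python. The detector, ReID model, and associator are passed in as callables standing for FB-YOLOv8, MSFF-OSNet, and the BoT-SORT association stage described in Section 3.4; the exact interfaces are illustrative assumptions rather than the released implementation.

```python
from typing import Callable

import numpy as np


def track_frame(frame: np.ndarray,
                detector: Callable,    # FB-YOLOv8: frame -> (boxes, scores, classes)
                reid_model: Callable,  # MSFF-OSNet: list of crops -> appearance embeddings
                associator):
    """One iteration of the proposed tracking-by-detection pipeline (illustrative).

    `associator` is assumed to wrap the Kalman filter prediction step and the
    IoU/appearance cost fusion plus matching of BoT-SORT (Section 3.4).
    """
    # 1. Detect vehicles in the current frame.
    boxes, scores, classes = detector(frame)  # boxes as (x1, y1, x2, y2)

    # 2. Crop each detection and extract discriminative appearance features.
    crops = [frame[int(y1):int(y2), int(x1):int(x2)] for (x1, y1, x2, y2) in boxes]
    embeddings = reid_model(crops)

    # 3. Predict existing tracks forward and solve the global assignment
    #    between predicted tracks and current detections.
    return associator.update(boxes, scores, classes, embeddings)
```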
3.2. FB-YOLOv8
To achieve better object detection, we choose the YOLOv8 algorithm as the base model. The YOLOv8 family includes YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x. YOLOv8n is the lightest version in the series, prioritizing speed and real-time performance while maintaining good detection accuracy; we therefore select YOLOv8n as the baseline of our method. The architectural configuration of YOLOv8 is illustrated in Figure 3. Although the YOLOv8 algorithm is already quite efficient, its detection performance in scenes with multi-scale changes is suboptimal, because the network only employs concatenation when fusing the extracted features. This fusion method lacks interactive information exchange, which leads to suboptimal detection results for targets of varying sizes. Therefore, we propose the FB-YOLOv8 model to obtain higher detection accuracy when UAV images undergo multi-scale changes.
The comprehensive structure of the FB-YOLOv8 module is depicted in Figure 4. To align feature maps of different resolutions, we integrate a Feature Alignment Aggregation Module (FAAM) in the neck and propose a Bidirectional Path Aggregation Network (BPAN) to further improve the network's multi-scale fusion capability. Meanwhile, a shallow detection head is appended in the prediction stage to enhance the detection of tiny targets.
3.2.1. Feature Alignment Aggregation Module
As the network deepens, the feature information derived from the shallow and deep layers becomes increasingly inconsistent. The shallow layers tend to capture texture, color, and edge-based features, providing a detailed representation of object contours. In contrast, features from the deep layers are more generalized and carry high-level semantics, but they lack the fine contour details supplied by the shallow layers. Simply upsampling the low-resolution features and concatenating them with high-resolution features does not account for the spatial and semantic alignment between features. To address the inaccurate segmentation edges caused by this problem, Huang et al. [58] introduced AlignSeg, which includes two main components: AlignFA for feature aggregation and AlignCM for context modeling. AlignFA uses a simple trainable interpolation method to learn pixel transformation offsets, which effectively relieves the feature misalignment caused by multi-resolution feature fusion. For small targets, spatial information gradually weakens as the network deepens, so shallow features need to be better integrated with high-resolution features to preserve spatial information. Drawing inspiration from the AlignSeg network, we introduce the FAAM, integrated into the neck pyramid network, to align two adjacent features of different resolutions. The alignment operation enables features of different resolutions to correspond spatially, thus enhancing the model's capability to handle both detailed and global information and improving the detector's ability to accurately locate tiny targets.
The structural configuration of the FAAM is illustrated in Figure 5. We first upsample the low-resolution feature $F_{low}$ using bilinear interpolation. Then, the upsampled feature $F_{low}^{up}$ and the high-resolution feature $F_{high}$ are concatenated. The concatenated features are processed through a depthwise separable convolution module, consisting of a depthwise convolution and a pointwise convolution, to predict an offset. The depthwise convolution layer with a kernel size of 3 × 3 extracts the local spatial information surrounding each sampling location, which is then fed into the pointwise convolution to produce a two-dimensional offset map $\Delta$. The offset map $\Delta$ and the high-resolution feature $F_{high}$ are input into the alignment function to obtain a spatially aligned high-resolution feature map. Finally, the aligned feature map is added to the upsampled feature map to create a fused feature map that contains both detailed and global information. Mathematically, feature alignment can be expressed as
$$\Delta = f_{ds}\left(\mathrm{Concat}\left(F_{low}^{up}, F_{high}\right)\right), \qquad F_{fuse} = \mathcal{A}\left(F_{high}, \Delta\right) + F_{low}^{up},$$
where $f_{ds}(\cdot)$ denotes the depthwise separable convolution that learns the offset.
Assuming that the spatial coordinates of each location on the feature map are $(h, w)$, with $h \in \{1, \dots, H\}$ and $w \in \{1, \dots, W\}$, and the offset map is $\Delta \in \mathbb{R}^{H \times W \times 2}$, the output of the offset alignment function $\mathcal{A}$ is
$$\mathcal{A}\left(F_{high}, \Delta\right)_{h,w} = F_{high}\left(h + \Delta^{(1)}_{h,w},\; w + \Delta^{(2)}_{h,w}\right),$$
where $\Delta_{h,w} = \left(\Delta^{(1)}_{h,w}, \Delta^{(2)}_{h,w}\right)$ denotes the two-dimensional offset learned for position $(h, w)$.
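To illustrate the alignment mechanism, the following PyTorch sketch implements a minimal FAAM-style module: bilinear upsampling, offset prediction by a depthwise separable convolution, offset-based warping of the high-resolution feature via grid_sample, and additive fusion. Channel sizes, the warping convention, and the absence of normalization layers are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FAAM(nn.Module):
    """Minimal sketch of a Feature Alignment Aggregation Module."""

    def __init__(self, channels: int):
        super().__init__()
        # Depthwise separable convolution that predicts a 2-D offset per pixel.
        self.offset_conv = nn.Sequential(
            nn.Conv2d(2 * channels, 2 * channels, kernel_size=3,
                      padding=1, groups=2 * channels),   # 3x3 depthwise
            nn.Conv2d(2 * channels, 2, kernel_size=1),   # 1x1 pointwise -> (dx, dy)
        )

    def forward(self, f_low: torch.Tensor, f_high: torch.Tensor) -> torch.Tensor:
        # 1. Bilinearly upsample the low-resolution feature to the high resolution.
        f_up = F.interpolate(f_low, size=f_high.shape[-2:],
                             mode="bilinear", align_corners=False)

        # 2. Concatenate and predict the per-pixel offset map (B, 2, H, W).
        offset = self.offset_conv(torch.cat([f_up, f_high], dim=1))

        # 3. Warp the high-resolution feature with the learned offsets.
        b, _, h, w = f_high.shape
        ys, xs = torch.meshgrid(torch.arange(h, device=f_high.device),
                                torch.arange(w, device=f_high.device),
                                indexing="ij")
        grid = torch.stack((xs, ys), dim=-1).float()        # (H, W, 2) in pixels
        grid = grid + offset.permute(0, 2, 3, 1)            # add (dx, dy) -> (B, H, W, 2)
        gx = 2.0 * grid[..., 0] / max(w - 1, 1) - 1.0       # normalize to [-1, 1]
        gy = 2.0 * grid[..., 1] / max(h - 1, 1) - 1.0
        f_aligned = F.grid_sample(f_high, torch.stack((gx, gy), dim=-1),
                                  mode="bilinear", padding_mode="border",
                                  align_corners=True)

        # 4. Fuse aligned high-resolution detail with upsampled global context.
        return f_aligned + f_up
```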
3.2.2. Bidirectional Path Aggregation Network
The deep features in a convolutional network contain richer semantic information, while the shallow features contain more positional information. However, this distribution of information poses challenges for object detection. Although the deep layers can capture semantic features, the spatial resolution of their feature maps is low, which hinders precise localization of targets; this constraint is especially pronounced when detecting small objects. Conversely, the shallow layers contain more positional information but lack semantic features, which leads to suboptimal outcomes in classification. To address this issue, the Feature Pyramid Network (FPN) implements a top-down information propagation strategy to efficiently integrate and represent features across multiple scales. Recursive-FPN [59] introduced a recursive feature-fusion technique to handle multi-scale features more effectively, and BiFPN [60] proposed a bidirectional feature-fusion approach to enhance multi-scale feature fusion within FPN. Drawing inspiration from the BiFPN architecture and building on the PANet structure in the neck of YOLOv8, we introduce a novel approach termed the Bidirectional Path Aggregation Network (BPAN). It fuses the original input nodes with the output node features at the same resolution, incorporating more features without adding much cost, to promote the flow of information. The architectures of both BiFPN and the proposed BPAN are depicted in Figure 6.
The proposed BPAN module receives the four feature maps produced by the backbone as its input, ordered from bottom to top; the bottom feature map possesses the highest resolution, and each subsequent feature map has half the resolution of its predecessor. We draw inspiration from the BiFPN model and integrate bidirectional skip connections within the intermediate layers of the module. This approach not only effectively captures feature information across different scales but also avoids excessive parameter growth, oversized model complexity, and feature degradation. Concurrently, we add a shallow detection head to the existing three, further enhancing the detection precision for small targets.
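As an illustration of the fusion scheme, the sketch below implements one BPAN-style output node in PyTorch, combining the feature arriving along the aggregation path with the original backbone feature of the same resolution through a skip connection. The learnable normalized weights follow BiFPN's fast normalized fusion, and the layer choices (3 × 3 convolution, BatchNorm, SiLU) are assumptions for illustration rather than the exact FB-YOLOv8 configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BPANFusionNode(nn.Module):
    """Sketch of one BPAN output node: path feature + same-resolution skip input."""

    def __init__(self, channels: int):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(2))  # one learnable weight per input
        self.fuse = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(inplace=True),
        )

    def forward(self, path_feat: torch.Tensor, skip_feat: torch.Tensor) -> torch.Tensor:
        # Fast normalized fusion (BiFPN-style): non-negative weights summing to ~1.
        w = F.relu(self.weights)
        w = w / (w.sum() + 1e-4)
        fused = w[0] * path_feat + w[1] * skip_feat
        return self.fuse(fused)
```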
3.3. MSFF-OSNet
When tracking vehicle targets in UAV videos, factors such as distance variation and angle changes cause the same vehicle to exhibit different appearances at different times, increasing the intra-class variability of vehicle appearance features and affecting the accuracy of multi-object matching. To boost the distinctive characteristics of the targets, we propose a multi-scale feature-fusion appearance feature-extraction network based on the OSNet backbone (MSFF-OSNet). This network incorporates a feature pyramid structure in the backbone to extract feature information representing different semantics and scales from various stages of the backbone. Additionally, the CBAM attention mechanism is introduced during the fusion of multilevel features to focus on more salient information, thereby extracting more robust features. These features are then used for appearance feature matching and association, effectively reducing issues such as ID switching caused by vehicle scale variations. The network structure is depicted in Figure 7.
OSNet employs multiple Lite layers to form multiple convolutional stream branches, thereby bolstering its capability to comprehend objects across different levels of detail. The unified aggregation gate (AG) improves multi-scale feature learning and adaptability to different input images, thereby enhancing performance across scales. Additionally, OSNet incorporates depthwise separable convolutions, significantly decreasing the number of parameters in the model. Overall, the lightweight OSNet effectively combines multiple uniform scale features with dynamic aggregation gates and demonstrates exceptional capability in learning non-uniform scale features. Consequently, in scenarios involving variable-scale vehicle tracking, replacing the original ResNet50 re-identification network in the tracker with the OSNet network yields superior performance.
The feature pyramid is an hourglass-shaped structure composed of two pathways: bottom-up and top-down. The bottom-up pathway typically consists of a convolutional network for feature extraction, where the feature map scale decreases progressively, enabling the detection of higher-level structures and increasing the semantic information of the network layers. In contrast, the top-down pathway constructs higher-resolution feature maps based on layers with richer semantics. The standard feature pyramid upsamples high-level, low-scale features and complements them with low-level, high-scale feature information, thereby achieving the extraction of features at different scales. The proposed multi-scale feature-extraction network in this study removes the upsampling part of the standard feature pyramid. When fusing features of different scales, it captures key information through the adaptive feature recalibration of an attention mechanism, better integrating shallow detail features with deep semantic features.
Specifically, the OSNet network fuses feature maps C1 to C5 of different scales from Conv1 to Conv5; the feature-fusion module is depicted in Figure 8. The features C2 and C3 are upsampled and adjusted in channel number using a 1 × 1 convolutional layer to match C1 and C2, respectively, while the features C4 and C5 are adjusted in channel number using a 1 × 1 convolutional layer to match C3 and C4, respectively. Taking the fusion of C1 and C2 as an example, C2 is upsampled and adjusted in channel number using a 1 × 1 convolutional layer to match C1, resulting in D2. The enhanced features of C1 obtained through the CBAM attention module are then element-wise added to D2 and passed to the next layer. The same operation is performed on subsequent feature maps to generate the final output feature map.
CBAM comprises two sub-modules, the Channel Attention Module (CAM) and the Spatial Attention Module (SAM), which reweight the feature map to highlight key information. CAM emphasizes the interconnections among the channels of the feature map, evaluating the importance of each channel. The input feature map is processed through global max pooling, average pooling, and a shared multilayer perceptron (MLP) that performs dimensionality reduction and expansion; the two branches are then summed and passed through a Sigmoid activation, ultimately yielding the channel attention map $M_c$. CAM can be expressed as:
$$M_c(X) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(X)) + \mathrm{MLP}(\mathrm{MaxPool}(X))\big)$$
The max pooling and average pooling operations in the SAM module compress the channels, whose number is reduced to one after convolution, followed by a Sigmoid activation. This enables the network to focus on the important pixel regions while disregarding less relevant areas. SAM can be expressed as:
$$M_s(X) = \sigma\big(f^{5 \times 5}([\mathrm{AvgPool}(X); \mathrm{MaxPool}(X)])\big)$$
In the above two equations, $X$ represents the input feature map, $\mathrm{AvgPool}$ denotes average pooling, $\mathrm{MaxPool}$ represents global max pooling, $\mathrm{MLP}$ stands for the multilayer perceptron, $\sigma$ denotes the Sigmoid activation function, and $f^{5 \times 5}$ represents a convolution with a kernel side length of 5.
Therefore, the overall expression of the CBAM module can be derived as:
$$X' = M_c(X) \otimes X, \qquad \mathrm{CBAM}(X) = M_s(X') \otimes X'$$
Taking the fusion of C1 and C2 as an example, C2 is upsampled and adjusted in channel number to obtain D2, expressed as follows:
$$D_2 = \mathrm{Conv}_{1 \times 1}\big(\mathrm{Upsample}(C_2)\big)$$
Finally, the output feature F obtained from the feature-fusion module after processing C1 and D2 is expressed as follows:
$$F = \mathrm{CBAM}(C_1) \oplus D_2$$
where $\otimes$ denotes element-wise multiplication and $\oplus$ denotes element-wise addition.
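For concreteness, the following PyTorch sketch implements a standard CBAM block together with the C1/C2 fusion step described above. The 5 × 5 spatial kernel and the element-wise addition follow the text; the channel-reduction ratio, pooling implementation, and upsampling mode are common-default assumptions rather than the exact MSFF-OSNet configuration.

```python
import torch
import torch.nn as nn


class CBAM(nn.Module):
    """Standard CBAM block: channel attention followed by spatial attention."""

    def __init__(self, channels: int, reduction: int = 16):  # reduction ratio assumed
        super().__init__()
        # CAM: shared MLP (1x1 convs) over avg- and max-pooled channel descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )
        # SAM: 5x5 convolution over the pooled spatial descriptors.
        self.spatial = nn.Conv2d(2, 1, kernel_size=5, padding=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel attention M_c.
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention M_s.
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(pooled))


class AdjacentLevelFusion(nn.Module):
    """Sketch of one MSFF fusion step: D2 = Conv1x1(Upsample(C2)), F = CBAM(C1) + D2."""

    def __init__(self, c_low: int, c_high: int):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.reduce = nn.Conv2d(c_high, c_low, kernel_size=1)  # match channels of C1
        self.cbam = CBAM(c_low)

    def forward(self, c1: torch.Tensor, c2: torch.Tensor) -> torch.Tensor:
        d2 = self.reduce(self.up(c2))   # upsample C2 and adjust channel number
        return self.cbam(c1) + d2       # element-wise addition, passed to the next layer
```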
3.4. Matching Strategy
In this section, we describe in detail the matching approach used by our tracking system to associate detected objects across successive frames. Our method builds on the framework of the BoT-SORT tracker.
On the one hand, BoT-SORT integrates IoU and ReID cues through cosine-distance fusion, effectively addressing the ID switches caused by appearance variations during tracking. Specifically, the algorithm fuses the IoU distance matrix and the cosine distance matrix for matching. The IoU distance indicates the degree of overlap between target bounding boxes, while the cosine similarity measures the resemblance between the appearance features of targets. During matching, BoT-SORT first gates the candidates, rejecting candidate boxes with low cosine similarity or large spatial separation; in other words, it excludes candidate targets whose appearance deviates significantly from the current trajectory or whose positional overlap with it is minimal. Subsequently, the smaller of the two distances for each element is used as the final value of the cost matrix C. The mathematical expression for the IoU-ReID fusion pipeline is as follows:
$$\hat{d}^{cos}_{i,j} = \begin{cases} 0.5 \cdot d^{cos}_{i,j}, & \left(d^{cos}_{i,j} < \theta_{emb}\right) \wedge \left(d^{iou}_{i,j} < \theta_{iou}\right) \\ 1, & \text{otherwise} \end{cases}$$
$$C_{i,j} = \min\left\{ d^{iou}_{i,j},\; \hat{d}^{cos}_{i,j} \right\}$$
where $C_{i,j}$ denotes the $(i, j)$ element of the cost matrix C, $d^{iou}_{i,j}$ represents the IoU distance between the predicted bounding box of the i-th tracklet and the bounding box of the j-th detection, quantifying the motion cost, and $d^{cos}_{i,j}$ represents the cosine distance between the mean appearance descriptor of the i-th tracklet and the descriptor of the j-th detection. $\hat{d}^{cos}_{i,j}$ is the newly defined appearance cost. The neighborhood threshold $\theta_{iou}$, set to 0.5, is used to discard improbable pairs of tracklets and detections, and the appearance threshold $\theta_{emb}$, set to 0.25, distinguishes positive associations between tracklet appearance states and detection embedding vectors from negative ones.
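A minimal NumPy sketch of this fusion rule is given below; the function name and the array layout (tracklets as rows, detections as columns) are illustrative assumptions.

```python
import numpy as np


def fuse_costs(iou_dist: np.ndarray,
               cos_dist: np.ndarray,
               theta_iou: float = 0.5,
               theta_emb: float = 0.25) -> np.ndarray:
    """Sketch of the IoU-ReID cost fusion used for association.

    iou_dist[i, j]: IoU distance between tracklet i and detection j.
    cos_dist[i, j]: cosine distance between their appearance descriptors.
    """
    # Gate the appearance cost: keep (and halve) it only when both appearance
    # and spatial cues are plausible; otherwise set it to 1.
    plausible = (cos_dist < theta_emb) & (iou_dist < theta_iou)
    appearance_cost = np.where(plausible, 0.5 * cos_dist, 1.0)

    # Final cost: element-wise minimum of motion and gated appearance costs.
    return np.minimum(iou_dist, appearance_cost)
```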
On the other hand, while most trackers employ a Kalman filter state vector as represented by Equation (11), BoT-SORT utilizes the state vector defined by Equation (12), which directly estimates the width and height of the targets, making it more adaptable to changes in aspect ratio when the target undergoes scale variations:
$$\mathbf{x} = [x_c, y_c, a, h, \dot{x}_c, \dot{y}_c, \dot{a}, \dot{h}]^{\top} \quad (11)$$
$$\mathbf{x} = [x_c, y_c, w, h, \dot{x}_c, \dot{y}_c, \dot{w}, \dot{h}]^{\top} \quad (12)$$
where $(x_c, y_c)$ is the bounding-box center, $a$ denotes the aspect ratio, and $w$ and $h$ denote the bounding-box width and height.
Overall, the FB-YOLOv8 model is first utilized to acquire the positions of target vehicles in video frames. Subsequently, the positional information is fed into BoT-SORT, which assigns a distinct identifier to every target vehicle and tracks them accordingly. The framework of the BoT-SORT tracker is depicted in Figure 9.
Firstly, the input image sequence is processed by FB-YOLOv8 to obtain detection results. By setting high-confidence and low-confidence thresholds, the predicted detection boxes are divided into three categories: high-confidence boxes, low-confidence boxes, and boxes below the low-confidence threshold, which are directly discarded. Detection boxes with scores above the high-confidence threshold are prioritized for the first round of association with the trajectory boxes predicted by the Kalman filter. This first round of matching considers both the IoU distance between the boxes and the cosine similarity of the targets' appearance features. A successful match indicates successful tracking, and the target trajectory is updated. High-confidence detection boxes that fail to match existing trajectories may represent new targets and therefore initiate new trajectories. The remaining unmatched trajectories then enter a second round of association with the low-confidence detection boxes; this second round relies solely on the IoU distance. Upon a successful match, the corresponding trajectory is updated. Trajectories that remain unmatched after the second round are retained for up to 30 frames for potential re-tracking; if no match is found within these 30 frames, the trajectory is discarded.
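The sketch below illustrates this two-round matching logic with SciPy's Hungarian solver. It assumes the confidence-based split of detections and the construction of the two cost matrices (fused IoU/appearance cost for the first round, IoU-only cost for the second) are done by the caller; the gating thresholds are illustrative values, not necessarily those used in the experiments.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def _match(cost: np.ndarray, gate: float):
    """Hungarian matching; pairs whose cost exceeds `gate` are rejected."""
    if cost.size == 0:
        return [], list(range(cost.shape[0])), list(range(cost.shape[1]))
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= gate]
    mr = {r for r, _ in matches}
    mc = {c for _, c in matches}
    unmatched_rows = [r for r in range(cost.shape[0]) if r not in mr]
    unmatched_cols = [c for c in range(cost.shape[1]) if c not in mc]
    return matches, unmatched_rows, unmatched_cols


def cascaded_association(fused_cost_high: np.ndarray, iou_cost_low: np.ndarray):
    """Two-round association sketch.

    fused_cost_high: tracks x high-confidence detections (fused IoU/appearance cost).
    iou_cost_low:    tracks x low-confidence detections (IoU distance only).
    Gate values (0.8, 0.5) are illustrative assumptions.
    """
    # Round 1: high-confidence detections vs. Kalman-predicted tracks.
    matches_1, unmatched_tracks, new_track_candidates = _match(fused_cost_high, 0.8)

    # Round 2: remaining tracks vs. low-confidence detections, IoU only.
    idx = np.asarray(unmatched_tracks, dtype=int)
    matches_2_local, lost_local, _ = _match(iou_cost_low[idx, :], 0.5)
    matches_2 = [(unmatched_tracks[r], c) for r, c in matches_2_local]
    lost_tracks = [unmatched_tracks[r] for r in lost_local]

    # Unmatched high-confidence detections spawn new tracks; `lost_tracks`
    # are retained for up to 30 frames before being discarded.
    return matches_1, matches_2, new_track_candidates, lost_tracks
```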
4. Experiments
4.1. Datasets and Implementation Details
Our algorithm is trained and verified on the UAVDT dataset. UAVDT is an extensive and challenging benchmark for large-scale UAV detection and tracking, containing 100 video sequences with about 80,000 frames at an image size of 1024 × 540. It covers three fundamental tasks: object detection, single-object tracking, and multi-object tracking. There are 50 video sequences for the multi-object tracking task, divided into 30 training videos and 20 test videos. The targets are divided into three classes: car, truck, and bus.
The training and test images of UAVDT are resized to 1024 × 1024, and the batch size is set to 4. The optimizer is SGD with a learning rate of 0.01, a momentum of 0.9, and a weight decay coefficient of 0.0005. The number of training epochs is set to 300, and training converges after roughly 50 epochs. The experimental computer is equipped with a 12th Gen Intel(R) Core(TM) i7-12700 2.10 GHz processor and an NVIDIA RTX 3080 GPU with 10 GB of memory. The experiments are implemented in PyCharm (version 2022.1) using the PyTorch 1.13.1 framework.
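For reference, the reported optimizer settings correspond to the following PyTorch configuration; `model` stands for the network being trained and is assumed to be defined elsewhere.

```python
import torch


def build_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    """SGD configuration matching the reported training settings:
    learning rate 0.01, momentum 0.9, weight decay 0.0005."""
    return torch.optim.SGD(model.parameters(),
                           lr=0.01,
                           momentum=0.9,
                           weight_decay=0.0005)
```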
4.2. Object Detector Experiments
To illustrate the efficacy of FB-YOLOv8, our approach is benchmarked against several state-of-the-art models on the UAVDT dataset. The evaluation metrics include mean average precision at an IoU threshold of 0.5 (mAP50), the number of parameters (params), and giga floating-point operations (GFLOPs). The comparison results are listed in Table 1, and the PR curves of the baseline model and the improved model are shown in Figure 10. Our method attains 34.7% mAP50, an improvement over the 32.2% of the baseline YOLOv8n. Regarding resource requirements, our approach has a computational complexity of 12.6 GFLOPs and 2.97 million parameters, 0.04 million fewer than the original YOLOv8n. Our method surpasses the other approaches, achieving the highest mAP50.
The ablation study on the UAVDT test set demonstrates the contributions of the FAAM and BPAN modules in FB-YOLOv8. The experimental results are shown in Table 2. The baseline model achieves an mAP50 of 32.2%. Adding FAAM alone improves mAP50 to 33.1%, with gains in car (68.2%) and truck (7.1%) detection. BPAN alone raises mAP50 to 33.4%, primarily enhancing car detection (69.9%). Combining both modules yields the highest mAP50 of 34.7%, with car detection peaking at 75.5% and truck detection at 7.2%, though bus detection drops slightly to 21.4%. This highlights the complementary roles of FAAM and BPAN in improving overall detection performance, particularly for cars and trucks.
The results in Table 1 are obtained on the test set officially released by UAVDT. Since our method primarily enhances the handling of multi-scale variations of targets in UAV images, and some of the test video sequences contain environmental factors such as fog, the mAP50 on the full test set is only 2.5% higher than the baseline. To evaluate the algorithm in multi-scale-variation scenarios and to screen out video sequences affected by environmental factors such as fog, four video sequences (M1001, M0801, M0802, and M1301) are used as a new test set. The test outcomes are presented in Table 3.
Owing to the feature alignment aggregation module, our method enhances the effect of feature fusion and improves the detection of targets at different scales. The mAP50 of FB-YOLOv8 reaches 54.7%, a 4.4% improvement over the baseline method, demonstrating the effectiveness of our approach in multi-scale-variation scenarios. In these video sequences, where small targets constitute the majority, the alignment operation enables features of different resolutions to correspond spatially, enhancing the model's ability to process both detailed and global information. This greatly improves the model's capability to accurately localize small targets, resulting in a substantial increase in detection accuracy. Furthermore, we test each of the four video sequences M1001, M0801, M0802, and M1301 individually; the outcomes are presented in Table 4. The results indicate that our method outperforms the baseline on every video sequence with multi-scale variation.
The visualized detection results of YOLOv8n and our proposed method on the UAVDT dataset are depicted in Figure 11. Our model performs better in scenes with multi-scale changes, identifies distant, densely packed small targets more accurately, and adapts better to complex environmental conditions.
4.3. Re-Identification Experiments
The vehicle re-identification experiments aim to tackle the changes in appearance of the same vehicle caused by varying camera angles and distances. These variations tend to diminish the efficacy of traditional object detection and tracking methods. By training the ReID model to associate the diverse appearance patterns of the same vehicle across various scenes, the recognition accuracy and tracking stability in scenarios with scale variations are enhanced.
We construct a vehicle re-identification dataset with scale variations following the format of the Market-1501 dataset and then conduct experiments on it. We selected sequences M0101, M0201, M0501, M0605, M0703, M1304, M1305, and M1306 from the UAVDT training set, which represent diverse scenarios. The annotated targets were cropped and organized into a vehicle ReID dataset in the Market-1501 format. By localizing and cropping vehicles from 5586 original images, a total of 97,890 images were obtained, encompassing 431 vehicle instances. A comparison between the original images and the re-identification data in the UAVDT dataset is illustrated in Figure 12.
We trained the proposed MSFF-OSNet model on this vehicle re-identification dataset to obtain the trained weights. The mAP-Epoch curves of the OSNet and MSFF-OSNet models during training are shown in Figure 13. The curves show that the mAP of MSFF-OSNet on the re-identification dataset is 55.5%, which is 4.3% higher than that of OSNet, indicating improved re-identification accuracy. In MSFF-OSNet, we use a convolutional attention mechanism to guide the re-identification network in adaptively capturing salient vehicle cues. In Figure 14, we present activation-map visualizations and performance graphs of MSFF-OSNet and OSNet. It is evident that, without guidance, the OSNet re-identification network struggles to concentrate on key features and may attend to regions that are salient but unrelated to the vehicle. The activation visualizations therefore indicate that our improved re-identification network extracts more fine-grained and discriminative features for targets with varying angles and levels of clarity.
Ablation studies are conducted to quantitatively assess the impact of integrating FPN and CBAM into the MSFF-OSNet framework. The results are presented in Table 5. Compared to the baseline method, adding only FPN improves IDF1 by 0.4% and reduces the number of identity switches (IDSW) from 1072 to 654. Further incorporating CBAM enhances IDF1 and MOTA by an additional 1.3% and 0.4%, respectively, and reduces the number of identity switches from 1072 to 552. These findings substantiate our claim that the combination of FPN and CBAM effectively mitigates the identity-switching problem.
4.4. Tracking Experiments
To facilitate an extensive comparison, we performed a series of experiments on the UAVDT dataset and benchmarked the outcomes against other well-established algorithms. The results are detailed in Table 6. The UAVDT dataset includes a range of difficult scenarios, such as fast-moving targets, intricate backgrounds, and varying scales, which impose stringent requirements on MOT algorithms. The results demonstrate that our tracker attains optimal performance across various metrics, particularly MOTA, IDF1, and IDSW.
Our tracking system exhibits impressive results on the UAVDT dataset, yielding MOTA and IDF1 scores of 48.7% and 80.8%, respectively, and surpasses the current leading trackers. For instance, our method significantly outperforms Deep OC-SORT, with MOTA increasing from 39.9% to 48.7% and IDF1 improving from 79.9% to 80.8%. FairMOT, a representative high-precision one-shot tracker, is exceeded by 7.1% in MOTA, and our method also achieves a higher IDF1 score (80.8% compared to 80.3%). AsyUAV is a method specialized for multi-object tracking from the UAV view; our method outperforms it by 0.7% in MOTA and reduces IDSW by 71. Furthermore, compared to other trackers, our method significantly decreases missed detections and ID switches. These results indicate that our approach effectively boosts tracking accuracy and stability.
4.5. Ablation Study
To assess the effectiveness of the two primary networks, FB-YOLOv8 and MSFF-OSNet, within the complete tracking system, ablation studies are conducted. Specifically, FB-YOLOv8 is primarily employed to improve detection capabilities, while MSFF-OSNet is used to improve the association accuracy of the same target at diverse scales. The results are summarized in Table 7. To minimize the influence of other environmental factors, such as rain and fog, we selected several sequences with distinct scale-variation characteristics (M1001, M0801, M0802, and M1301) as a new test set, further validating the effectiveness of our method in addressing scale-variation challenges. The corresponding results are displayed in Table 8.
4.5.1. Effectiveness of FB-YOLOv8
FB-YOLOv8 is responsible for generating information such as the location and confidence of the detected targets. From the first two rows of Table 7, it is clear that incorporating FB-YOLOv8 into the baseline model leads to a notable enhancement in MOTA, rising from 40.2% to 48.1%, while IDF1 also increases from 61.8% to 67.9%. These improvements demonstrate the efficacy of the FB-YOLOv8 module in boosting detection capability and tracking accuracy. From the first two rows of Table 8, the addition of FB-YOLOv8 markedly enhances the precision of target tracking in scenarios of varying scale, with MOTA rising from 58.8% to 66.7%.
4.5.2. Effectiveness of MSFF-OSNet
MSFF-OSNet effectively augments the stability of the tracker under scale variations. As shown in the first and third rows of Table 7, MSFF-OSNet improves IDF1 from 61.8% to 63.5% and reduces the number of identity switches from 1072 to 552, demonstrating its capability to enhance the accuracy of target ID association and improve tracking stability. Moreover, using FB-YOLOv8 together with MSFF-OSNet further enhances tracking performance significantly. The fourth row of Table 7 shows that our tracking system achieves a MOTA score of 48.7% and an IDF1 score of 67.9%, and decreases the IDSW from 1072 to 278. As depicted in the first and third rows of Table 8, the addition of MSFF-OSNet significantly improves tracking stability in scale-varying scenarios and reduces association matching errors, with IDF1 increasing from 73.6% to 75.1% and IDSW decreasing from 198 to 152. Combining the two modules increases MOTA from 58.8% to 67.1%, raises IDF1 from 73.6% to 77.3%, and reduces IDSW by 112 instances. This demonstrates that our tracker improves both tracking accuracy and stability under scale-varying scenarios.
4.6. Visualization and Analysis
To clearly illustrate the benefits of our tracker in scenarios with scale variations, we compare the visual results of our proposed tracker and the baseline tracker across different UAV motion patterns, as depicted in Figure 15. When vehicles traverse an intersection, the relative viewing angle with respect to the drone changes, leading to significant alterations in the appearance of the same vehicle. As depicted in the first scenario of Figure 15, vehicles with IDs 14, 29, and 24 are missed by the baseline during the turning process at the intersection, and ID switches subsequently occur when the targets are re-acquired. This indicates that the baseline method is susceptible to target-scale variations, whereas our approach demonstrates remarkable adaptability to changes in target appearance throughout the process, maintaining stable and continuous tracking. In contrast to the favorable visibility during daylight, nighttime tracking presents a more formidable challenge. In the second, nocturnal scenario, variations in the distance between the drone and the targets induce changes in target scale, a situation in which the baseline method proves inadequate: vehicles with IDs 18 and 200 suffer missed detections and identity switches, respectively. Our method, however, consistently detects and tracks these targets with stability. These findings show that our method not only accurately detects targets undergoing scale variations but also maintains continuous and stable tracking of them.
5. Conclusions
To tackle the problem of scale variation in vehicle tracking from a UAV perspective, we propose a novel multi-object tracking method based on the TBD framework. To improve the accuracy of target localization in scenarios with multi-scale variations, we have enhanced the general-purpose object detection algorithm YOLOv8 by introducing a novel multi-scale feature alignment aggregation method named FB-YOLOv8. Furthermore, achieving accurate association matching on the basis of high-quality detection is also of paramount importance. To enhance the accuracy of association when targets undergo scale variations, we propose a multi-scale feature-fusion appearance feature-extraction network (MSFF-OSNet) based on the OSNet backbone. This re-identification network is capable of extracting discriminative features of the targets, enhancing the similarity among the same targets, and increasing the disparity among different targets, thereby improving the accuracy of target association. Owing to the integration of these two components, our tracker is adept at detecting targets and maintaining the continuity of their trajectories even in the challenging scenario of scale variation. On the one hand, an analysis of numerous publicly available vehicle datasets captured by UAV revealed that only the UAVDT dataset aligns with the scale-variation scenarios required for our study. Consequently, we have only utilized the UAVDT dataset for validation purposes. In the future, we intend to capture and compile a dataset with scale variations from a UAV perspective and then make it publicly available to facilitate academic research. On the other hand, while the MSFF-OSNet module improves tracking performance, it also increases computational load. For real-time tracking in UAV applications, computational resources are a critical factor to consider. Therefore, we plan to further investigate the lightweight version of the model in future research.