An Enhanced Algorithm Integrating YOLOv11 and ByteTrack for Small-Object Detection and Tracking in Low-Altitude Remote Sensing Imagery

Han, Jianfeng; Sun, Feijie; Xu, Zihan; Song, Lili; Fang, Jiandong

doi:10.3390/rs18101547


by Jianfeng Han ¹,²,³, Feijie Sun ¹,², Zihan Xu ¹,², Lili Song ¹,²,* and Jiandong Fang ¹,²

¹ School of Information Engineering, Inner Mongolia University of Technology, Hohhot 010080, China
² Inner Mongolia Autonomous Region Key Laboratory of Intelligent Perception and System Engineering, Hohhot 010080, China
³ Intelligent Manufacturing Modern Industry College, Inner Mongolia University of Technology, Hohhot 010051, China
* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(10), 1547; https://doi.org/10.3390/rs18101547

Submission received: 7 April 2026 / Revised: 9 May 2026 / Accepted: 11 May 2026 / Published: 13 May 2026


Highlights

What are the main findings?

The proposed TCYOLO detector integrates enhanced feature modules and cross-scale fusion architecture, achieving superior small target detection in UAV scenarios while maintaining efficiency.
We designed the SofByteTrack tracker with sparse optical flow compensation to mitigate UAV motion-induced tracking drift and identity switches in fast-moving scenarios.

What are the implications of the main findings?

Feature enhancement mechanisms, a cross-scale fusion architecture, and sparse optical flow motion compensation together improve small target detection accuracy and tracking stability.
The technical solutions are applicable to UAVs and mobile vision systems, establishing foundations for future research and supporting intelligent development.

Abstract

In vision-based low-altitude unmanned aerial vehicle (UAV) remote sensing, detecting small targets accurately and maintaining stable tracking under fast-motion conditions remain significant challenges. Specifically, small-object detection suffers from low feature representation, while camera motion often induces tracking drift and identity switches. To address these issues, this paper proposes a novel small target detection and tracking algorithm named TCYOLO-SofByteTrack, which integrates an improved YOLOv11 with ByteTrack. The algorithm comprises two core innovative modules: First, the TCYOLO detector is designed by integrating the C3k2-TA feature enhancement module with triplet attention mechanism to achieve cross-dimensional interaction modeling, significantly improving small target feature representation capability and network contextual awareness. A Cross-Scale Feature Fusion Module for UAVs (CCFM-UAV) is constructed to provide precise detection support for small targets at different scales. Second, building upon the ByteTrack framework, the SofByteTrack tracker is designed, which introduces a sparse optical flow-based motion compensation strategy. This strategy estimates and compensates for image displacement caused by UAV motion in real time, ensuring the stability of target bounding boxes under fast-motion conditions, thereby effectively mitigating tracking drift and identity switches. Experimental results demonstrate that the TCYOLO detector achieves a 7.4% improvement in mAP for small target detection compared to the baseline YOLOv11 model. The complete TCYOLO-SofByteTrack tracking algorithm achieves a HOTA score of 45.3%, MOTA of 42.7%, and IDF1 of 57.8%, representing improvements of 4.5%, 5.9%, and 8.0%, respectively, over the baseline methods. Furthermore, the number of successfully tracked targets increased by 37.3%, while identity switches decreased by 23.4%. 
These results demonstrate the notable advantages of the proposed method in small target detection accuracy, tracking precision, and identity consistency. Its generalization capability is further validated on a custom highway inspection dataset. Moreover, deployment tests on an NVIDIA Jetson Orin NX platform show that, compared to YOLOv11n, the proposed algorithm achieves higher detection accuracy while still meeting real-time processing requirements, highlighting its practical applicability in resource-constrained scenarios.

Keywords:

UAV remote sensing; small target detection and tracking; TCYOLO; SofByteTrack; sparse optical flow

1. Introduction

In recent years, with the rapid advancement and widespread adoption of unmanned aerial vehicle (UAV) technology, UAVs have become indispensable tools in fields such as military reconnaissance, disaster rescue, traffic surveillance, and environmental monitoring [1,2]. In these applications, vision-based UAV target detection and tracking technologies play a critical role, with their performance directly influencing the effectiveness and quality of mission execution. Target detection is responsible for identifying and localizing objects of interest in real time, while target tracking ensures continuous monitoring of specific targets in complex and dynamic environments. The synergistic operation of these two components forms the core of UAV vision-based perception modules and provides essential support for autonomous navigation applications.

In practical UAV target detection scenarios, deep learning-based models have become the mainstream choice due to their superior accuracy and efficiency. These methods are generally categorized into two-stage and single-stage detectors. Two-stage algorithms, such as Faster R-CNN, first generate candidate regions and then perform classification and regression [3]. While achieving high accuracy, they suffer from high computational overhead and slow inference speed, making them less suitable for real-time UAV applications. In contrast, single-stage detectors like the YOLO [4] series adopt an end-to-end architecture that directly predicts target locations and categories, offering a superior trade-off between speed and accuracy. Consequently, YOLO-based models have been widely adopted and customized for UAV scenarios.

Despite their success, existing detection algorithms face challenges in small-object recognition, particularly in low-altitude UAV scenes characterized by tiny target sizes, cluttered backgrounds, and dense distributions. To address these issues, Wang et al. [5] introduced the Focal Fast Network Block (FFNB) and integrated the BiFormer attention mechanism to enhance small target detection. However, this improvement introduced two additional detection layers, increasing model complexity and computational cost. Yue et al. [6] proposed a lightweight LE-YOLO algorithm that leverages depthwise separable convolutions and an enhanced neck structure with LGS bottleneck and LGSCSP modules to balance efficiency and accuracy. Wang et al. [7] presented an improved YOLOX-based detector, YOLOX_w, incorporating a path aggregation network, a lightweight subspace attention module, and an optimized loss function to boost detection performance. Nevertheless, these enhancements often lead to increased parameter counts and reduced inference speed, limiting their practicality in real-time UAV deployments.

Subsequent studies have further advanced UAV-based small-object detection. Yan et al. [8] proposed TOE-YOLO, which uses a C3k2-ARC module for rotated feature extraction and a CL-Concat module combining concatenation with channel and spatial attention, achieving notable performance improvements across multiple benchmarks while remaining lightweight. Deng et al. [9] introduced EHDC-YOLO, integrating a Multi-Scale Edge Enhancement module, an Enhanced FPN with P2-level features, and a Dynamic Head with multi-dimensional attention, yielding significant accuracy gains over strong baselines with a modest parameter increase. Han et al. [10] developed LRDS-YOLO, featuring a Light Adaptive-weight Downsampling module, a Re-Calibration FPN, and a dynamic detection head, striking an effective balance between accuracy and efficiency for real-time UAV deployment. These targeted improvements effectively address the challenges of small-object detection in complex UAV scenes.

Recent advances in remote sensing image (RSI) analysis have demonstrated that frequency-aware modeling and denoising-oriented feature enhancement [11] are effective for improving UAV visual perception tasks in complex environments. In UAV remote sensing scenarios, motion blur, sensor noise, and cluttered backgrounds often weaken the discriminability of small targets. To address these issues, frequency-domain feature learning [12] and spectral-spatial fusion [13] strategies have been increasingly explored in tasks such as object detection, semantic segmentation, and image fusion, as they help preserve high-frequency structural details while suppressing redundant background information.

Although the proposed TCYOLO-SofByteTrack framework does not explicitly employ frequency-domain transformations, its design philosophy is closely related to these recent advances. Specifically, the proposed C3k2-TA module enhances cross-dimensional feature interaction and suppresses irrelevant background responses through lightweight attention modeling, while the CCFM-UAV module improves multi-scale feature fusion and fine-grained information propagation. These mechanisms implicitly strengthen the representation of high-frequency target details and improve robustness against background interference in UAV remote sensing scenarios.

In the domain of Multiple Object Tracking (MOT), existing algorithms are primarily divided into two paradigms: Tracking by Detection (TBD) and Joint Detection and Tracking (JDT). TBD methods, such as SORT [14], employ Kalman filtering and the Hungarian algorithm for data association, offering simplicity and low computational cost. However, they are prone to failures in occlusion and crowded scenes. DeepSORT [15] improves robustness by integrating appearance features, yet it still struggles with identity switches under complex backgrounds or rapid motion. ByteTrack [16] further refines association by leveraging low-confidence detection boxes, reducing identity switches and achieving more stable tracking performance. Despite these advances, existing trackers remain vulnerable to tracking loss and trajectory drift in scenarios involving fast-moving targets, occlusion, or significant appearance changes, all of which are common in UAV-based tracking [17].

Although progress has been made in both detection and tracking, several critical challenges remain unaddressed. First, the detection of small targets in UAV imagery is still hindered by high miss and false detection rates, as traditional detectors fail to capture fine-grained features in complex backgrounds [18]. Second, tracking performance degrades significantly under UAV motion, where rapid camera movement exacerbates issues such as identity switching and trajectory interruption. Moreover, classic scenarios such as UAV autonomous target tracking inherently integrate detection, tracking, and flight control, and all three must run simultaneously. This imposes stringent constraints on computational resources, requiring a careful balance between high accuracy and low power consumption. Therefore, any practical solution must achieve favorable trade-offs between computational efficiency and the precision of detection and tracking. These limitations collectively constrain the reliability and applicability of UAV vision systems in complex mission scenarios [19].

To address these challenges, this paper proposes a multi-target detection and tracking algorithm named TCYOLO-SofByteTrack, built upon an improved YOLOv11n detector and the ByteTrack tracker [20]. The main contributions of this work are twofold:

1. We design TCYOLO, a detector that enhances small target representation by integrating the Triplet Attention [21] mechanism into the C3k2 module to construct a feature enhancement layer, and by proposing CCFM-UAV (Cross-Scale Feature Fusion Module for UAV) to optimize multi-level feature integration. This improves detection accuracy for small objects while maintaining computational efficiency.

2. We extend ByteTrack by incorporating an image registration module based on Sparse Optical Flow (SOF), which compensates for image displacement caused by UAV motion in real time [22]. This ensures the stability of detection boxes under dynamic camera motion and significantly improves tracking robustness in complex UAV scenarios.

2. Materials and Methods

2.1. The TCYOLO Small-Object Detection Algorithm

YOLOv11, released by Ultralytics in 2024, builds upon the strengths of YOLOv8. Through a more efficient C3k2 module, an optimized feature fusion mechanism, and an improved detection head design, it further enhances the model’s detection accuracy and inference efficiency [23]. Consequently, while maintaining lightweight characteristics, it significantly improves the perception capability for multi-scale small objects [24]. However, when confronted with the challenges in low-altitude UAV remote sensing—such as extremely small target sizes, dense distributions, and severe background clutter—YOLOv11n still suffers from insufficient feature representation, leading to issues of missed and false detections [25]. To address these limitations, this paper proposes TCYOLO, a targeted improved model. By incorporating multi-scale feature enhancement and attention mechanisms, TCYOLO significantly boosts small-object detection performance while ensuring real-time processing capability. The overall architecture of TCYOLO is illustrated in Figure 1.

As illustrated in Figure 1, the proposed TCYOLO architecture is designed to address these specific challenges. It takes an RGB image of size 640 × 640 × 3 as input. The input image first passes through a convolutional layer with kernel size 3, stride 2, and padding 1, which reduces the spatial dimensions to 320 × 320 while increasing the channel depth. This initial block is followed by a series of convolutional and C3k2-based modules that form the backbone network for feature extraction. To tackle the issues of extremely small target sizes and dense distributions prevalent in UAV imagery, the backbone incorporates two key improvements: a feature enhancement layer to strengthen the representation of small objects, and an optimized fusion module specifically designed to handle densely distributed objects [26].
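The halving of the spatial dimensions in the first layer follows from the standard convolution output-size formula. The snippet below is a minimal sketch of that arithmetic; the helper name `conv_out_size` is illustrative and is not taken from the TCYOLO code.

```python
def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length along one axis of a convolution without dilation."""
    return (size + 2 * padding - kernel) // stride + 1


# Stem convolution described above: kernel 3, stride 2, padding 1 on a 640-pixel axis.
stem_out = conv_out_size(640, kernel=3, stride=2, padding=1)
print(stem_out)  # 320
```

Applying the same layer configuration again would halve the resolution once more, which is how the successive backbone stages reach the coarser strides.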

Feature Enhancement Layer: The C3k2 module integrates the Triplet Attention mechanism. Through its triple-branch parallel design, it simultaneously captures information from both channel and spatial dimensions, enhancing the feature representation capability for small objects with minimal computational overhead, thereby addressing the challenge of feature extraction [27]. In the backbone, C3k2 modules with Triplet Attention (denoted as C3k2-TA) are inserted at multiple scales, as shown in Figure 1, to refine features at different resolutions.

Optimized Fusion for Dense Objects: The CCFM-UAV (Cross-Scale Feature Fusion Module for UAV) is designed specifically to handle densely distributed targets. It strengthens the effective integration of multi-level features and facilitates information flow across scales. In the architecture, CCFM-UAV follows the backbone and incorporates Spatial Pyramid Pooling Fast (SPPF) and C2PSA modules to aggregate multi-scale contextual information before passing it to the detection heads.

2.1.1. C3k2-TA

Variations in target scale, complex object morphology, and cluttered backgrounds pose significant challenges to object detection in UAV low-altitude imagery. Conventional convolutional neural networks employ fixed-grid convolutional kernels for feature extraction, which limits their adaptability to scale variations and geometric deformations of objects, particularly resulting in suboptimal detection accuracy for small-scale or highly deformed targets [28]. In the standard YOLOv11n backbone, the C3k2 module extracts features using stacked convolutional layers with a fixed receptive field, which inherently lacks the flexibility to adaptively weight feature responses across spatial and channel dimensions. To address this limitation, we propose to integrate the Triplet Attention (TA) mechanism into the C3k2 module, constructing the C3k2-TA feature enhancement layer. Specifically, the TA block is sequentially connected after the final convolution layer of the original C3k2 structure, allowing the fused feature maps generated by the C3k2 module to be further refined through cross-dimensional attention weighting. The architecture of the C3k2-TA module is illustrated in Figure 2.

TA employs a triple-branch parallel design that models interdependencies across all three tensor dimensions (channel, height, and width) through dimensional permutation and residual transformation. The first two branches capture the interaction between the channel dimension C and one spatial dimension each (H and W, respectively), implemented via rotation operations and residual techniques to keep the computational cost low; the bottom branch, similar to CBAM, captures spatial dependencies [29]. The outputs of the three branches are aggregated by simple averaging to form the final attention output. The detailed TA mechanism is shown in Figure 3.

Quantitatively, integrating TA into C3k2 introduces no additional parameters (2.58 M vs. 2.58 M for baseline) and only a marginal increase in FLOPs to 6.6 GFLOPs, while improving mAP@0.5 to 33.5%. The ablation study confirms that C3k2-TA outperforms other common attention mechanisms integrated into C3k2 with the lowest computational overhead, achieving a balanced gain in both precision (44.3%) and recall (33.4%). This efficiency is due to TA’s rotation-based cross-dimensional interaction, which reduces complexity compared to fully connected attention designs.

The parallel design of TA is particularly suited to the challenges of UAV low-altitude scenes. In such imagery, small objects often occupy very few pixels with low contrast against cluttered backgrounds. Unlike sequential attention mechanisms that may suppress information across dimensions, TA simultaneously computes attention across all three dimensions and fuses them through averaging, enabling the joint capture of fine-grained channel–spatial correlations crucial for distinguishing small objects from background noise. Moreover, by attending to height and width independently, TA prevents large objects from dominating the attention computation, thereby preserving small-scale feature responses. The resulting enhanced feature representation improves the sensitivity of the network to small-object details, making detection more precise and robust across diverse UAV scenarios [30].
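The three-branch computation can be sketched in NumPy for a single feature map. This is a simplified illustration under stated assumptions, not the authors' implementation: batch normalization is omitted, the learned convolution weights are replaced by random arrays, the convolution is a naive zero-padded cross-correlation, and all function names are hypothetical.

```python
import numpy as np


def z_pool(x: np.ndarray) -> np.ndarray:
    """Stack max and mean over the leading axis: (A, B, C) -> (2, B, C)."""
    return np.stack([x.max(axis=0), x.mean(axis=0)])


def conv2d_same(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Naive zero-padded cross-correlation: x (2, B, C), w (2, k, k) -> (B, C)."""
    k = w.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.empty(x.shape[1:])
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(xp[:, i:i + k, j:j + k] * w)
    return out


def gate(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Z-pool over the leading axis, k x k conv, sigmoid, then rescale x."""
    attn = 1.0 / (1.0 + np.exp(-conv2d_same(z_pool(x), w)))
    return x * attn[None, :, :]


def triplet_attention(x, w_ch, w_cw, w_hw):
    """x has shape (C, H, W); each w_* has shape (2, k, k) with odd k."""
    # Branch 1: rotate to (W, H, C) and pool over W -> attention over (H, C).
    b1 = gate(x.transpose(2, 1, 0), w_ch).transpose(2, 1, 0)
    # Branch 2: rotate to (H, C, W) and pool over H -> attention over (C, W).
    b2 = gate(x.transpose(1, 0, 2), w_cw).transpose(1, 0, 2)
    # Branch 3: no rotation, pool over C -> spatial attention over (H, W).
    b3 = gate(x, w_hw)
    # The three branch outputs are fused by simple averaging.
    return (b1 + b2 + b3) / 3.0


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 5))  # toy (C, H, W) feature map
w_ch, w_cw, w_hw = (rng.normal(scale=0.1, size=(2, 7, 7)) for _ in range(3))
y = triplet_attention(x, w_ch, w_cw, w_hw)
print(y.shape)  # (4, 6, 5)
```

Because every branch only rescales the input by a sigmoid weight, the output keeps the input shape, which is why the block can be appended after the final convolution of C3k2 without changing the surrounding layers.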

2.1.2. CCFM-UAV

The neck structure of YOLOv11n employs a multi-scale feature fusion mechanism based on PAFPN with C3k2 and C2PSA modules, enhancing semantic information transmission through top-down and bottom-up pathways. However, this architecture exhibits limitations in capturing shallow, fine-grained details [31], since repeated upsampling and downsampling operations inevitably cause spatial detail loss, and the channel dimensions of features from different backbone levels are not explicitly aligned before fusion. Moreover, for extremely small objects in UAV perspectives—which may occupy only a few pixels in the input image—the default three-level output (P3–P5, with strides of 8, 16, and 32) lacks a sufficiently high-resolution feature map to retain discriminative spatial information. To address these issues, we propose the CCFM-UAV (Cross-Scale Feature Fusion Module for UAV), a targeted redesign of the neck region optimized for multi-scale small-object detection in UAV low-altitude scenes. The structure of CCFM-UAV is illustrated in Figure 4.

Inspired by the Cross-scale Feature Fusion Module (CCFM) [14] and incorporating channel and spatial attention mechanisms, CCFM-UAV introduces three structural modifications over the standard PAFPN. First, a high-resolution P2 detection head with a stride of 4 (160 × 160 feature map) is added to provide finer spatial granularity for extremely small targets, generated from a shallow backbone feature map through a dedicated 1 × 1 convolution that controls the parameter scale. Prior studies [32,33] have demonstrated that such high-resolution feature maps are beneficial for small-object detection due to their finer granularity and better preservation of fine-grained details. Second, the fusion pathways are optimized by applying a linear 1 × 1 convolution (without activation) before each fusion point to unify channel dimensions to c = 256, and a non-linear 1 × 1 convolution (with activation) prior to upsampling to enhance representational capacity. The outputs of all C3k2 and 3 × 3 convolution blocks within the neck are also standardized to 256 channels. Third, CCFM-UAV employs element-wise addition for fusing features at the same resolution and concatenation for cross-resolution fusion, a combination that better preserves fine spatial structures than the concatenation-dominated approach of PAFPN.

The quantitative benefits of these modifications are validated in the neck comparison experiment. Compared with the PAFPN baseline, CCFM-UAV improves mAP@0.5 by 3.7% and mAP@0.5:0.95 by 2.4%, while simultaneously reducing the parameter count from 2.58 M to 2.08 M (−19.4%) and the model size from 5.8 MB to 4.7 MB (−14.5%). This simultaneous accuracy improvement and parameter reduction is achieved because the unified 256-channel design eliminates redundant high-dimensional intermediate features, while the P2 head adds only minimal parameter overhead. Other mainstream feature fusion designs achieve different trade-offs, but for the specific challenges of UAV small-object detection dominated by spatial resolution constraints, CCFM-UAV demonstrates a uniquely favorable balance between enhanced multi-scale feature fusion efficiency and reduced model complexity.

This design philosophy directly addresses the information distribution characteristics of UAV low-altitude imagery [34]. In such scenes, the primary bottleneck is the limited spatial resolution for objects occupying fewer than 10–15 pixels rather than insufficient semantic complexity. The P2 head doubles the spatial sampling density along each axis for the smallest targets (stride 4 instead of stride 8), while channel unification prevents the over-allocation of parameters to mid-level features that contribute less to small-object discrimination. By ensuring shallow detail-rich features and deep semantic features are compatible before fusion, CCFM-UAV maximizes the utilization of fine-grained spatial cues for more accurate small-object detection [35]. This makes the model better suited for the dense, multi-scale target distributions that characterize UAV-based remote sensing tasks.
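The effect of the additional P2 level can be made concrete with a short calculation. The sketch below assumes the 640 × 640 input and the strides stated above; the 12-pixel object width is an illustrative value chosen from the 10–15 pixel range, and the helper names are hypothetical.

```python
def grid_size(input_size: int, stride: int) -> int:
    """Side length of the feature map produced at a given stride."""
    return input_size // stride


def cells_spanned(object_px: float, stride: int) -> float:
    """Number of feature-map cells an object spans along one axis."""
    return object_px / stride


# Compare the detection levels for a 12-pixel-wide object (assumed size).
for name, stride in [("P2", 4), ("P3", 8), ("P4", 16), ("P5", 32)]:
    print(name, grid_size(640, stride), cells_spanned(12, stride))
```

At stride 8 such an object covers 1.5 cells per axis, whereas at stride 4 it covers 3, which is the extra spatial resolution the P2 head contributes.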

2.2. SofByteTrack Object Tracking Algorithm

2.2.1. The ByteTrack Algorithm

ByteTrack is an efficient detection-based multi-object tracking (MOT) algorithm. Its core idea is to fully utilize the bounding box information output by the object detector for trajectory association, eliminating the reliance on complex appearance feature re-identification (ReID) modules [36]. The algorithm employs the Kalman Filter to predict target states and combines it with the Hungarian Algorithm to achieve optimal bipartite graph matching between trajectories and detection boxes [37].

The core innovation of ByteTrack lies in its proposed hierarchical association strategy, which effectively exploits the valuable information contained within low-confidence detection boxes. The workflow of this strategy is illustrated in Figure 5.

(1) First Association: High-confidence detection boxes (score ≥ τ_high) are matched with all existing trajectories. Successfully associated trajectories are then updated.

(2) Second Association: The trajectories that remain unmatched from the first association are subsequently matched with low-confidence detection boxes (τ_low ≤ score < τ_high). This step aims to recover targets whose detection confidence has dropped due to factors like occlusion or motion blur.

(3) Trajectory Management: New trajectories are initialized for any remaining high-score detection boxes that were not matched. Conversely, trajectories that remain unmatched for an extended period are terminated.
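The three steps above can be sketched as follows. This is a simplified illustration, not the ByteTrack reference code: greedy IoU matching stands in for the Hungarian algorithm, Kalman prediction is left out, and the threshold values and function names are assumptions made for the example.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(track_ids, tracks, det_ids, dets, iou_min):
    """Greedy IoU matching (a stand-in for Hungarian matching)."""
    pairs = sorted(
        ((iou(tracks[t], dets[d][0]), t, d) for t in track_ids for d in det_ids),
        reverse=True,
    )
    used_t, used_d, matches = set(), set(), []
    for score, t, d in pairs:
        if score < iou_min:
            break
        if t in used_t or d in used_d:
            continue
        used_t.add(t)
        used_d.add(d)
        matches.append((t, d))
    return matches


def byte_associate(tracks, dets, tau_high=0.6, tau_low=0.1, iou_min=0.3):
    """tracks: {track_id: box}; dets: list of (box, score)."""
    high = [i for i, (_, s) in enumerate(dets) if s >= tau_high]
    low = [i for i, (_, s) in enumerate(dets) if tau_low <= s < tau_high]
    # (1) First association: high-confidence boxes against all tracks.
    first = greedy_match(list(tracks), tracks, high, dets, iou_min)
    matched_tracks = {t for t, _ in first}
    remaining = [t for t in tracks if t not in matched_tracks]
    # (2) Second association: leftover tracks against low-confidence boxes.
    second = greedy_match(remaining, tracks, low, dets, iou_min)
    recovered = {t for t, _ in second}
    # (3) Management: unmatched high-score boxes start new tracks.
    matched_dets = {d for _, d in first}
    new_tracks = [d for d in high if d not in matched_dets]
    unmatched_tracks = [t for t in remaining if t not in recovered]
    return first + second, new_tracks, unmatched_tracks


tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
dets = [
    ((1, 1, 11, 11), 0.9),     # high score, overlaps track 1
    ((21, 21, 31, 31), 0.3),   # low score, overlaps track 2
    ((50, 50, 60, 60), 0.8),   # high score, no matching track
    ((70, 70, 80, 80), 0.05),  # below tau_low, ignored
]
matches, new_tracks, unmatched_tracks = byte_associate(tracks, dets)
print(matches, new_tracks, unmatched_tracks)  # [(1, 0), (2, 1)] [2] []
```

In the example, track 2 is only recovered because the second stage considers the low-confidence box; a tracker that discarded it would have left track 2 unmatched.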

2.2.2. Optimization of Object Tracking Algorithms

The hierarchical association strategy of ByteTrack significantly mitigates target loss in challenging scenarios such as occlusion and scale variation, demonstrating excellent performance on public datasets like the MOT Challenge [38]. However, ByteTrack’s design implicitly assumes a relatively stationary camera viewpoint, meaning the motion in the image stems primarily from the movement of the targets themselves [39]. This assumption no longer holds in UAV (Unmanned Aerial Vehicle) scenarios with dynamic platforms. Camera motion induces a global transformation of the image coordinate system, which undermines the trajectory association based on motion prediction. Consequently, this leads to a substantial increase in false negatives and identity switches (ID switches) [40].

To address the aforementioned issue, this paper designs the SofByteTrack algorithm by introducing a Camera Motion Compensation (CMC) module, which estimates the global coordinate transformation by registering the background motion between consecutive frames [41]. Specifically, an image registration method based on Sparse Optical Flow (SOF) is employed to reveal the background motion in three steps, as outlined below:

First, key points are extracted from the image at frame $k-1$ using the Shi-Tomasi corner detector. Simultaneously, a dynamic object mask (based on detection boxes) is generated to exclude regions containing moving targets, ensuring that feature points are detected only on the static background. Second, the pyramidal Lucas-Kanade optical flow algorithm is employed to track the feature points from frame $k-1$ to frame $k$, obtaining sparse correspondences. Finally, the RANSAC algorithm is used to robustly estimate an affine transformation matrix $A_{k-1}^{k} \in \mathbb{R}^{2 \times 3}$ from frame $k-1$ to frame $k$. This matrix models global transformations, such as translation, rotation, and scaling, caused by camera motion.

$$A_{k-1}^{k} = \left[ M_{2 \times 2} \mid T_{2 \times 1} \right] = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix} \tag{1}$$

In Equation (1), the estimated $2 \times 3$ affine matrix $A_{k-1}^{k}$ is partitioned as $[M_{2 \times 2} \mid T_{2 \times 1}]$, where $M_{2 \times 2} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}$ and $T_{2 \times 1} = \begin{bmatrix} a_{13} \\ a_{23} \end{bmatrix}$.
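The masking and affine-fitting stages can be illustrated with a small NumPy sketch. It makes several simplifying assumptions: point correspondences are taken as given rather than produced by a corner detector and optical flow, an ordinary least-squares fit replaces the RANSAC estimator described above (so it is not robust to outliers), and all function names are hypothetical.

```python
import numpy as np


def background_mask(height, width, boxes):
    """Boolean mask that is False inside detection boxes (x1, y1, x2, y2)."""
    mask = np.ones((height, width), dtype=bool)
    for x1, y1, x2, y2 in boxes:
        mask[int(y1):int(y2), int(x1):int(x2)] = False
    return mask


def on_background(points, mask):
    """Flag the (x, y) points that fall on the static background."""
    cols = points[:, 0].astype(int)
    rows = points[:, 1].astype(int)
    return mask[rows, cols]


def fit_affine(src, dst):
    """Least-squares 2x3 affine [M | T] such that dst ~= M @ src + T."""
    design = np.hstack([src, np.ones((len(src), 1))])    # (N, 3)
    sol, *_ = np.linalg.lstsq(design, dst, rcond=None)   # (3, 2)
    return sol.T                                         # (2, 3)


# Synthetic example: a known camera motion applied to random points.
rng = np.random.default_rng(0)
W, H = 100, 80
A_true = np.array([[1.01, 0.02, 5.0], [-0.02, 0.99, -3.0]])
pts_prev = rng.uniform(low=0.0, high=[W, H], size=(50, 2))
pts_next = pts_prev @ A_true[:, :2].T + A_true[:, 2]

mask = background_mask(H, W, boxes=[(10, 10, 40, 40)])
keep = on_background(pts_prev, mask)
A_est = fit_affine(pts_prev[keep], pts_next[keep])
M, T = A_est[:, :2], A_est[:, 2]   # the partition used in Equation (1)
```

With real imagery, the correspondences would come from the corner detection and optical flow steps; OpenCV provides `cv2.goodFeaturesToTrack`, `cv2.calcOpticalFlowPyrLK`, and `cv2.estimateAffine2D` for these operations, although the source does not state which implementation was used.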

The Kalman state is $x = [x, y, a, h, \dot{x}, \dot{y}, \dot{a}, \dot{h}]^{T}$.

To compensate for camera motion, the $2 \times 2$ linear part $M_{2 \times 2}$ is applied to each 2-dimensional sub-vector of the state, namely the location $(x, y)$, the shape $(a, h)$, and their velocities $(\dot{x}, \dot{y})$ and $(\dot{a}, \dot{h})$. This is realized by constructing the $8 \times 8$ block-diagonal matrix $\tilde{M}_{k-1}^{k}$ given in Equation (2):

$$\tilde{M}_{k-1}^{k} = \begin{bmatrix} M_{2 \times 2} & 0 & 0 & 0 \\ 0 & M_{2 \times 2} & 0 & 0 \\ 0 & 0 & M_{2 \times 2} & 0 \\ 0 & 0 & 0 & M_{2 \times 2} \end{bmatrix}, \quad \tilde{T}_{k-1}^{k} = \begin{bmatrix} a_{13} \\ a_{23} \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \tag{2}$$

The translation component only affects the target's spatial position $(x, y)$. Therefore, the $8 \times 1$ compensation vector $\tilde{T}_{k-1}^{k}$ is formed by appending six zeros to $T_{2 \times 1}$.

This explicit expansion clarifies how the 2D affine transformation parameters are mapped to the full 8-state Kalman representation. Applying the same linear warping to all four 2-dimensional state components is a deliberate engineering choice that treats the aspect ratio $a$ and height $h$ as being deformed by the global image motion, a pragmatic convention consistent with recent camera-motion-aware trackers [42].

$$\hat{x}'_{k \mid k-1} = \tilde{M}_{k-1}^{k} \, \hat{x}_{k \mid k-1} + \tilde{T}_{k-1}^{k} \tag{3}$$

$$P'_{k \mid k-1} = \tilde{M}_{k-1}^{k} \, P_{k \mid k-1} \, \left( \tilde{M}_{k-1}^{k} \right)^{T} \tag{4}$$

Here, $M_{2 \times 2} \in \mathbb{R}^{2 \times 2}$ contains the scaling and rotation components of the affine matrix, and $T_{2 \times 1}$ contains the translation component. Their expanded forms are $\tilde{M}_{k-1}^{k} \in \mathbb{R}^{8 \times 8}$ and $\tilde{T}_{k-1}^{k} \in \mathbb{R}^{8}$.

In high-speed scenarios, it is necessary to perform full correction on the state vector. When the camera motion is relatively slow compared to the frame rate, Equation (4) can be skipped. By adopting this strategy, the tracker exhibits enhanced robustness against camera motion.
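The expansion in Equation (2) and the correction in Equations (3) and (4) can be written compactly in NumPy. This is a minimal sketch under the state layout given above; the function names and the `correct_covariance` flag (which models the option of skipping Equation (4)) are illustrative.

```python
import numpy as np


def expand_affine(A):
    """Map a 2x3 affine [M | T] to the 8x8 matrix and 8-vector of Equation (2)."""
    M, T = A[:, :2], A[:, 2]
    M8 = np.kron(np.eye(4), M)              # block-diagonal, four copies of M
    T8 = np.concatenate([T, np.zeros(6)])   # translation acts on (x, y) only
    return M8, T8


def compensate(x_pred, P_pred, A, correct_covariance=True):
    """Apply Equation (3) to the state and, optionally, Equation (4) to P."""
    M8, T8 = expand_affine(A)
    x_corr = M8 @ x_pred + T8
    P_corr = M8 @ P_pred @ M8.T if correct_covariance else P_pred
    return x_corr, P_corr


# Pure translation of (+5, -3) pixels: only the position entries change.
A_shift = np.array([[1.0, 0.0, 5.0], [0.0, 1.0, -3.0]])
x_pred = np.array([100.0, 50.0, 0.5, 40.0, 1.0, 0.0, 0.0, 0.0])
P_pred = np.eye(8)
x_corr, P_corr = compensate(x_pred, P_pred, A_shift)
print(x_corr[:2])  # [105.  47.]
```

For a pure translation the linear part is the identity, so the covariance is unchanged; this is the situation in which skipping Equation (4) costs nothing.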

$\hat{x}'_{k \mid k-1}$ and $P'_{k \mid k-1}$ represent the predicted state vector and the predicted covariance matrix, respectively, after compensation for camera motion. The algorithm then proceeds to the conventional observation update stage of the Kalman filter. At this stage, the Kalman filter fuses the observations (i.e., target bounding boxes) with the compensated predicted state, which keeps the target estimate accurate and reduces tracking loss and drift. The update equations are given by (5)–(7).

$$K_{k} = P'_{k \mid k-1} H_{k}^{T} \left( H_{k} P'_{k \mid k-1} H_{k}^{T} + R_{k} \right)^{-1} \tag{5}$$

$$\hat{x}_{k \mid k} = \hat{x}'_{k \mid k-1} + K_{k} \left( Z_{k} - H_{k} \hat{x}'_{k \mid k-1} \right) \tag{6}$$

$$P_{k \mid k} = \left( I - K_{k} H_{k} \right) P'_{k \mid k-1} \tag{7}$$

Here, $R_{k}$ is the observation noise covariance matrix; $P'_{k \mid k-1}$ is the predicted state covariance matrix after camera motion compensation; $H_{k}$ is the observation model matrix; $Z_{k}$ is the observation from the current frame; $K_{k}$ is the Kalman gain; and $I$ is the identity matrix.
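A minimal NumPy sketch of the observation update follows. It assumes that the innovation is computed from the motion-compensated prediction and that the observation matrix simply selects the four box components $(x, y, a, h)$ from the 8-dimensional state; the function name and the numeric values are illustrative.

```python
import numpy as np


def kalman_update(x_corr, P_corr, z, H, R):
    """Kalman observation update applied to the compensated prediction."""
    S = H @ P_corr @ H.T + R                         # innovation covariance
    K = P_corr @ H.T @ np.linalg.inv(S)              # Kalman gain, Eq. (5)
    x_new = x_corr + K @ (z - H @ x_corr)            # state update, Eq. (6)
    P_new = (np.eye(len(x_corr)) - K @ H) @ P_corr   # covariance update, Eq. (7)
    return x_new, P_new


# Observation model: the detector measures (x, y, a, h) directly (assumed).
H = np.hstack([np.eye(4), np.zeros((4, 4))])
x_corr = np.zeros(8)
P_corr = np.eye(8)
R = np.eye(4)
z = np.array([2.0, 4.0, 6.0, 8.0])

x_new, P_new = kalman_update(x_corr, P_corr, z, H, R)
print(x_new[:4])  # [1. 2. 3. 4.]
```

With equal prediction and observation uncertainty, the gain is one half, so the updated box lies midway between the prediction and the measurement and the variance of the observed components is halved.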

2.3. Training and Deployment Pipeline

The proposed TCYOLO-SofByteTrack framework follows a decoupled detection-and-tracking paradigm rather than an end-to-end joint optimization strategy. Specifically, the TCYOLO detector and the SofByteTrack tracker are developed and executed as two independent modules.

During the training stage, only the TCYOLO detector requires parameter optimization. The detector is trained on the VisDrone2019 dataset and the self-constructed UAV highway inspection dataset using supervised learning. Standard object detection losses, including classification loss, localization loss, and distribution focal loss, are adopted following the YOLOv11 training framework.

In contrast, SofByteTrack does not require offline training. Similar to the original ByteTrack framework, it is a tracking-by-detection algorithm based on motion prediction and data association. The tracker utilizes the detection results generated by TCYOLO in each frame and performs online trajectory association using the Kalman filter and Hungarian matching algorithm. The proposed sparse optical flow compensation module operates in real time and does not introduce additional trainable parameters.

During inference and deployment, the complete framework operates sequentially in a pipeline manner. First, TCYOLO processes each incoming UAV image frame and outputs object bounding boxes, confidence scores, and category labels. These detection results are then transmitted to the SofByteTrack module. Subsequently, the sparse optical flow-based camera motion compensation module estimates global image displacement between consecutive frames and corrects the predicted target states before data association. Finally, the corrected trajectories are updated through the Kalman filtering process to generate stable tracking outputs.
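The per-frame order of operations can be sketched as a short loop. This is a schematic illustration only: `detect`, `estimate_motion`, and the tracker methods `predict`, `compensate`, and `update` are hypothetical interfaces standing in for TCYOLO, the sparse optical flow module, and SofByteTrack, and the stub tracker exists solely to make the example runnable.

```python
from typing import Any, Callable, Iterable, List


def run_pipeline(
    frames: Iterable[Any],
    detect: Callable[[Any], List[Any]],
    estimate_motion: Callable[[Any, Any, List[Any]], Any],
    tracker: Any,
) -> List[List[Any]]:
    """Run detection, motion compensation, and association frame by frame."""
    outputs = []
    prev = None
    for frame in frames:
        detections = detect(frame)            # detector output for this frame
        tracker.predict()                     # Kalman prediction of all tracks
        if prev is not None:
            affine = estimate_motion(prev, frame, detections)
            tracker.compensate(affine)        # correct predictions before matching
        outputs.append(tracker.update(detections))  # association and update
        prev = frame
    return outputs


class StubTracker:
    """Records the call order; a real tracker would hold Kalman states."""

    def __init__(self):
        self.calls = []

    def predict(self):
        self.calls.append("predict")

    def compensate(self, affine):
        self.calls.append("compensate")

    def update(self, detections):
        self.calls.append("update")
        return list(detections)


stub = StubTracker()
results = run_pipeline(
    frames=[0, 1],
    detect=lambda frame: [frame],
    estimate_motion=lambda prev, cur, dets: None,
    tracker=stub,
)
```

No compensation is applied on the first frame because there is no previous image to register against; from the second frame onward the correction always precedes data association.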

For real-time deployment on the NVIDIA Jetson Orin NX platform, the TCYOLO detector is accelerated using TensorRT FP16 precision optimization, while the tracking module runs on the CPU side with lightweight optical flow computation. The detector and tracker communicate through frame-level detection outputs, enabling asynchronous and efficient execution under embedded resource constraints.

3. Results

3.1. Dataset and Parameter Configuration

3.1.1. Dataset

Because publicly available datasets for UAV highway inspection are scarce, this paper adopts a dual-validation strategy. The VisDrone2019 dataset is selected as a benchmark for evaluating general small-object detection performance, validating the proposed algorithm’s foundational detection capabilities in complex environments. Concurrently, a self-constructed UAV Highway Intelligent Inspection dataset is employed to assess the algorithm’s effectiveness and practical utility in real-world inspection application scenarios.

VisDrone2019 [33] is a large-scale aerial image dataset jointly developed by Tianjin University and the AISKYEYE team. It comprises 10,209 high-resolution static images and 288 video clips, totaling 261,908 frames. The data was collected across diverse urban and rural scenarios in China under various weather and lighting conditions. The dataset contains 2.6 million manually annotated bounding boxes, primarily focusing on pedestrians, vehicles, and non-motorized transport. It is characterized by features such as small-object dominance, highly imbalanced density distribution, and dynamic scale variations. Widely used for tasks like small-object detection, multi-object tracking, and dense crowd counting, VisDrone2019 is strictly partitioned into a training set (6471 images), a validation set (548 images), and a test set (1610 images).

To meet the specific requirements of UAV highway inspection tasks, this study utilizes the M300 RTK drone equipped with Zenmuse gimbal cameras (H20 and P1) to construct a road marking dataset. The use of different camera resolutions contributes to dataset diversity, which helps demonstrate the algorithm’s applicability and robustness across varying data sources.

First, a static image dataset was developed for detection tasks. The data collection primarily covers internal campus road scenarios, while also including sections of highways, primary roads, and secondary roads, ensuring coverage of complex environments such as shadows, debris, and occlusions. This dataset comprises a total of 8519 images, annotated with four types of road markings: white solid lines, yellow dotted lines, white dotted lines, and yellow solid lines. These images are split into a training set of 7502 images and a validation set of 1017 images to support the training and evaluation of detection models.

Additionally, a separate UAV-based road marking tracking dataset was established. This dataset consists of 8 video sequences, totaling over 4000 frames. While this duration is modest for training a deep neural network from scratch, it is specifically designed to validate the continuous localization capability and robustness of tracking algorithms under dynamic inspection conditions. These sequences encompass various challenging scenarios, such as strong shadow variations, vehicle occlusions, marking wear, and dynamic UAV maneuvers. The length and diversity of these sequences are considered sufficient to rigorously test the performance of trackers on road marking targets. Sample images from the dataset are illustrated in Figure 6a.

Figure 6b presents the manually annotated statistics of the self-constructed dataset. The top-left subfigure shows the number of instances per category, indicating that yellow dotted lines are the predominant class. The top-right subfigure illustrates the size distribution of object bounding boxes, with all box centers normalized and aligned to a common point for comparison. The bottom-left subfigure displays a spatial location heatmap, where dark blue regions represent areas with the highest concentration of object centers. The targets are predominantly distributed in a cross-like pattern, concentrated near the central-lower part of the image. Two intersecting high-density bands are observed near y ≈ 0.5 and x ≈ 0.5, while the left and right edges of the image show relatively sparse target distribution.

In vision-based UAV highway inspection systems, accurate road marking detection is a crucial technical component for enabling autonomous navigation [43]. By precisely identifying the position of road markings, the system can provide real-time trajectory correction information to the UAV flight control system, thereby achieving autonomous inspection flight along the road centerline, which ensures the safety and effectiveness of inspection missions. Therefore, constructing a high-quality road marking detection dataset is of great significance for enhancing the overall performance of UAV-based intelligent inspection systems.

To clarify the observation resolution and its impact on target distinguishability, this study provides a detailed calculation of the Ground Sampling Distance (GSD) for the acquired imagery. Based on the pinhole imaging principle, the GSD is calculated using the following formula:

\mathrm{GSD} = \frac{\text{Flight Altitude} \times \text{Pixel Size}}{\text{Focal Length}}

(8)

For highway inspection tasks, the UAV typically maintains a flight altitude of approximately 10 m to capture detailed road markings. To illustrate the effect of altitude on resolution and ground coverage, Table 1 presents the GSD values and corresponding ground coverage for different flight altitudes, calculated for the two camera models used in this study (DJI Zenmuse P1 and H20).
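Equation (8) can be evaluated with a few lines of code. The sketch below is illustrative only: the pixel pitch (4.4 µm) and focal length (35 mm) are assumed values chosen for the example and are not taken from Table 1, and the function name `ground_sampling_distance` is hypothetical.

```python
def ground_sampling_distance(altitude_m, pixel_size_m, focal_length_m):
    """GSD in metres per pixel, following Equation (8)."""
    return altitude_m * pixel_size_m / focal_length_m


# Assumed sensor and lens values for illustration (not from the paper's Table 1).
PIXEL_SIZE_M = 4.4e-6
FOCAL_LENGTH_M = 35e-3

for altitude_m in (10, 20, 50):
    gsd_cm = ground_sampling_distance(altitude_m, PIXEL_SIZE_M, FOCAL_LENGTH_M) * 100
    print(f"{altitude_m:>3} m -> {gsd_cm:.3f} cm/pixel")
```

Because GSD is linear in altitude, doubling the flight altitude doubles the ground distance covered by each pixel.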

As shown in the statistical summary of our dataset (Figure 6b), the vast majority of target bounding boxes have relative width < 0.02 and relative height < 0.2. Following the widely used criterion of relative area (less than 0.12% of the total image area) [44], these targets are confirmed as small objects.

Using the GSD values, we can translate this pixel-based definition into physical dimensions. At a typical flight altitude of 10 m (GSD = 0.125 cm/pixel), the original image covers about 10.24 m × 6.85 m. A target occupying 0.12% of this image area corresponds to a ground area of approximately 0.084 m² (e.g., a square of roughly 0.29 m × 0.29 m). This scale is consistent with the size of individual road marking elements, such as segments of dashed lines, thereby providing a concrete physical interpretation of the term “small object” in our study.
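The conversion from relative image area to physical size can be reproduced directly from the figures quoted above. The sketch uses only values stated in the text; the variable names are illustrative.

```python
import math

GSD_M = 0.00125                      # 0.125 cm/pixel at 10 m altitude
COVER_W_M, COVER_H_M = 10.24, 6.85   # ground coverage of the original image
SMALL_AREA_RATIO = 0.0012            # 0.12% of the image area

ground_area_m2 = COVER_W_M * COVER_H_M * SMALL_AREA_RATIO
side_m = math.sqrt(ground_area_m2)   # side of an equivalent square target
side_px = side_m / GSD_M             # same side expressed in original-image pixels

print(f"area = {ground_area_m2:.3f} m^2, side = {side_m:.2f} m, about {side_px:.0f} px")
```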

3.1.2. Parameter Configuration

The experimental platform was configured with a Windows operating system and an NVIDIA GeForce RTX 3090 Ti GPU (24 GB of memory). The software environment included PyTorch 2.1.0, Python 3.8, and CUDA 11.8. Detailed experimental settings are provided in Table 2.

All input images were uniformly resized to 640 × 640 pixels. Under identical experimental conditions, all models were trained for 300 epochs using the same set of training hyperparameters, as configured in Table 2.

The inference validation and real-time performance evaluation were conducted on an NVIDIA Jetson Orin NX embedded platform. The experimental environment comprised PyTorch 2.5, TensorRT 10.3, Python 3.10, and JetPack 6.1.0. The hardware specifications of the embedded platform are detailed in Table 3.

3.2. Evaluation Index

The evaluation metrics employed in this experiment include mean Average Precision (mAP) at different Intersection over Union (IoU) thresholds (specifically mAP@0.5 and mAP@0.5:0.95), Precision (P), Recall (R), number of parameters (Params), floating-point operations (FLOPs), model size, and frames per second (FPS).

As a standard metric for object detection, mAP comprehensively evaluates model accuracy across different IoU thresholds. Specifically, mAP@0.5 refers to the mAP value calculated at an IoU threshold of 0.5, while mAP@0.5:0.95 denotes the average mAP computed over IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05. Higher values of both mAP@0.5 and mAP@0.5:0.95 indicate better overall model accuracy. The calculation of mAP is detailed in Formula (9).

\mathrm{mAP} = \frac{1}{n} \sum_{k=1}^{n} \mathrm{AP}_{k}

(9)
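As a minimal sketch of Equation (9) and of the mAP@0.5:0.95 averaging described above, assuming that the per-class AP values at each IoU threshold are already available (the AP computation itself is not shown, and the function names are hypothetical):

```python
def mean_average_precision(ap_per_class):
    """Equation (9): mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# IoU thresholds 0.50, 0.55, ..., 0.95 used by mAP@0.5:0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def map_50_95(ap_table):
    """Average mAP over all IoU thresholds.

    ap_table maps each IoU threshold to a list of per-class AP values.
    """
    per_threshold = [mean_average_precision(ap_table[t]) for t in IOU_THRESHOLDS]
    return sum(per_threshold) / len(per_threshold)


# Hypothetical AP values for two classes that degrade as the IoU threshold tightens.
example_table = {t: [max(0.0, 0.9 - t), max(0.0, 0.7 - t)] for t in IOU_THRESHOLDS}
print(mean_average_precision(example_table[0.5]), map_50_95(example_table))
```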

The precision rate (P) is as shown in Formula (10).

\mathrm{Precision} = \frac{TP}{TP + FP}

(10)

The recall rate (R) is as shown in Equation (11).

\mathrm{Recall} = \frac{TP}{TP + FN}

(11)
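Equations (10) and (11) translate directly into code. The counts in the example are hypothetical, and the zero-denominator guard is a convention chosen for this sketch rather than something specified in the paper.

```python
def precision(tp, fp):
    """Equation (10): fraction of predicted boxes that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Equation (11): fraction of ground truth objects that are found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts: 80 correct detections, 20 spurious, 120 missed.
print(precision(80, 20), recall(80, 120))
```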

MOTA (Multiple Object Tracking Accuracy), MOTP (Multiple Object Tracking Precision), and HOTA (Higher Order Tracking Accuracy) are metrics representing the performance of multiple object trackers, corresponding to overall tracking accuracy, localization precision, and higher-order tracking accuracy, respectively. HOTA provides a more comprehensive and balanced assessment of tracker performance. IDSW (ID Switch) refers to the number of identity switches occurring during the tracking process. IDF1 (ID F1 Score) is defined as the ratio of correctly identified detections to the average of the number of ground truth objects and the number of computed detections [45].

\mathrm{MOTA} = 1 - \frac{\sum_{t} (FN_{t} + FP_{t} + IDSW_{t})}{\sum_{t} GT_{t}}

(12)

FN_{t}: The number of False Negatives in frame t, i.e., the number of ground truth targets missed by the detector in that frame.

FP_{t}: The number of False Positives in frame t, i.e., the number of spurious targets output by the detector in that frame.

IDSW_{t}: The number of Identity Switches in frame t, i.e., the number of times the identity label of a tracked target incorrectly changes during the tracking process.

GT_{t}: The number of Ground Truth targets in frame t, i.e., the total number of actual objects present in that frame.
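A minimal sketch of Equation (12), assuming the per-frame error counts have already been produced by a matching step that is not shown here. The dictionary layout and the example counts are hypothetical.

```python
def mota(frames):
    """Equation (12): frames is a list of dicts with keys 'fn', 'fp', 'idsw', 'gt'."""
    errors = sum(f["fn"] + f["fp"] + f["idsw"] for f in frames)
    total_gt = sum(f["gt"] for f in frames)
    return 1.0 - errors / total_gt


# Hypothetical two-frame sequence with 10 ground truth targets per frame.
frames = [
    {"fn": 2, "fp": 1, "idsw": 0, "gt": 10},
    {"fn": 3, "fp": 0, "idsw": 1, "gt": 10},
]
print(mota(frames))
```

Note that MOTA becomes negative when the total number of errors exceeds the number of ground truth targets.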

\mathrm{MOTP} = \frac{\sum_{t, i} d_{t, i}}{\sum_{t} c_{t}}

(13)

d_{t, i}: The localization error for the i-th successfully matched pair (i.e., a correct association between a detection box and a ground truth box) in frame t. This is typically measured by the bounding box overlap (e.g., IoU) or the distance between center points.

c_{t}: The number of successfully matched target pairs in frame t.
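Equation (13) can be sketched as below, assuming each frame contributes a list of per-match values d_{t,i}. The example uses IoU as the measure, which is one of the two options mentioned above; the values themselves are hypothetical.

```python
def motp(matches_per_frame):
    """Equation (13): matches_per_frame[t] holds d_{t,i} for every match in frame t."""
    total_d = sum(sum(frame) for frame in matches_per_frame)
    total_c = sum(len(frame) for frame in matches_per_frame)  # sum over t of c_t
    return total_d / total_c if total_c else 0.0


# Hypothetical IoU values: two matches in frame 0, one in frame 1, none in frame 2.
print(motp([[0.8, 0.7], [0.9], []]))
```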

\mathrm{HOTA} = \sqrt{\frac{\sum_{c \in \{TP\}} A(c)}{TP + FN + FP}}

(14)

TP: The set of True Positive Associations, i.e., trajectory segments that are correctly matched between the detections and the ground truth in both space and time.

A(c): The association score for each correct association c. It comprehensively measures both the accuracy of the detection and the stability of the identity maintenance for that association.

FN: The total number of False Negative Associations across the entire video sequence, i.e., ground truth trajectory segments that should have been associated but were not.

FP: The total number of False Positive Associations across the entire video sequence, i.e., predicted trajectory segments that were incorrectly associated.

The IDF1 score corresponds to the number of correctly tracked detections across the entire video sequence.

A(c) = \frac{|TPA(c)|}{|TPA(c)| + |FNA(c)| + |FPA(c)|}

(15)

TPA(c): True Positive Associations, i.e., the number of trajectory segments correctly associated with ground truth trajectories.

FNA(c): False Negative Associations, i.e., the number of ground truth trajectory segments that should have been associated but were not.

FPA(c): False Positive Associations, i.e., the number of predicted trajectory segments incorrectly associated with the detection.
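Equations (14) and (15) can be combined in a short sketch. It assumes that matching at a single localization threshold has already produced, for each true positive c, the counts |TPA(c)|, |FNA(c)|, and |FPA(c)|. The commonly reported HOTA value additionally averages this quantity over a range of localization thresholds, which is omitted here. Function names and counts are hypothetical.

```python
import math


def association_score(tpa, fna, fpa):
    """Equation (15): association accuracy A(c) for one true positive c."""
    return tpa / (tpa + fna + fpa)


def hota(assoc_counts, num_fn, num_fp):
    """Equation (14) at a single localization threshold.

    assoc_counts holds one (|TPA|, |FNA|, |FPA|) tuple per true positive.
    """
    num_tp = len(assoc_counts)
    denom = num_tp + num_fn + num_fp
    if denom == 0:
        return 0.0
    total = sum(association_score(*counts) for counts in assoc_counts)
    return math.sqrt(total / denom)


# Hypothetical sequence: four true positives, one miss, one false alarm.
assoc_counts = [(8, 1, 1), (8, 1, 1), (4, 4, 0), (4, 4, 0)]
print(round(hota(assoc_counts, num_fn=1, num_fp=1), 4))
```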

\mathrm{IDF1} = \frac{IDTP}{IDTP + 0.5\, IDFP + 0.5\, IDFN}

(16)

IDTP (Identity True Positives) refers to the number of detected targets that are correctly assigned their identities in the video sequence.

IDFN (Identity False Negatives) denotes the number of ground truth targets that are missed and not assigned any identity.

IDFP (Identity False Positives) represents the number of detected targets that are incorrectly assigned an identity.
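A minimal sketch of Equation (16), assuming the identity-level counts come from a global identity matching step that is not shown. The counts in the example are hypothetical.

```python
def idf1(idtp, idfp, idfn):
    """Equation (16): F1 score over identity-level matches."""
    denom = idtp + 0.5 * idfp + 0.5 * idfn
    return idtp / denom if denom else 0.0


# Hypothetical counts: 60 correct identities, 20 wrong, 40 missed.
print(round(idf1(60, 20, 40), 4))
```

The expression is algebraically identical to 2·IDTP / (2·IDTP + IDFP + IDFN), the form often used in tracking toolkits.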

3.3. Experimental Results and Analysis

3.3.1. Attention Ablation Experiment

To validate the effectiveness of integrating the TA module with C3k2, an ablation study was conducted using YOLOv11n as the base architecture. Specifically, C3k2 was combined with different attention mechanisms (SE, MoCAttention, CBAM, and TA) to evaluate their impact on model performance. The experimental results, presented in Table 4, provide quantitative evidence supporting the effectiveness of the proposed method.

3.3.2. Neck Comparison Experiment

To validate the effectiveness of the proposed feature fusion method, this study conducts a comparative analysis between CCFM-UAV and several existing mainstream feature fusion approaches. All compared methods are configured with the same small-object detection head for a fair assessment. The experimental results are presented in Table 5.

The compared methods include: (1) PAFPN: The feature fusion technique adopted in YOLOv11, which enhances information flow and aggregation between feature maps by adding horizontal connections along the top-down and bottom-up pathways, achieving multi-scale feature representation. (2) Slim-neck: An efficient neck architecture that utilizes GSConv and VoV-GSCSP modules. It maintains detection accuracy while reducing computational complexity through multi-scale feature fusion and hierarchical processing. (3) BiFPN: Building upon PAFPN, it assigns learnable weights to each input feature to optimize the fusion process. Its structure is optimized by removing single-input nodes, adding extra connections between input and output nodes at the same level, and treating each bidirectional path as a standalone feature network layer, thereby enhancing cross-scale connectivity [46].

Table 5. Comparison of neck networks.

Model	mAP@0.5/%	mAP@0.5:0.95/%	P/%	R/%	Params/M	FLOPs/G	Model Size/MB
PAFPN	36.3	21.9	47.6	35.3	2.6	10.2	5.8
Slim-neck [47]	35.6	21.3	46.4	35.1	2.65	9.8	5.8
BiFPN	39.3	23.7	49.4	37.6	2.76	12.8	6.0
CCFM-UAV	40.0	24.3	50.8	38.7	2.08	13.0	4.7

The experimental results show that the proposed CCFM-UAV improves detection performance while reducing model size. Compared with the original PAFPN neck, CCFM-UAV improves precision, recall, mAP@0.5, and mAP@0.5:0.95 by 3.2, 3.4, 3.7, and 2.4 percentage points, respectively, while reducing the parameter count from 2.6 M to 2.08 M and the model size from 5.8 MB to 4.7 MB. These gains come with an increase in computation from 10.2 to 13.0 GFLOPs.

The parameter reduction mainly benefits from three structural optimizations. First, all intermediate feature maps in the neck are unified to a fixed width of 256 channels, which eliminates redundant high-dimensional feature transformations and reduces channel adaptation overhead. Second, CCFM-UAV replaces part of the concatenation-dominated fusion strategy in PAFPN with element-wise addition for same-scale feature fusion, thereby avoiding channel expansion after fusion and reducing the parameter burden of subsequent convolution layers. Third, lightweight 1 × 1 projection convolutions are employed before feature fusion and upsampling operations, which further decreases computational redundancy while preserving discriminative spatial information.
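The effect of the second optimization can be illustrated with a simple parameter count. The sketch compares the weights of one convolution that follows a two-input fusion at the unified width of 256 channels. The 3 × 3 kernel, the two-input fusion, and the absence of bias terms are assumptions made for illustration; the actual layer configuration of CCFM-UAV may differ.

```python
def conv_params(c_in, c_out, kernel, bias=False):
    """Weight count of a standard 2D convolution layer."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


WIDTH = 256  # unified neck width stated in the text

# Concatenating two 256-channel maps doubles the input channels of the next conv.
after_concat = conv_params(2 * WIDTH, WIDTH, kernel=3)
# Element-wise addition keeps the channel count at 256.
after_add = conv_params(WIDTH, WIDTH, kernel=3)

print(after_concat, after_add, after_concat / after_add)
```

Under these assumptions, the convolution that follows an addition needs half the weights of the one that follows a concatenation.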

Although an additional P2 detection head is introduced, its parameter overhead remains limited because shallow backbone features are compressed before entering the fusion pathway. These improvements demonstrate that CCFM-UAV achieves a more effective balance between multi-scale feature representation capability and lightweight model design for UAV small-object detection tasks.

3.3.3. Ablation Experiment

Based on the YOLOv11n baseline model, ablation experiments were conducted to validate the effectiveness of the proposed C3k2-TA and CCFM-UAV modules (Table 6). When the C3k2-TA module is introduced on its own, precision increases by 0.8% while the number of parameters and the model size remain unchanged. This indicates that the module achieves a marginal performance gain with minimal computational overhead.

When the CCFM-UAV module is introduced independently, mAP@0.5 improves significantly by 6.7%, and mAP@0.5:0.95 increases by 4.8%. Meanwhile, the number of parameters is reduced by 19.4%, and the model size decreases by 14.5%. This demonstrates that the CCFM-UAV module serves as the primary contributor to performance enhancement, effectively improving detection accuracy while achieving notable model compression.

When both modules are jointly applied, the model attains its optimal performance, with mAP@0.5 reaching 40.7% and mAP@0.5:0.95 reaching 24.7%, representing improvements of 7.4% and 5.2%, respectively, compared to the baseline model. In summary, the ablation experiments confirm the effectiveness of each proposed module and their synergistic contribution to performance gains.

3.3.4. Comparison Experiments of Different Detection Models

To further demonstrate the advantages of the proposed algorithm, in addition to YOLOv8n, models including YOLOv8s, YOLOv9t, YOLOv10n, YOLOv10s, YOLOv11n, and YOLOv11s were selected for experiments on the VisDrone2019 dataset. The relevant experimental results are presented in Table 7.

As shown in Table 7, models such as YOLOv8n, YOLOv9t, YOLOv10n, YOLOv11n, and the latest YOLOv12n are similar in scale to the proposed TCYOLO, with small parameter counts and fast deployment speed; some even achieve higher FPS. However, they exhibit limitations in feature representation, making them prone to false positives and missed detections, which results in significantly lower performance compared to the proposed model. For instance, compared to YOLOv11n, TCYOLO improves mAP@0.5, mAP@0.5:0.95, precision (P), and recall (R) by 7.4%, 5.2%, 5.8%, and 5.0%, respectively, while reducing parameter count and model size by 19.4% and 14.5%. Even against the newest YOLOv12n, TCYOLO achieves a substantial gain of 7.0% in mAP@0.5 and 5.3% in mAP@0.5:0.95, with comparable model complexity, demonstrating its superior feature extraction capability.

Models such as YOLOv8s, YOLOv11s, and YOLOv12s are significantly larger and more complex than TCYOLO, making them difficult to deploy and run stably on embedded airborne devices of UAVs. Moreover, their performance and FPS are lower than those of the proposed model. For example, compared to YOLOv12s, TCYOLO improves both mAP@0.5 and mAP@0.5:0.95 by 0.8 percentage points while maintaining a much lower model complexity (2.08 M vs. 9.23 M parameters, 13.2 vs. 21.2 GFLOPs), which facilitates its deployment on embedded platforms.

MFFSODNet employs a designed Multi-scale Feature Extraction Module (MSFEM) to achieve significant performance gains, but its high computational overhead compromises real-time inference capability [44]. Among Transformer-based models, RT-DETR, an end-to-end real-time object detector evolved from DETR, enjoys enhanced detection accuracy [53]. However, its high parameter complexity and computational demands restrict its applicability for stable operation on resource-constrained UAV onboard platforms. DEIM, an improved DETR model based on optimized matching mechanisms, demonstrates lower detection accuracy (39.1% mAP@0.5 and 22.2% mAP@0.5:0.95) compared to TCYOLO (40.7% mAP@0.5 and 24.7% mAP@0.5:0.95), while also incurring a notably larger model size (14.7 MB vs. 4.7 MB) and higher parameter count (3.7 M vs. 2.08 M). Faster R-CNN, as a representative two-stage algorithm, suffers from high model complexity (41.7 M parameters, 134.2 GFLOPs, and 108 MB) and exhibits particularly weak performance in small-object detection, with its mAP@0.5:0.95 (9.8%) substantially lower than that of TCYOLO (24.7%).

To further benchmark against recent state-of-the-art (SOTA) detectors, we include TOE-YOLO, LRDS-YOLO, SRTSOD-YOLO-n, UAV-DETR, and QueryDet for a comparative analysis focused on the accuracy-efficiency trade-off. The quantitative results are presented in Table 7.

Unlike methods that pursue extreme accuracy at the cost of high computational overhead, TCYOLO is specifically designed to achieve a favorable balance between detection performance and model complexity—a critical requirement for resource-constrained UAV platforms. Its incremental contributions over existing SOTA detectors lie in two unique architectural designs: (1) the C3k2-TA module, which enhances small-object feature representation via lightweight triplet attention, and (2) the CCFM-UAV neck, which strengthens cross-scale feature fusion while simultaneously reducing parameter count. These designs enable competitive accuracy without sacrificing inference speed or model compactness.

Among the compared lightweight SOTA methods, TOE-YOLO adopts rotated feature extraction and attention-based concatenation, maintaining a lightweight profile (6.6 GFLOPs, 2.62 M parameters). However, its detection accuracy (33.8% mAP@0.5, 19.7% mAP@0.5:0.95) falls significantly below that of TCYOLO, suggesting that its feature representation capacity is insufficient for the challenging small-object scenarios in UAV imagery. Similarly, QueryDet employs query-based sparse feature sampling to reduce redundant computations, yet its performance remains limited, achieving only 31.6% mAP@0.5 and 17.4% mAP@0.5:0.95 despite relatively high computational complexity (44.3 GFLOPs, 18.9 M parameters). This indicates that lightweight query-driven strategies alone are insufficient to effectively capture fine-grained UAV object features. SRTSOD-YOLO-n achieves the highest inference speed (147 FPS) with a comparable parameter budget, yet its accuracy (36.3% mAP@0.5, 21.8% mAP@0.5:0.95) remains moderate, indicating a trade-off that favors speed over detection quality. In contrast, TCYOLO surpasses TOE-YOLO, QueryDet, and SRTSOD-YOLO-n in mAP@0.5 by 6.9, 9.1, and 4.4 percentage points, respectively, while maintaining a real-time inference speed of 129 FPS and an exceptionally low parameter count (2.08 M) that is among the smallest of all compared detectors—a direct benefit of the parameter-efficient CCFM-UAV design.

On the other end of the spectrum, LRDS-YOLO and UAV-DETR achieve superior detection accuracy. LRDS-YOLO reaches 43.6% mAP@0.5 and 26.6% mAP@0.5:0.95 through lightweight downsampling and re-calibration mechanisms. UAV-DETR further improves detection performance, achieving the highest accuracy among all compared methods with 52.5% mAP@0.5 and 32.7% mAP@0.5:0.95, benefiting from transformer-based global feature modeling and enhanced long-range dependency learning. However, these methods incur substantially higher computational costs. LRDS-YOLO requires 24.1 GFLOPs, while UAV-DETR reaches 72.5 GFLOPs with 21.2 M parameters and a model size of 41.3 MB—far exceeding the resource budget of TCYOLO (13.2 GFLOPs, 2.08 M parameters, 4.7 MB). In particular, UAV-DETR introduces more than five times the computational complexity and over ten times the parameter count of TCYOLO, resulting in a lower inference speed of 70 FPS. Such computational burdens limit their deployability on UAV-embedded platforms where power consumption, memory footprint, and latency constraints are stringent. TCYOLO, while trading off approximately 3–12 percentage points in mAP@0.5, reduces GFLOPs by nearly half compared with LRDS-YOLO, and by more than 80% compared with UAV-DETR, while achieving a highly compact model size of only 4.7 MB, making it one of the most lightweight models in the evaluation.
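The cost comparisons in this paragraph follow from the quoted figures and can be checked with a short calculation. Only values stated in the text are used; the variable and function names are illustrative.

```python
# Values quoted in the text: mAP@0.5 (%), GFLOPs, parameters (M).
tcyolo = {"map50": 40.7, "gflops": 13.2, "params_m": 2.08}
lrds_yolo = {"map50": 43.6, "gflops": 24.1}
uav_detr = {"map50": 52.5, "gflops": 72.5, "params_m": 21.2}


def reduction_pct(ours, theirs):
    """Relative reduction of `ours` with respect to `theirs`, in percent."""
    return (1.0 - ours / theirs) * 100.0


flops_ratio = uav_detr["gflops"] / tcyolo["gflops"]        # "more than five times"
param_ratio = uav_detr["params_m"] / tcyolo["params_m"]    # "over ten times"
flops_cut_vs_lrds = reduction_pct(tcyolo["gflops"], lrds_yolo["gflops"])
flops_cut_vs_detr = reduction_pct(tcyolo["gflops"], uav_detr["gflops"])
map_gap_lrds = lrds_yolo["map50"] - tcyolo["map50"]
map_gap_detr = uav_detr["map50"] - tcyolo["map50"]

print(f"{flops_ratio:.2f}x FLOPs, {param_ratio:.2f}x params")
print(f"GFLOPs reduction: {flops_cut_vs_lrds:.1f}% vs LRDS-YOLO, {flops_cut_vs_detr:.1f}% vs UAV-DETR")
print(f"mAP@0.5 gap: {map_gap_lrds:.1f} and {map_gap_detr:.1f} percentage points")
```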

In summary, among the compared SOTA detectors, TCYOLO uniquely occupies an advantageous operating point on the accuracy–efficiency curve: it achieves the highest mAP@0.5 (40.7%) within the lightweight detector category (≤2.1 M parameters, ≤15 GFLOPs), while maintaining an exceptionally compact model size (4.7 MB) and real-time inference speed. This balanced performance profile stems directly from the synergistic design of C3k2-TA and CCFM-UAV, which jointly enhance small-object feature extraction without introducing excessive computational overhead—a combination that distinguishes TCYOLO from existing methods optimized predominantly for either accuracy or speed alone.

Figure 7 illustrates the trade-off between computational complexity and detection accuracy across different detection models. The red dashed line represents the high-performance threshold (mAP@0.5 > 40%), and the green dashed line indicates the efficiency threshold (FLOPs ≤ 20 × 10^{9}). It can be clearly observed from Figure 7 that TCYOLO falls exactly within the optimal performance region bounded by the two threshold lines, simultaneously satisfying the requirements of high accuracy (mAP@0.5 > 40%) and high efficiency (FLOPs ≤ 20 × 10^{9}). Compared with other detection models, TCYOLO achieves a better balance between accuracy and efficiency, attaining competitive detection accuracy while maintaining low computational overhead. This verifies the effectiveness of the proposed improvement strategy and offers an optimized solution for real-time detection in resource-constrained environments.
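The two thresholds in Figure 7 amount to a simple filter over (mAP@0.5, GFLOPs) pairs. The sketch below applies it to the detectors whose figures are quoted in the text of Section 3.3.4. It covers only that subset of Table 7, and the names are illustrative.

```python
MAP_THRESHOLD = 40.0      # high-performance threshold, mAP@0.5 in percent
GFLOPS_THRESHOLD = 20.0   # efficiency threshold, 20 x 10^9 FLOPs

# (mAP@0.5 in %, GFLOPs) as quoted in the text.
detectors = {
    "TCYOLO": (40.7, 13.2),
    "TOE-YOLO": (33.8, 6.6),
    "QueryDet": (31.6, 44.3),
    "LRDS-YOLO": (43.6, 24.1),
    "UAV-DETR": (52.5, 72.5),
}


def in_target_region(map50, gflops):
    """True when a detector is both accurate enough and cheap enough."""
    return map50 > MAP_THRESHOLD and gflops <= GFLOPS_THRESHOLD


selected = [name for name, (m, g) in detectors.items() if in_target_region(m, g)]
print(selected)
```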

3.3.5. Object Detection Visualization

This paper selects two typical low-altitude remote sensing scenarios of UAVs for testing, as shown in Figure 8. The first row illustrates a top-down small-object distribution scenario, while the second row presents a multi-scale dense object scenario. In the figure, the first column displays the inference results of the YOLOv11n baseline model, and the second column shows the inference results of the proposed TCYOLO model.

The top-down small-object distribution scenario represents a typical aerial top-down perspective, where objects are sparsely scattered, primarily including pedestrians in open square areas and vehicles on roads. The results indicate that due to the small target size and low contrast with the background, the YOLOv11n model exhibits noticeable missed detections for pedestrians in the square area. In contrast, the TCYOLO model demonstrates significantly improved detection performance for pedestrians in the same region, indicating that the enhanced model possesses stronger capabilities for small-object feature extraction and robustness against background interference.

The multi-scale dense object scenario involves a mixed distribution of large foreground targets and small distant targets, presenting higher demands on the model’s multi-scale detection capability [54]. As shown in the red boxed areas within the figure, the YOLOv11n model exhibits significant missed detections for distantly located, densely clustered small-scale objects (pedestrians and vehicles). In comparison, the TCYOLO model shows a marked improvement in detection recall within these areas, effectively reducing the miss rate for small objects. This validates the adaptability of the proposed improvements to multi-scale scenarios.

3.3.6. Comparison Experiment of Target Tracking Algorithms

This study proposes the TCYOLO detector and the SofByteTrack tracker, both of which contribute significant performance improvements in multi-object tracking tasks, as shown in Table 8.

From the perspective of detector enhancement, when TCYOLO is paired with the original ByteTrack tracker, it achieves a HOTA score of 42.9%, representing an improvement of 4.5 percentage points compared to the YOLOv11 + ByteTrack baseline. MOTA increases by 5.9 percentage points, and IDF1 reaches 53.5%. These results indicate that TCYOLO significantly improves target localization accuracy and feature representation capability, thereby providing higher-quality detection results for subsequent tracking.

Regarding the tracker improvement, the combination of YOLOv11 and SofByteTrack already shows a clear performance gain over other trackers. However, the synergistic combination of TCYOLO and SofByteTrack achieves the optimal performance, with HOTA reaching 45.3%, MOTA at 42.7%, and IDF1 as high as 57.8%. All three core metrics are the highest in the table, validating the superiority of the improved tracker in data association and trajectory management.

It is particularly noteworthy that the TCYOLO-SofByteTrack combination demonstrates exceptional performance in target management. The number of Mostly Tracked targets (MT) reaches 548, and the number of Identity Switches (IDSW) is reduced to 803, which correspond to improvements of 37.3% and 23.4%, respectively, compared to the baseline method. This significantly enhances the continuity and stability of the tracking trajectories. Although the inference speed of this combination is not the fastest, it still meets the requirements for real-time applications.

In summary, the TCYOLO detector and SofByteTrack tracker work in synergy to achieve comprehensive optimization in detection accuracy, tracking precision, and identity consistency for multi-object tracking tasks. This provides a high-precision, highly robust solution for multi-object tracking in complex scenarios [59].

3.3.7. Visualization of Target Tracking Algorithm

To validate the effectiveness of the proposed improved tracking algorithm, a visual comparative analysis was conducted using the sequence uav0000137_00458_v from the VisDrone2019-MOT dataset. This sequence possesses typical challenging characteristics. First, the UAV performs rapid horizontal translation during data acquisition, causing significant camera-induced global motion and substantial inter-frame pixel displacement. This places high demands on the algorithm’s motion modeling capability [60]. Second, the sequence covers complex environments such as urban roads and parking lots, containing targets of various scales including vehicles and pedestrians, which ensures good scene diversity. Therefore, this sequence can effectively evaluate the performance of the proposed optical flow-based motion prediction module and camera motion compensation mechanism in practical applications.

Figure 9 presents a comparison of tracking performance at frames 195, 200, and 205. The first row shows the results of the baseline method, YOLOv11 + ByteTrack, while the second row displays the results of the proposed improved algorithm, TCYOLO + SofByteTrack, where significant differences can be observed.

In terms of stability, as indicated by the red annotated regions, the bounding boxes for vehicle targets in the first row exhibit noticeable localization drift and size fluctuation under rapid UAV rotation, with the tracker failing to maintain a stable lock [61]. This primarily stems from the prediction bias of the traditional Kalman filter-based motion model under camera motion conditions. In contrast, the improved algorithm effectively suppresses abnormal bounding box fluctuations by introducing optical flow-based motion prediction and a camera motion compensation mechanism, thereby maintaining tracking continuity and stability.

Furthermore, regarding small-object detection, the baseline method shows severe missed detections for small-scale vehicle targets on distant streets [62]. Conversely, the improved algorithm, leveraging the multi-scale feature fusion capability of the TCYOLO detector, significantly enhances the detection rate and tracking robustness for small targets. These visual results corroborate the quantitative evaluation metrics, jointly validating the superiority of the proposed method in UAV tracking scenarios.

3.3.8. Algorithm Deployment Experiment

To objectively evaluate the practical deployment efficiency of the algorithm, benchmark testing was conducted on the NVIDIA Jetson Orin NX platform using the VisDrone2019 validation set. The comparison metrics included mAP@0.5, mAP@0.5:0.95, Precision (P), Recall (R), and FPS. Additionally, the optimization effect of the TensorRT precision calibration strategy on model inference performance was assessed. The testing environment was configured with a batch size of 1 and an input image size of 640 × 640. The specific comparison results are presented in Table 9.

TCYOLO achieves an mAP@0.5 of 41.7%, a 7.7 percentage point improvement over the baseline model (34.0%). Its recall (R) increases from 33.8% to 40.7%, indicating that the improved model reduces the miss rate for small objects. In terms of model size, TCYOLO is only 4.7 MB, 14.5% smaller than the 5.5 MB baseline, making it more suitable for deployment on edge devices. Regarding inference speed, the added feature enhancement modules lower the unoptimized TCYOLO model to 19.36 FPS, compared with 29.54 FPS for the unoptimized baseline. After FP16 precision calibration with TensorRT, however, the speed rises to 33.02 FPS, which meets real-time detection requirements while keeping accuracy nearly unchanged (mAP@0.5 of 41.7%). Overall, TCYOLO offers a clear advantage in detection accuracy and achieves an effective balance between precision and speed under FP16 deployment.
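The deployment figures quoted above can be re-derived from Table 9. The short calculation below uses only values copied from that table; the variable names are illustrative, and the per-frame latency is simply the reciprocal of the reported FPS.

```python
# Values copied from Table 9 (Jetson Orin NX, batch size 1, 640 x 640 input).
size_mb = {"YOLOv11n": 5.5, "TCYOLO": 4.7}
fps = {"TCYOLO_none": 19.36, "TCYOLO_fp16": 33.02}

# Relative model size reduction of TCYOLO versus the baseline.
size_reduction_pct = (size_mb["YOLOv11n"] - size_mb["TCYOLO"]) / size_mb["YOLOv11n"] * 100

# Speed-up obtained from FP16 and the corresponding per-frame latency.
fp16_speedup = fps["TCYOLO_fp16"] / fps["TCYOLO_none"]
latency_ms_none = 1000 / fps["TCYOLO_none"]
latency_ms_fp16 = 1000 / fps["TCYOLO_fp16"]

print(f"size reduction: {size_reduction_pct:.1f}%")
print(f"FP16 speed-up: {fp16_speedup:.2f}x")
print(f"latency: {latency_ms_none:.1f} ms -> {latency_ms_fp16:.1f} ms per frame")
```

Under these numbers, FP16 brings the per-frame latency from roughly 52 ms to roughly 30 ms, which is the change that moves TCYOLO above 30 FPS on this device.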

3.3.9. Generalization Experiment

This section compares the YOLOv11n and TCYOLO algorithms on a lane marking detection task using the self-collected UAV aerial road marking dataset.

As shown in Table 10, TCYOLO achieves an mAP@0.5 of 96.9%, 1.3 percentage points higher than YOLOv11n’s 95.6%. In terms of per-category accuracy, TCYOLO improves the Average Precision (AP) for yellow dotted lines, white dotted lines, and yellow solid lines by 0.9, 0.5, and 3.8 percentage points, respectively, with the gain for yellow solid lines being the most notable. Although the AP for white solid lines decreases slightly (from 99.5% to 99.3%), it remains at a high level. This indicates that, while maintaining high accuracy for easily detectable targets, TCYOLO enhances recognition of challenging samples, thereby improving the model’s robustness in complex road environments.

From the perspective of model complexity, TCYOLO has only 2.08 M parameters, which is 19.4% fewer than YOLOv11, and its model size is compressed to 4.8 MB, effectively reducing storage overhead and deployment costs. Furthermore, the overall precision of the model reaches 97.8%, further validating its comprehensive performance advantage. In summary, TCYOLO achieves an optimized balance between accuracy and efficiency, obtaining higher detection accuracy with a more lightweight model architecture.
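As a consistency check on Table 10, mAP@0.5 can be recomputed as the unweighted mean of the four per-class AP values, and the per-class gains and parameter reduction follow from simple differences. The snippet below uses only numbers from Table 10; the dictionary keys are illustrative names.

```python
# Per-class AP values (%) copied from Table 10.
ap = {
    "YOLOv11n": {"white_solid": 99.5, "yellow_dotted": 96.5,
                 "white_dotted": 90.9, "yellow_solid": 95.7},
    "TCYOLO": {"white_solid": 99.3, "yellow_dotted": 97.4,
               "white_dotted": 91.4, "yellow_solid": 99.5},
}

# mAP@0.5 as the unweighted mean of the per-class AP values.
mean_ap = {model: sum(v.values()) / len(v) for model, v in ap.items()}

# Per-class change in percentage points (TCYOLO minus YOLOv11n).
delta_pp = {c: round(ap["TCYOLO"][c] - ap["YOLOv11n"][c], 1) for c in ap["TCYOLO"]}

# Relative parameter reduction (2.58 M -> 2.08 M).
param_reduction_pct = (2.58 - 2.08) / 2.58 * 100

print(mean_ap)
print(delta_pp)
print(f"parameter reduction: {param_reduction_pct:.1f}%")
```

The recomputed means are about 95.65 and 96.9, in line with the reported 95.6% and 96.9% once rounding of the per-class values is taken into account.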

To validate the model’s performance, this paper selects an image with a complex background containing small-scale objects. The image was captured using an M300 RTK (SZ DJI Technology Co., Ltd., Shenzhen, China) drone equipped with a Zenmuse P1 (SZ DJI Technology Co., Ltd., Shenzhen, China) gimbal camera. The inference results are shown in Figure 10. On the left, the YOLOv11n model failed to detect the yellow dotted line under the tree shadows, whereas on the right, the TCYOLO model successfully identified it. This demonstrates that the TCYOLO model possesses stronger resistance to background interference compared to YOLOv11n.

Table 11 presents a performance comparison between the TCYOLO-SofByteTrack algorithm and the baseline method YOLOv11 + ByteTrack on the self-collected highway dataset. The improved algorithm achieves gains across multiple key metrics. HOTA increases from 66.53% to 70.55%, an improvement of 4.02 percentage points. IDSW decreases from 30 to 10, a reduction of 67%, which supports the effectiveness of the optical flow-based motion prediction module in improving ID stability. IDF1 rises from 82.45% to 86.50%, indicating better target association quality. Furthermore, MT increases from 85 to 90 and ML decreases from 17 to 11, reflecting the advantage of the TCYOLO detector in small target recognition. Although the introduction of optical flow computation reduces FPS from 27.42 to 21.23, this computational overhead is considered acceptable given the improvement in tracking quality.
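The trade-off described above can be made concrete by converting the FPS drop into added time per frame and by computing the identity-switch reduction directly. The values below are copied from Table 11; the variable names are illustrative.

```python
# Values copied from Table 11 (self-collected highway dataset).
baseline = {"HOTA": 66.53, "IDSW": 30, "FPS": 27.42}   # YOLOv11-ByteTrack
proposed = {"HOTA": 70.553, "IDSW": 10, "FPS": 21.23}  # TCYOLO-SofByteTrack

# HOTA gain in percentage points.
hota_gain_pp = proposed["HOTA"] - baseline["HOTA"]

# Relative reduction in identity switches.
idsw_reduction_pct = (baseline["IDSW"] - proposed["IDSW"]) / baseline["IDSW"] * 100

# Extra processing time per frame implied by the lower FPS.
extra_ms_per_frame = 1000 / proposed["FPS"] - 1000 / baseline["FPS"]

print(f"HOTA gain: {hota_gain_pp:.2f} pp")
print(f"IDSW reduction: {idsw_reduction_pct:.1f}%")
print(f"added cost: {extra_ms_per_frame:.1f} ms per frame")
```

Read this way, the complete pipeline spends roughly 10.6 ms more per frame than the baseline in exchange for two thirds fewer identity switches. Note that this difference covers both the heavier detector and the optical flow step, since Table 11 does not separate the two.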

To validate the effectiveness of the proposed algorithm on the self-collected highway dataset, Figure 11 presents a comparison of tracking performance at intervals of 5 frames in a typical highway scenario. The first row shows the results of the baseline method YOLOv11-ByteTrack, while the second row displays the results of the TCYOLO-SofByteTrack algorithm. This test scenario includes complex interfering factors such as tree shadows and road markings, which pose high demands on the robustness of the tracking algorithm.

Significant differences can be observed from the comparative results. Regarding tracking stability, the baseline method exhibits unstable bounding box positioning when handling road vehicle targets. Particularly under UAV viewpoint changes or shadow interference, the bounding boxes show noticeable positional drift and size variation. In contrast, TCYOLO-SofByteTrack effectively suppresses abnormal fluctuations in the detection boxes by employing the optical flow-based motion prediction module and the camera motion compensation mechanism, maintaining more stable and accurate target localization across consecutive frames. The experimental results validate the superior performance of the proposed method in handling complex background interference and camera motion in real-world highway monitoring scenarios.

4. Discussion

Aiming at the challenges of low detection accuracy for small objects and poor tracking stability in UAV low-altitude remote sensing, this paper proposes a multi-object detection and tracking algorithm named TCYOLO-SofByteTrack, based on improved YOLOv11 and ByteTrack. Specifically, we design a C3k2-TA feature enhancement mechanism and a CCFM-UAV cross-scale fusion architecture to significantly boost the detector’s capability in small-object recognition. Furthermore, a motion compensation strategy based on sparse optical flow is introduced into the ByteTrack framework to effectively mitigate the instability of target bounding boxes caused by rapid UAV motion [63].

Experimental results demonstrate that the proposed TCYOLO detector achieves a 7.4 percentage point improvement in mAP@0.5 for small-object detection compared to the baseline YOLOv11 (33.3% to 40.7%). In multi-object tracking tasks, TCYOLO-SofByteTrack attains HOTA, MOTA, and IDF1 scores of 45.3%, 42.7%, and 57.8%, respectively, which are 6.9, 6.9, and 11.2 percentage points higher than the YOLOv11n + ByteTrack baseline in Table 8. Additionally, the number of mostly tracked targets increases by 37.3% (399 to 548), while identity switches are reduced by 23.4% (1049 to 803). These results indicate that the proposed method achieves improved performance in detection accuracy, tracking precision, and identity consistency across multiple datasets, while maintaining practical applicability for UAV-based vision tasks.
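Because the summary above mixes two kinds of change (absolute differences for HOTA, MOTA, and IDF1, and relative changes for MT and IDSW), the short calculation below reproduces both from the first and last rows of Table 8. The helper names `pp_gain` and `rel_change_pct` are illustrative.

```python
def pp_gain(new, old):
    """Absolute difference between two percentages, in percentage points."""
    return new - old


def rel_change_pct(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100


# Rows copied from Table 8 (VisDrone tracking comparison).
baseline = {"HOTA": 38.4, "MOTA": 35.8, "IDF1": 46.6, "MT": 399, "IDSW": 1049}
proposed = {"HOTA": 45.3, "MOTA": 42.7, "IDF1": 57.8, "MT": 548, "IDSW": 803}

gains_pp = {k: round(pp_gain(proposed[k], baseline[k]), 1) for k in ("HOTA", "MOTA", "IDF1")}
mt_change = rel_change_pct(proposed["MT"], baseline["MT"])
idsw_change = rel_change_pct(proposed["IDSW"], baseline["IDSW"])

print(gains_pp)
print(f"MT: {mt_change:+.2f}%  IDSW: {idsw_change:+.2f}%")
```

The identity-switch change evaluates to about -23.45%, which the text reports as a 23.4% reduction. Expressed as relative changes instead, the HOTA gain of 6.9 percentage points would correspond to roughly 18% of the baseline value, so the two conventions should not be mixed when comparing against other work.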

5. Conclusions

Despite these promising results, this study has several limitations. First, the proposed method relies on high-quality annotated data, and its performance may degrade under severe occlusion or extreme lighting. Second, although the motion compensation strategy improves tracking stability, it introduces additional computational overhead that may hinder real-time performance on resource-constrained UAV platforms. Third, the current framework primarily uses visual information without incorporating multi-modal data (e.g., infrared or depth sensors), limiting its robustness in complex environments [64]. Moreover, while the proposed framework improves multi-scale feature representation through attention and cross-scale fusion mechanisms, it does not explicitly incorporate frequency-domain modeling or dedicated denoising constraints. This may limit robustness in scenarios with severe motion blur, atmospheric disturbance, or low signal-to-noise ratios commonly encountered in UAV remote sensing imagery. Additionally, the benchmarks and baseline models used in our experiments are primarily mainstream approaches in this field, and some advanced models tailored for specific subtasks or challenging scenarios may not have been fully compared. Thus, the generalizability of our method in those specialized contexts requires further investigation.

Future work will focus on several directions. First, lightweight optimization and hardware-aware acceleration strategies will be explored to further improve deployment efficiency on embedded UAV platforms. Second, multi-modal perception frameworks integrating infrared, depth, or radar information will be investigated to enhance robustness in adverse environments. Third, explicit frequency-domain modeling and denoising-guided feature learning will be introduced to improve resistance against motion blur and complex background interference in remote sensing imagery. Finally, future research will further investigate end-to-end collaborative optimization between detection and tracking modules to improve global temporal consistency and tracking robustness.

Author Contributions

Conceptualization, L.S. and J.H.; methodology, F.S.; software, F.S.; validation, F.S.; formal analysis, L.S.; investigation, F.S. and Z.X.; resources, F.S. and Z.X.; data curation, J.H., F.S. and Z.X.; writing—original draft preparation, Z.X.; writing—review and editing, L.S., F.S., J.H. and J.F.; visualization, J.F.; supervision, L.S., J.H. and J.F.; project administration, J.H.; funding acquisition, J.H. and J.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Inner Mongolia Autonomous Region (No. 2025LHMS06009), the Science and Technology Program of Inner Mongolia Autonomous Region (No. 2025YFHH0156), and the Science and Technology Program of Inner Mongolia Autonomous Region (No. 2023YFJM0002).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Liu, Z.; Hu, X.; Li, J.; Chen, J.; Huang, W.; Zhao, X.; Leung, V.C.M. Graph relation network for person counting in construction site using UAV. Appl. Soft Comput. J. 2021, 110, 107562.
2. Wang, K.; Wu, Q.; He, X.; Hu, C.; Chen, N. Optimizing UAV traffic monitoring routes during rush hours considering spatiotemporal variation of monitoring demand. Int. J. Geogr. Inf. Sci. 2022, 36, 2086–2111.
3. Ying, S.; Li, X.; Xie, Q. Secure and efficient V2ES authentication protocol and faster-RCNN based object detection scheme for connected autonomous vehicles. Veh. Commun. 2026, 58, 100997.
4. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 779–788.
5. Wang, G.; Chen, Y.; An, P. UAV-YOLOv8: A small-object-detection model based on improved YOLOv8 for UAV aerial photography scenarios. Sensors 2023, 23, 7190.
6. Yue, M.; Zhang, L.; Huang, J.; Zhang, H. Lightweight and Efficient Tiny-Object Detection Based on Improved YOLOv8n for UAV Aerial Images. Drones 2024, 8, 276.
7. Wang, X.; He, N.; Hong, C.; Wang, Q.; Chen, M. Improved YOLOX-X based UAV aerial photography object detection algorithm. Image Vis. Comput. 2023, 135, 104697.
8. Yan, H.; Kong, X.; Shimada, T.; Tomiyama, H. TOE-YOLO: Accurate and efficient detection of tiny objects in UAV imagery. J. Real-Time Image Process. 2025, 22, 194.
9. Deng, Z.; Ye, Y.; Guo, J. EHDC-YOLO: Enhancing Object Detection for UAV Imagery via Multi-Scale Edge and Detail Capture. Comput. Mater. Contin. 2026, 86, 069090.
10. Han, Y.; Wang, C.; Luo, H.; Wang, H.; Chen, Z.; Xia, Y.; Yun, L. LRDS-YOLO enhances small object detection in UAV aerial images with a lightweight and efficient design. Sci. Rep. 2025, 15, 22627.
11. Su, Z.; Wang, H.; Qi, L. Single-layer denoising Taylorformer for UAV nighttime tracking. In Image and Graphics; ICIG 2025; Lecture Notes in Computer Science; Springer: Singapore, 2026; Volume 16161, pp. 126–137.
12. Tian, S.; Zhang, B.; Cao, L.; Kang, L.; Tian, J.; Xing, X.; Shen, B.; Fan, C.; Du, K.; Fu, C.; et al. MFDAFF-net: Multiscale frequency-aware and dual attention-guided feature fusion network for UAV imagery object detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2025, 18, 10640–10656.
13. Wan, X.; Chen, F.; Gao, W.; He, Y.; Liu, H.; Li, Z. Efficient spectral-spatial fusion with multiscale and adaptive attention for hyperspectral image classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2025, 18, 1196–1211.
14. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; IEEE: New York, NY, USA, 2016; pp. 3464–3468.
15. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; IEEE: New York, NY, USA, 2017; pp. 3645–3649.
16. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. Bytetrack: Multi-object tracking by associating every detection box. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 1–21.
17. Zhao, D.; Hu, B.; Jiang, W.; Zhong, W.; Arun, P.V.; Cheng, K.; Zhao, Z.; Zhou, H. Hyperspectral Video Tracker based on Spectral Difference Matching Reduction and Deep Spectral Target Perception Features. Opt. Lasers Eng. 2025, 194, 109124.
18. Semerikov, S.O.; Nechypurenko, P.P.; Vakaliuk, T.A.; Mintii, I.S.; Kolhatin, A.O. Vision-Based Autonomous UAV Landing: A Comprehensive Review of Technologies, Techniques, and Applications. J. Intell. Robot. Syst. 2025, 111, 115.
19. Zhao, D.; Xu, X.; You, M.; Arun, P.V.; Zhao, Z.; Ren, J.; Wu, L.; Zhou, H. Local Sub-block Contrast and Spatial-spectral Gradient Features Fusion for Hyperspectral Anomaly Detection. Remote Sens. 2025, 17, 695.
20. Liu, K.-X.; Zhang, D.-X.; Wang, J.-Y.; Zheng, X.-Y.; Niu, F.-Q.; Wang, F.; Lin, Y.-S. YOLOv11n-ByteTrack: A method for counting farmed Holothurians based on unmanned surface vehicle video. Eng. Res. Express 2025, 7, 045259.
21. Misra, D.; Nalamada, T.; Arasanipalai, A.U.; Hou, Q. Rotate to attend: Convolutional triplet attention module. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 3139–3148.
22. Liu, Q.; Chen, B.; Hao, Z.; Li, X.; Xiang, L.; Liu, J. Joint Sparse Optical Flow Estimation and Keypoint Detection via Dual-task Imperative Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 48, 2659–2675.
23. Shi, H.; Zhu, Z.; Feng, X.; Xie, Y.; Guo, H.; Xue, P.; Wang, Y. A Detection Method of Pine Wilt Disease Based on Improved YOLOv11 With UAV Remote Sensing Images. Ecol. Evol. 2025, 15, e72823.
24. Chang, H.; Long, Y.; Guo, Y.; Li, Y.; Wang, J.; Zhang, K.; Huang, L. Enhancing small object detection in UAV aerial imagery through integration of global edge information and multi-scale feature enhancement. Meas. Sci. Technol. 2026, 37, 066106.
25. Zhang, Y.; Wu, G.; Shen, J. Precise tea leaf disease detection using UAV low-altitude remote sensing and optimized YOLO11 model. PLoS ONE 2026, 21, e0342545.
26. Zhang, W.; Liao, M. Cross-scale adaptive transformer with hierarchical feature synergy for aerial small object detection. Pattern Recognit. 2026, 173, 112822.
27. Li, F.; Zhang, Y.; Fan, Y. EABI-DETR: An Efficient Aerial Small Object Detection Network. Biomimetics 2025, 10, 770.
28. Zhao, D.; Gu, L.; Qian, K.; Zhou, H.; Yang, T.; Cheng, K. Target Tracking from Infrared Imagery via an Improved Appearance Model. Infrared Phys. Technol. 2020, 104, 103116.
29. Luu, T.T.; Huynh, A.D. An efficient lightweight multi-scale CNN framework with CBAM and SPP for bearing fault diagnosis. Intell. Syst. Appl. 2026, 29, 200628.
30. Qi, L.; Wang, X.; Wang, T.; Liu, S.; Li, T.; Li, X. SGBMYOLO: A high-precision obstacle detection algorithm for assistive navigation based on stereo vision and spatial attention mechanism. Appl. Soft Comput. 2026, 186, 114144.
31. Zhao, D.; Zhou, L.; Li, Y.; He, W.; Arun, P.V.; Zhu, X.; Hu, J. Visibility Estimation via Near-infrared Bispectral Real-time Imaging in Bad Weather. Infrared Phys. Technol. 2024, 136, 105008.
32. Nan, H.; Li, C. Fpa-yolov8s: An efficient small object detection algorithm for drone aerial imagery. Pattern Anal. Appl. 2025, 28, 187.
33. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Ling, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; IEEE: New York, NY, USA, 2019; pp. 213–226.
34. Yi, W.; Song, X.; Yan, L. CSP-YOLOv11: An Intelligent Traffic Target Detection Algorithm in Complex Scenes. IAENG Int. J. Comput. Sci. 2026, 53, 1093.
35. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; IEEE: New York, NY, USA, 2024; pp. 16965–16974.
36. Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-neck by GSConv: A lightweight-design for real-time detector architectures. J. Real-Time Image Process. 2024, 21, 62.
37. Bhat, Y.R.; Keller, K.L.; Brick, T.R.; Pearce, A.L. ByteTrack: A deep learning approach for bite count and bite rate detection using meal videos in children. Front. Nutr. 2025, 12, 1610363.
38. Li, R.; Liu, Q.; Wang, M.; Su, Y.; Li, C.; Ou, M.; Liu, L. Maize Kernel Batch Counting System Based on YOLOv8-ByteTrack. Sensors 2025, 25, 5584.
39. Ngo Bibinbe, A.M.S.; Bang, C.; Gagnon, P.; Ahloy-Dallaire, J.; Paquet, E.R. An HMM-Based Framework for Identity-Aware Long-Term Multi-Object Tracking From Sparse and Uncertain Identification: Use Case on Long-Term Tracking in Livestock. Int. J. Comput. Vis. 2026, 134, 107.
40. Yang, P.; Fan, Y.; Zhou, J.; Zheng, S.; Han, L. Vision-based estimation and tracking on fixed-wing multi-UAV systems. In Proceedings of the Chinese Control Conference (CCC), Chongqing, China, 28–30 July 2025; pp. 135–140.
41. Hou, M.; Wu, Y.; Shi, H.; Mu, X. A two stage multi object tracking algorithm with transformer and attention mechanism. Sci. Rep. 2025, 15, 31414.
42. Aharon, N.; Orfaig, R.; Bobrovsky, B.-Z. BoT-SORT: Robust Associations Multi-Pedestrian Tracking. arXiv 2022, arXiv:2206.14651.
43. Zhang, Y.; Da, F.; Zhou, H. Multi-Object Tracking Method Based on Domain Adaptation and Camera Motion Compensation. Electronics 2025, 14, 2238.
44. Wang, J.; He, G.; Dai, X.; Wang, F.; Zhang, Y. Vision-Based Highway Lane Extraction from UAV Imagery: A Deep Learning and Geometric Constraints Approach. Electronics 2025, 14, 3554.
45. Ma, L.; Xie, W. Enhanced RT-DETR: A lightweight model for aerial infrared small target detection. In Proceedings of the Chinese Control Conference (CCC), Chongqing, China, 28–30 July 2025; pp. 344–349.
46. Cui, T.; Li, W.; Liu, S.; Qu, C.; Sun, C.; Yang, X.; Sohel, F.; Li, W. An infrared multi-object tracking method for automatic circadian behavior analysis of adult Tuta absoluta. Smart Agric. Technol. 2025, 12, 101624.
47. Savas, S.; Baykal Kablan, E.; Ekinci, M.; Ayas, S.; Baykal Selçuk, L.; Aksu Arıca, D.; Gulsoy, T. Facial Botox Injection Point Detection Using YOLOv8 Enhanced with CBAM and BiFPN: A Multi-Perspective Deep Learning Approach. J. Imaging Inform. Med. 2026, 1–18.
48. Jiang, L.; Yuan, B.; Du, J.; Chen, B.; Xie, H.; Tian, J.; Yuan, Z. Mffsodnet: Multi-scale feature fusion small object detection network for uav aerial images. IEEE Trans. Instrum. Meas. 2024, 73, 5015214.
49. Huang, S.; Lu, Z.; Cun, X.; Yu, Y.; Zhou, X.; Shen, X. DEIM: DETR with Improved Matching for Fast Convergence. In Proceedings of the 2024 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 27–30 October 2024; IEEE: New York, NY, USA, 2024.
50. Xu, Z.; Zhao, H.; Liu, P.; Wang, L.; Zhang, G.; Chai, Y. SRTSOD-YOLO: Stronger Real-Time Small Object Detection Algorithm Based on Improved YOLO11 for UAV Imageries. Remote Sens. 2025, 17, 3414.
51. Zhang, H.; Liu, K.; Gan, Z.; Zhu, G. UAV-DETR: Efficient End-to-End Object Detection for UAV Imagery. arXiv 2025, arXiv:2501.01855.
52. Yang, C.; Huang, Z.; Wang, N.; Wang, X. QueryDet: Cascaded Sparse Query for Accelerating High-Resolution Small Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 13668–13677.
53. Gan, Y.; Ren, X.; Xu, J.; Lin, P.; Chen, Y. AR-DETR: A lightweight RT-DETR-based framework for robust aircraft detection in complex remote sensing imagery. Meas. Sci. Technol. 2026, 37, 045411.
54. Zhao, D.; Tang, L.; Arun, P.V.; Asano, Y.; Zhang, L.; Xiong, Y.; Tao, X.; Hu, J. City-Scale Distance Estimation via Near-Infrared Trispectral Light Extinction in Bad Weather. Infrared Phys. Technol. 2023, 128, 104507.
55. Maggiolino, G.; Ahmad, A.; Cao, J.; Kitani, K. Deep oc-sort: Multi-pedestrian tracking by adaptive re-identification. In Proceedings of the 2023 IEEE International Conference on Image Processing (ICIP), Kuala Lumpur, Malaysia, 8–11 October 2023; IEEE: New York, NY, USA, 2023; pp. 3025–3029.
56. Du, Y.; Zhao, Z.; Song, Y.; Zhao, Y.; Su, F.; Gong, T.; Meng, H. Strongsort: Make deepsort great again. IEEE Trans. Multimed. 2023, 25, 8725–8737.
57. Cao, J.; Pang, J.; Weng, X.; Khirodkar, R.; Kitani, K. Observation-centric sort: Rethinking sort for robust multi-object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 9686–9696.
58. Yang, M.; Han, G.; Yan, B.; Zhang, W.; Qi, J.; Lu, H.; Wang, D. Hybrid-sort: Weak cues matter for online multi-object tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 6504–6512.
59. Zhao, D.; Zhang, H.; Arun, P.V.; Jiao, C.; Zhou, H.; Xiang, P.; Cheng, K. SiamSTU: Hyperspectral Video Tracker based on Spectral Spatial Angle Mapping Enhancement and State Aware Template Update. Infrared Phys. Technol. 2025, 150, 105919.
60. Li, X.; Zhu, R.; Yu, X.; Wang, X. High-Performance Detection-Based Tracker for Multiple Object Tracking in UAVs. Drones 2023, 7, 681.
61. Pan, S.; Wang, H.; Li, D.; Zhang, Y.; Shiragapur, B.; Liu, X.; Yu, Q. A lightweight robust RGB-T object tracker based on Jitter Factor and associated Kalman filter. Inf. Fusion 2025, 117, 102842.
62. Li, S.; Guo, A.; Wang, Y.; Li, L.; Wang, G.; Wu, Z. Combining visual motion and luminance features to enhance the detection of small moving objects in a bioinspired model. PLoS Comput. Biol. 2026, 22, e1014036.
63. Wang, B.; Zhou, Y.; Sui, H.; Ma, G.; Cheng, P.; Wang, D. Multi-object tracking of vehicles and anomalous states in remote sensing videos: Joint learning of historical trajectory guidance and ID prediction. ISPRS J. Photogramm. Remote Sens. 2026, 233, 383–406.
64. Zhao, D.; Zhong, W.; Ge, M.; Jiang, W.; Zhu, X.; Arun, P.V.; Zhou, H. SiamBSI: Hyperspectral Video Tracker based on Band Correlation Grouping and Spatial-Spectral Information Interaction. Infrared Phys. Technol. 2025, 151, 106063.

Figure 1. TCYOLO structure diagram.

Figure 2. C3k2-TA structure diagram.

Figure 3. TA attentional structure diagram.

Figure 4. CCFM-UAV structure diagram.

Figure 5. ByteTrack flow diagram.

Figure 6. Representative examples (a) and statistical summary (b) of the UAV aerial road marking dataset.

Figure 7. Model performance comparison.

Figure 8. Visualized result graph.

Figure 9. Target tracking visualization graph.

Figure 10. Comparison of inference results on road markings.

Figure 11. Road sign tracking comparison chart.

Table 1. GSD at different flight altitudes (Camera P1 and H20).

Flight Altitude (m)	P1 GSD (cm/pixel)	H20 GSD (cm/pixel)
9	0.1125	0.318
10	0.125	0.353
11	0.1375	0.388

Table 2. Experimental parameter settings.

Parameter	Value
epoch	300
batch size	8
Image size	640 × 640
lr0	0.01
momentum	0.937
weight decay	0.0005
Workers	4
Optimizer	SGD

Table 3. Embedded device parameters.

Name	NVIDIA Jetson Orin NX
CPU	8-core ARM Cortex-A78
GPU	NVIDIA Ampere architecture GPU
Memory	16 GB
Computing power	157 TOPS
Operating System	Ubuntu 22.04 focal

Table 4. Attention comparison experiment.

Model	mAP@0.5/%	mAP@0.5:0.95/%	P/%	R/%	Params/M	FLOPs/G	Model Size/MB
YOLOv11n	33.3	19.5	43.5	33.2	2.58	6.3	5.5
SE	33.4	19.4	43.6	33.8	2.58	6.3	5.5
MoCAttention	33.1	19.3	43.3	33.8	2.79	6.5	5.9
CBAM	33.3	19.4	45.5	33.3	2.69	6.5	5.7
TA	33.5	19.5	44.3	33.4	2.58	6.6	5.5

Table 6. Detailed results of the ablation experiment.

Basic Model	C3k2-TA	CCFM-UAV	mAP@0.5/%	mAP@0.5:0.95/%	P/%	R/%	Params/M	FLOPs/G	Model Size/MB
YOLOv11n			33.3	19.5	43.5	33.2	2.58	6.3	5.5
	√		33.5	19.5	44.3	33.4	2.58	6.6	5.5
		√	40	24.3	50.8	38.7	2.08	13.0	4.7
	√	√	40.7	24.7	50	39.2	2.08	13.2	4.7

Table 7. Comparison Experiment on the VisDrone2019 Dataset.

Model	mAP@0.5/%	mAP@0.5:0.95/%	P/%	R/%	FLOPs/G	Params/M	Model Size/MB	FPS/(frame·s⁻¹)
YOLOv8n	33.0	19.2	43.7	33.8	8.2	3.01	6.3	145
YOLOv9t	33.0	19.3	43.8	33.1	6.4	1.73	4.2	100
YOLOv10n	33.1	19.2	43.4	33.1	8.4	2.69	5.8	135
YOLOv11n	33.3	19.5	44.2	34.2	6.3	2.58	5.5	141
YOLOv12n	33.7	19.4	44.5	33.6	6.3	2.56	5.3	119
YOLOv8s	39.7	23.8	50.3	38.7	28.6	11.13	22.5	112
YOLOv10s	39.2	23.7	49.6	37.8	24.5	8.04	16.5	109
YOLOv11s	39.5	23.7	52.0	37.7	21.3	9.41	19.2	106
YOLOv12s	39.9	23.9	50.4	39.1	21.2	9.23	18.0	119
MFFSODNet [48]	46.3	26.9	55.8	44.3	55.8	4.54	9.8	82
RT-DETR-L	47.8	29.2	61.7	46.5	88.9	30.30	57.9	74
DEIM [49]	39.1	22.2	—	—	7.1	3.70	14.7	110
Faster-RCNN	21.6	9.8	35.5	31.4	134.2	41.7	108	21
TOE-YOLO [8]	33.8	19.7	45.0	33.7	6.6	2.62	—	—
LRDS-YOLO [10]	43.6	26.6	53.3	41.6	24.1	4.17	—	—
SRTSOD-YOLO-n [50]	36.3	21.8	—	—	7.4	3.5	—	147
UAV-DETR [51]	52.5	32.7	65.1	50.8	72.5	21.2	41.3	70
QueryDet [52]	31.6	17.4	—	—	44.3	18.9	—	58
TCYOLO	40.7	24.7	50.0	39.2	13.2	2.08	4.7	129

Table 8. Tracking algorithm comparison experiment.

Detection Model	Tracker	HOTA ↑	MOTA ↑	MOTP ↑	MT ↑	ML ↓	IDF1 ↑	IDSW ↓	FPS ↑
YOLOv11n	ByteTrack	38.4	35.8	76.0	399	736	46.6	1049	48.3
	BoT-SORT [42]	40.5	33.6	77.4	379	822	49.4	215	20.5
	Deep OC-SORT [55]	42.0	37.6	76.4	436	695	52.8	746	18.7
	StrongSORT [56]	39.8	33.8	77.5	355	781	49.1	453	15.1
	DeepSORT	39.9	39.3	75.6	478	615	48.8	1673	20.2
	OC-SORT [57]	35.2	30.3	78.0	287	857	41.3	804	43.5
	ImprAssoc	39.6	38.7	75.7	537	581	44.8	3546	8.4
	Hybrid-SORT [58]	39.7	38.4	75.7	530	587	48.0	3473	17.6
	BoostTrack	35.9	27.6	76.4	273	945	43.2	305	8.0
	SofByteTrack	40.6	36.7	76.1	427	723	50.2	705	43.5
TCYOLO	ByteTrack	42.9	41.7	75.2	521	613	53.5	1224	36.9
TCYOLO	SofByteTrack	45.3	42.7	75.4	548	592	57.8	803	34.1

↑: higher values are better. ↓: lower values are better.

Table 9. Experimental results of the NVIDIA Jetson Orin NX platform.

Model	Precision Calibration	mAP@0.5/%	mAP@0.5:0.95/%	P/%	R/%	Model Size/MB	FPS
YOLOv11n	None	34	19.6	44.6	33.8	5.5	29.54
	FP32	33.8	19.5	44.3	33.6	12.4	33.70
	FP16	33.9	19.5	44.3	33.8	8.7	43.79
TCYOLO	None	41.7	24.9	50	40.7	4.7	19.36
	FP32	41.8	24.8	50.8	40.4	11.6	23.18
	FP16	41.7	24.7	50.5	40.5	7.8	33.02

Table 10. Comparison experiment on the self-collected dataset.

Model	AP (White Solid Line)/%	AP (Yellow Dotted Line)/%	AP (White Dotted Line)/%	AP (Yellow Solid Line)/%	mAP@0.5/%	Params/M	FLOPs/G	Model Size/MB	P/%
YOLOv11n	99.5	96.5	90.9	95.7	95.6	2.58	6.3	5.5	96.9
TCYOLO	99.3	97.4	91.4	99.5	96.9	2.08	13.2	4.8	97.8

Table 11. Tracking comparison experiment on the self-collected dataset.

Model	HOTA ↑	MOTA ↑	MOTP ↑	MT ↑	ML ↓	IDF1 ↑	IDSW ↓	FPS ↑
YOLOv11-ByteTrack	66.53	73.18	80.028	85	17	82.448	30	27.42
TCYOLO-SofByteTrack	70.553	74.118	80.639	90	11	86.498	10	21.23

↑: higher values are better. ↓: lower values are better.

