3.3.2. Neck Comparison Experiment
To validate the effectiveness of the proposed feature fusion method, this study compares CCFM-UAV with several mainstream feature fusion approaches. All compared methods are configured with the same small-object detection head for a fair assessment. The experimental results are presented in Table 5.
The compared methods include: (1) PAFPN: The feature fusion technique adopted in YOLOv11, which enhances information flow and aggregation between feature maps by adding horizontal connections along the top-down and bottom-up pathways, achieving multi-scale feature representation. (2) Slim-neck: An efficient neck architecture that utilizes GSConv and VoV-GSCSP modules. It maintains detection accuracy while reducing computational complexity through multi-scale feature fusion and hierarchical processing. (3) BiFPN: Building upon PAFPN, it assigns learnable weights to each input feature to optimize the fusion process. Its structure is optimized by removing single-input nodes, adding extra connections between input and output nodes at the same level, and treating each bidirectional path as a standalone feature network layer, thereby enhancing cross-scale connectivity [46].
Table 5. Comparison of neck networks.
| Model | mAP@0.5/% | mAP@0.5:0.95/% | P/% | R/% | Params/M | FLOPs/G | Model Size/MB |
|---|---|---|---|---|---|---|---|
| PAFPN | 36.3 | 21.9 | 47.6 | 35.3 | 2.6 | 10.2 | 5.8 |
| Slim-neck [47] | 35.6 | 21.3 | 46.4 | 35.1 | 2.65 | 9.8 | 5.8 |
| BiFPN | 39.3 | 23.7 | 49.4 | 37.6 | 2.76 | 12.8 | 6.0 |
| CCFM-UAV | 40 | 24.3 | 50.8 | 38.7 | 2.08 | 13.0 | 4.7 |
The experimental results demonstrate that the proposed CCFM-UAV improves detection performance while reducing model size. Compared with the original PAFPN neck, CCFM-UAV improves precision, recall, mAP@0.5, and mAP@0.5:0.95 by 3.2, 3.4, 3.7, and 2.4 percentage points, respectively, while reducing the parameter count from 2.6 M to 2.08 M and the model size from 5.8 MB to 4.7 MB. These gains come with an increase in computation from 10.2 to 13.0 GFLOPs.
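The differences quoted above can be reproduced directly from the PAFPN and CCFM-UAV rows of Table 5. The following is a minimal sketch rather than part of the original experiments; the names `TABLE5`, `FIELDS`, and `compare` are illustrative. It reports both the absolute difference (percentage points for the accuracy metrics) and the relative change, since the two are easy to conflate.

```python
# Rows copied from Table 5, in the table's column order.
FIELDS = ("mAP@0.5", "mAP@0.5:0.95", "P", "R", "Params/M", "FLOPs/G", "Size/MB")
TABLE5 = {
    "PAFPN":    (36.3, 21.9, 47.6, 35.3, 2.60, 10.2, 5.8),
    "CCFM-UAV": (40.0, 24.3, 50.8, 38.7, 2.08, 13.0, 4.7),
}


def compare(base, new):
    """Return {field: (absolute_delta, relative_change_percent)}."""
    result = {}
    for name, b, n in zip(FIELDS, base, new):
        result[name] = (round(n - b, 2), round(100.0 * (n - b) / b, 1))
    return result


deltas = compare(TABLE5["PAFPN"], TABLE5["CCFM-UAV"])
for name, (abs_delta, rel_delta) in deltas.items():
    print(f"{name:>13}: {abs_delta:+.2f} absolute, {rel_delta:+.1f}% relative")
```

For the accuracy columns the absolute delta is the figure to quote; for parameters, FLOPs, and model size the relative change is usually more informative.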
The parameter reduction mainly benefits from three structural optimizations. First, all intermediate feature maps in the neck are unified to a fixed width of 256 channels, which eliminates redundant high-dimensional feature transformations and reduces channel adaptation overhead. Second, CCFM-UAV replaces part of the concatenation-dominated fusion strategy in PAFPN with element-wise addition for same-scale feature fusion, thereby avoiding channel expansion after fusion and reducing the parameter burden of subsequent convolution layers. Third, lightweight 1 × 1 projection convolutions are employed before feature fusion and upsampling operations, which further decreases computational redundancy while preserving discriminative spatial information.
Although an additional P2 detection head is introduced, its parameter overhead remains limited because shallow backbone features are compressed before entering the fusion pathway. These improvements demonstrate that CCFM-UAV achieves a more effective balance between multi-scale feature representation capability and lightweight model design for UAV small-object detection tasks.
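To illustrate why replacing concatenation with element-wise addition saves parameters, the sketch below counts the weights of the convolution that follows a fusion step. It is a back-of-the-envelope example, not the CCFM-UAV implementation: the 256-channel width comes from the text, whereas the 3 × 3 kernel, the bias-free layer, and the helper name `conv_params` are assumptions made for illustration.

```python
def conv_params(c_in, c_out, k=1, bias=False):
    """Number of weights in a standard (non-grouped) k x k convolution."""
    return c_in * c_out * k * k + (c_out if bias else 0)


WIDTH = 256  # unified neck width stated in the text
KERNEL = 3   # assumed kernel size of the post-fusion convolution

# Concatenating two 256-channel maps doubles the input channels of the next layer.
after_concat = conv_params(2 * WIDTH, WIDTH, k=KERNEL)

# Element-wise addition keeps the channel count at 256.
after_add = conv_params(WIDTH, WIDTH, k=KERNEL)

print(f"conv after concat: {after_concat:,} weights")
print(f"conv after add:    {after_add:,} weights")
print(f"ratio:             {after_concat / after_add:.1f}x")
```

Under these assumptions the post-fusion convolution needs half as many weights after addition, independent of the kernel size, because only the input channel count changes.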
3.3.3. Ablation Experiment
Based on the YOLOv11n baseline model, this study conducted ablation experiments to validate the effectiveness of the proposed C3k2-TA module and CCFM-UAV module. The experimental results (Table 6) show that after independently introducing the C3k2-TA module, the precision increases by 0.8% while the number of parameters and the model size remain unchanged. This indicates that the module achieves marginal performance gains with minimal computational overhead.
When the CCFM-UAV module is introduced independently, mAP@0.5 improves significantly by 6.7%, and mAP@0.5:0.95 increases by 4.8%. Meanwhile, the number of parameters is reduced by 19.4%, and the model size decreases by 14.5%. This demonstrates that the CCFM-UAV module serves as the primary contributor to performance enhancement, effectively improving detection accuracy while achieving notable model compression.
When both modules are jointly applied, the model attains its optimal performance, with mAP@0.5 reaching 40.7% and mAP@0.5:0.95 reaching 24.7%, representing improvements of 7.4% and 5.2%, respectively, compared to the baseline model. In summary, the ablation experiments confirm the effectiveness of each proposed module and their synergistic contribution to performance gains.
3.3.4. Comparison Experiments of Different Detection Models
To further demonstrate the advantages of the proposed algorithm, in addition to YOLOv8n, models including YOLOv8s, YOLOv9t, YOLOv10n, YOLOv10s, YOLOv11n, and YOLOv11s were selected for experiments on the VisDrone2019 dataset. The relevant experimental results are presented in Table 7.
As Table 7 shows, models such as YOLOv8n, YOLOv9t, YOLOv10n, YOLOv11n, and the latest YOLOv12n are similar in scale to the proposed TCYOLO: they have few parameters and are easy to deploy, and some even achieve higher FPS. However, they exhibit limitations in feature representation, making them prone to false positives and missed detections, which results in significantly lower performance compared to the proposed model. For instance, compared to YOLOv11n, TCYOLO improves mAP@0.5, mAP@0.5:0.95, precision (P), and recall (R) by 7.4, 5.2, 5.8, and 5.0 percentage points, respectively, while reducing parameter count and model size by 19.4% and 14.5%. Even against the newest YOLOv12n, TCYOLO achieves a substantial gain of 7.0 percentage points in mAP@0.5 and 5.3 percentage points in mAP@0.5:0.95 with comparable model complexity, demonstrating its superior feature extraction capability.
Models such as YOLOv8s, YOLOv11s, and YOLOv12s are significantly larger and more complex than TCYOLO, making them difficult to deploy and run stably on embedded airborne devices of UAVs. Moreover, their detection accuracy and FPS are lower than those of the proposed model. For example, compared to YOLOv12s, TCYOLO improves both mAP@0.5 and mAP@0.5:0.95 by 0.8 percentage points while maintaining a much lower model complexity (2.08 M vs. 9.23 M parameters, 13.2 vs. 21.2 GFLOPs), which facilitates its deployment on embedded platforms.
MFFSODNet employs a purpose-designed Multi-scale Feature Extraction Module (MSFEM) to achieve significant performance gains, but its high computational overhead compromises real-time inference capability [44]. Among Transformer-based models, RT-DETR, an end-to-end real-time object detector evolved from DETR, achieves enhanced detection accuracy [53]. However, its high parameter count and computational demands restrict its applicability for stable operation on resource-constrained UAV onboard platforms. DEIM, an improved DETR model based on optimized matching mechanisms, demonstrates lower detection accuracy (39.1% mAP@0.5 and 22.2% mAP@0.5:0.95) compared to TCYOLO (40.7% mAP@0.5 and 24.7% mAP@0.5:0.95), while also incurring a notably larger model size (14.7 MB vs. 4.7 MB) and higher parameter count (3.7 M vs. 2.08 M). Faster R-CNN, as a representative two-stage algorithm, suffers from high model complexity (41.7 M parameters, 134.2 GFLOPs, and 108 MB) and exhibits particularly weak performance in small-object detection, with its mAP@0.5:0.95 (9.8%) substantially lower than that of TCYOLO (24.7%).
To further benchmark against recent state-of-the-art (SOTA) detectors, we include TOE-YOLO, LRDS-YOLO, SRTSOD-YOLO-n, UAV-DETR, and QueryDet for a comparative analysis focused on the accuracy-efficiency trade-off. The quantitative results are presented in Table 7.
Unlike methods that pursue extreme accuracy at the cost of high computational overhead, TCYOLO is specifically designed to achieve a favorable balance between detection performance and model complexity—a critical requirement for resource-constrained UAV platforms. Its incremental contributions over existing SOTA detectors lie in two unique architectural designs: (1) the C3k2-TA module, which enhances small-object feature representation via lightweight triplet attention, and (2) the CCFM-UAV neck, which strengthens cross-scale feature fusion while simultaneously reducing parameter count. These designs enable competitive accuracy without sacrificing inference speed or model compactness.
Among the compared lightweight SOTA methods, TOE-YOLO adopts rotated feature extraction and attention-based concatenation, maintaining a lightweight profile (6.6 GFLOPs, 2.62 M parameters). However, its detection accuracy (33.8% mAP@0.5, 19.7% mAP@0.5:0.95) falls significantly below that of TCYOLO, suggesting that its feature representation capacity is insufficient for the challenging small-object scenarios in UAV imagery. Similarly, QueryDet employs query-based sparse feature sampling to reduce redundant computations, yet its performance remains limited, achieving only 31.6% mAP@0.5 and 17.4% mAP@0.5:0.95 despite relatively high computational complexity (44.3 GFLOPs, 18.9 M parameters). This indicates that lightweight query-driven strategies alone are insufficient to effectively capture fine-grained UAV object features. SRTSOD-YOLO-n achieves the highest inference speed (147 FPS) with a comparable parameter budget, yet its accuracy (36.3% mAP@0.5, 21.8% mAP@0.5:0.95) remains moderate, indicating a trade-off that favors speed over detection quality. In contrast, TCYOLO surpasses TOE-YOLO, QueryDet, and SRTSOD-YOLO-n in mAP@0.5 by 6.9, 9.1, and 4.4 percentage points, respectively, while maintaining a real-time inference speed of 129 FPS and an exceptionally low parameter count (2.08 M) that is among the smallest of all compared detectors—a direct benefit of the parameter-efficient CCFM-UAV design.
On the other end of the spectrum, LRDS-YOLO and UAV-DETR achieve superior detection accuracy. LRDS-YOLO reaches 43.6% mAP@0.5 and 26.6% mAP@0.5:0.95 through lightweight downsampling and re-calibration mechanisms. UAV-DETR further improves detection performance, achieving the highest accuracy among all compared methods with 52.5% mAP@0.5 and 32.7% mAP@0.5:0.95, benefiting from transformer-based global feature modeling and enhanced long-range dependency learning. However, these methods incur substantially higher computational costs. LRDS-YOLO requires 24.1 GFLOPs, while UAV-DETR reaches 72.5 GFLOPs with 21.2 M parameters and a model size of 41.3 MB—far exceeding the resource budget of TCYOLO (13.2 GFLOPs, 2.08 M parameters, 4.7 MB). In particular, UAV-DETR introduces more than five times the computational complexity and over ten times the parameter count of TCYOLO, resulting in a lower inference speed of 70 FPS. Such computational burdens limit their deployability on UAV-embedded platforms where power consumption, memory footprint, and latency constraints are stringent. TCYOLO, while trading off approximately 3–12 percentage points in mAP@0.5, reduces GFLOPs by nearly half compared with LRDS-YOLO, and by more than 80% compared with UAV-DETR, while achieving a highly compact model size of only 4.7 MB, making it one of the most lightweight models in the evaluation.
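The trade-off described above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in this section; the dictionary `MODELS` and the function `tradeoff` are illustrative names, and LRDS-YOLO's parameter count is left as `None` because it is not stated here.

```python
# Figures quoted in the text (mAP@0.5 in %, compute in GFLOPs, parameters in millions).
MODELS = {
    "TCYOLO":    {"map50": 40.7, "gflops": 13.2, "params_m": 2.08},
    "LRDS-YOLO": {"map50": 43.6, "gflops": 24.1, "params_m": None},
    "UAV-DETR":  {"map50": 52.5, "gflops": 72.5, "params_m": 21.2},
}


def tradeoff(ours, other):
    """Accuracy given up and compute saved by `ours` relative to `other`."""
    summary = {
        "map50_gap_pts": round(other["map50"] - ours["map50"], 1),
        "gflops_saved_pct": round(100.0 * (1.0 - ours["gflops"] / other["gflops"]), 1),
        "gflops_ratio": round(other["gflops"] / ours["gflops"], 2),
    }
    if ours["params_m"] is not None and other["params_m"] is not None:
        summary["param_ratio"] = round(other["params_m"] / ours["params_m"], 2)
    return summary


for name in ("LRDS-YOLO", "UAV-DETR"):
    print(name, tradeoff(MODELS["TCYOLO"], MODELS[name]))
```

The output makes the claims above concrete: the accuracy gap falls in the stated range of roughly 3 to 12 percentage points, while the compute savings are about 45% against LRDS-YOLO and over 80% against UAV-DETR.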
In summary, among the compared SOTA detectors, TCYOLO uniquely occupies an advantageous operating point on the accuracy–efficiency curve: it achieves the highest mAP@0.5 (40.7%) within the lightweight detector category (≤2.1 M parameters, ≤15 GFLOPs), while maintaining an exceptionally compact model size (4.7 MB) and real-time inference speed. This balanced performance profile stems directly from the synergistic design of C3k2-TA and CCFM-UAV, which jointly enhance small-object feature extraction without introducing excessive computational overhead—a combination that distinguishes TCYOLO from existing methods optimized predominantly for either accuracy or speed alone.
Figure 7 illustrates the trade-off between computational complexity and detection accuracy across different detection models. The red dashed line represents the high-performance threshold (mAP@0.5 > 40%), and the green dashed line indicates the efficiency threshold (FLOPs ≤ 20 × 10^9). It can be clearly observed from Figure 7 that TCYOLO falls within the optimal performance region bounded by the two threshold lines, simultaneously satisfying the requirements of high accuracy (mAP@0.5 > 40%) and high efficiency (FLOPs ≤ 20 × 10^9). Compared with other detection models, TCYOLO achieves a better balance between accuracy and efficiency, attaining competitive detection accuracy while maintaining low computational overhead. This verifies the effectiveness of the proposed improvement strategy and offers an optimized solution for real-time detection in resource-constrained environments.
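The region test behind Figure 7 can be written as a simple predicate. The sketch below is an interpretation rather than the plotting code used for the figure: it assumes the accuracy threshold refers to mAP@0.5 > 40% and the efficiency threshold to 20 GFLOPs, and it includes only the models whose mAP@0.5 and GFLOPs are both quoted in this section. The names `in_optimal_region` and `MODELS` are illustrative.

```python
MAP50_MIN = 40.0   # assumed accuracy threshold: mAP@0.5 must exceed 40%
GFLOPS_MAX = 20.0  # assumed efficiency threshold: at most 20 GFLOPs

# (mAP@0.5 in %, GFLOPs) as quoted in the text.
MODELS = {
    "TCYOLO":    (40.7, 13.2),
    "TOE-YOLO":  (33.8, 6.6),
    "QueryDet":  (31.6, 44.3),
    "LRDS-YOLO": (43.6, 24.1),
    "UAV-DETR":  (52.5, 72.5),
}


def in_optimal_region(map50, gflops):
    """True if a model is both accurate enough and cheap enough."""
    return map50 > MAP50_MIN and gflops <= GFLOPS_MAX


optimal = [name for name, (m, g) in MODELS.items() if in_optimal_region(m, g)]
print("Models inside the optimal region:", optimal)
```

Among these five models only TCYOLO satisfies both conditions: TOE-YOLO is efficient but below the accuracy threshold, while LRDS-YOLO and UAV-DETR are accurate but exceed the compute budget.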
3.3.6. Comparison Experiment of Target Tracking Algorithms
This study proposes the TCYOLO detector and the SofByteTrack tracker, both of which contribute significant performance improvements in multi-object tracking tasks, as shown in Table 8.
From the perspective of detector enhancement, when TCYOLO is paired with the original ByteTrack tracker, it achieves a HOTA score of 42.9%, an improvement of 4.5 percentage points over the YOLOv11 + ByteTrack baseline. MOTA increases by 5.9 percentage points, and IDF1 reaches 53.5%. These results indicate that TCYOLO significantly improves target localization accuracy and feature representation capability, thereby providing higher-quality detection results for subsequent tracking.
Regarding the tracker improvement, the combination of YOLOv11 and SofByteTrack already shows a clear performance gain over other trackers. However, the synergistic combination of TCYOLO and SofByteTrack achieves the optimal performance, with HOTA reaching 45.3%, MOTA at 42.7%, and IDF1 as high as 57.8%. All three core metrics are the highest in the table, validating the superiority of the improved tracker in data association and trajectory management.
It is particularly noteworthy that the TCYOLO-SofByteTrack combination demonstrates exceptional performance in target management. The number of Mostly Tracked targets (MT) reaches 548, and the number of Identity Switches (IDSW) is reduced to 803, which correspond to improvements of 37.3% and 23.4%, respectively, compared to the baseline method. This significantly enhances the continuity and stability of the tracking trajectories. Although the inference speed of this combination is not the fastest, it still meets the requirements for real-time applications.
In summary, the TCYOLO detector and SofByteTrack tracker work in synergy to achieve comprehensive optimization in detection accuracy, tracking precision, and identity consistency for multi-object tracking tasks. This provides a high-precision, highly robust solution for multi-object tracking in complex scenarios [59].
3.3.7. Visualization of Target Tracking Algorithm
To validate the effectiveness of the proposed improved tracking algorithm, a visual comparative analysis was conducted using the sequence uav0000137_00458_v from the VisDrone2019-MOT dataset. This sequence possesses typical challenging characteristics. First, the UAV performs rapid horizontal translation during data acquisition, causing significant camera-induced global motion and substantial inter-frame pixel displacement. This places high demands on the algorithm’s motion modeling capability [60]. Second, the sequence covers complex environments such as urban roads and parking lots, containing targets of various scales including vehicles and pedestrians, which ensures good scene diversity. Therefore, this sequence can effectively evaluate the performance of the proposed optical flow-based motion prediction module and camera motion compensation mechanism in practical applications.
Figure 9 presents a comparison of tracking performance at frames 195, 200, and 205. The first row shows the results of the baseline method, YOLOv11 + ByteTrack, while the second row displays the results of the proposed improved algorithm, TCYOLO + SofByteTrack, where significant differences can be observed.
In terms of stability, as indicated by the red annotated regions, the bounding boxes for vehicle targets in the first row exhibit noticeable localization drift and size fluctuation under rapid UAV motion, and the tracker fails to maintain a stable lock [61]. This primarily stems from the prediction bias of the traditional Kalman filter-based motion model under camera motion conditions. In contrast, the improved algorithm effectively suppresses abnormal bounding box fluctuations by introducing optical flow-based motion prediction and a camera motion compensation mechanism, thereby maintaining tracking continuity and stability.
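The idea behind camera motion compensation can be illustrated with a deliberately simplified example. The sketch below is not the SofByteTrack implementation: it assumes a translation-only camera model and takes matched background points as given (in practice they would come from sparse optical flow, and an affine or homography model is commonly fitted instead). The function names `estimate_global_shift`, `compensate_box`, and `iou`, as well as the sample coordinates, are hypothetical.

```python
from statistics import median


def estimate_global_shift(prev_pts, curr_pts):
    """Median displacement of matched points; robust to a few moving objects."""
    dx = median(c[0] - p[0] for p, c in zip(prev_pts, curr_pts))
    dy = median(c[1] - p[1] for p, c in zip(prev_pts, curr_pts))
    return dx, dy


def compensate_box(box, shift):
    """Move an (x1, y1, x2, y2) box predicted in the previous frame's coordinates."""
    dx, dy = shift
    x1, y1, x2, y2 = box
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical matches: the scene shifts 12 px left; the last point lies on a moving vehicle.
prev_pts = [(100, 50), (200, 80), (300, 120), (400, 60), (250, 200)]
curr_pts = [(88, 50), (188, 81), (288, 120), (388, 60), (262, 200)]

shift = estimate_global_shift(prev_pts, curr_pts)
predicted = (150, 100, 190, 130)   # motion-model prediction that ignores camera motion
detection = (138, 100, 178, 130)   # where the parked vehicle actually appears
iou_before = iou(predicted, detection)
iou_after = iou(compensate_box(predicted, shift), detection)
print(f"shift={shift}, IoU before={iou_before:.2f}, after={iou_after:.2f}")
```

Using the median rather than the mean keeps a single independently moving object from biasing the estimated camera shift, which is why the outlier point does not affect the result here.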
Furthermore, regarding small-object detection, the baseline method shows severe missed detections for small-scale vehicle targets on distant streets [62]. Conversely, the improved algorithm, leveraging the multi-scale feature fusion capability of the TCYOLO detector, significantly enhances the detection rate and tracking robustness for small targets. These visual results corroborate the quantitative evaluation metrics, jointly validating the superiority of the proposed method in UAV tracking scenarios.
3.3.9. Generalization Experiment
This study conducts a systematic comparative evaluation of the performance between the YOLOv11 and TCYOLO algorithms on the lane detection task.
As shown in Table 10, TCYOLO demonstrates excellent detection accuracy, achieving an mAP@0.5 of 96.9%, an improvement of 1.3 percentage points over YOLOv11’s 95.6%. In terms of per-category detection accuracy, TCYOLO achieves gains for yellow dotted lines, white dotted lines, and yellow solid lines, with their Average Precision (AP) increasing by 0.9, 0.5, and 3.8 percentage points, respectively. The improvement for yellow solid lines is particularly notable. Although the AP for white solid lines shows a slight decrease, it remains at a high accuracy level of 99.3%. This indicates that while maintaining high accuracy for easily detectable targets, TCYOLO significantly enhances the recognition capability for challenging samples, thereby improving the model’s robustness in complex road environments.
From the perspective of model complexity, TCYOLO has only 2.08 M parameters, which is 19.4% fewer than YOLOv11, and its model size is compressed to 4.8 MB, effectively reducing storage overhead and deployment costs. Furthermore, the overall precision of the model reaches 97.8%, further validating its comprehensive performance advantage. In summary, TCYOLO achieves an optimized balance between accuracy and efficiency, obtaining higher detection accuracy with a more lightweight model architecture.
To validate the model’s performance, this paper selects an image with a complex background containing small-scale objects. The image was captured using an M300 RTK (SZ DJI Technology Co., Ltd., Shenzhen, China) drone equipped with a Zenmuse P1 (SZ DJI Technology Co., Ltd., Shenzhen, China) gimbal camera. The inference results are shown in Figure 10. On the left, the YOLOv11n model failed to detect the yellow dotted line under the tree shadows, whereas on the right, the TCYOLO model successfully identified it. This demonstrates that the TCYOLO model possesses stronger resistance to background interference compared to YOLOv11n.
Table 11 presents a performance comparison between the TCYOLO + SofByteTrack algorithm and the baseline method YOLOv11 + ByteTrack on the self-collected highway dataset. The experimental results demonstrate that the improved algorithm achieves significant enhancements across multiple key metrics. HOTA increases from 66.53% to 70.55%, an improvement of 4.02 percentage points; IDSW decreases from 30 to 10, a reduction of 67%, which supports the effectiveness of the optical flow-based motion prediction module in improving ID stability; and IDF1 rises from 82.45% to 86.50%, indicating improved target association quality. Furthermore, the enhancements in the MT and ML metrics demonstrate the advantage of the TCYOLO detector in small target recognition. Although the introduction of optical flow computation reduces FPS from 27.42 to 21.23, this computational overhead is considered acceptable given the significant improvement in tracking quality.
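The gains and the speed cost reported for the highway dataset can be recomputed from the quoted values. This is a minimal sketch with illustrative names (`BASELINE`, `PROPOSED`, `point_gain`, `relative_change`, `latency_ms`); it also converts FPS into per-frame latency, which is often the more useful figure when judging real-time suitability.

```python
# Values quoted from Table 11 in the text.
BASELINE = {"HOTA": 66.53, "IDF1": 82.45, "IDSW": 30, "FPS": 27.42}
PROPOSED = {"HOTA": 70.55, "IDF1": 86.50, "IDSW": 10, "FPS": 21.23}


def point_gain(metric):
    """Absolute difference (percentage points for HOTA and IDF1)."""
    return round(PROPOSED[metric] - BASELINE[metric], 2)


def relative_change(metric):
    """Relative change in percent with respect to the baseline."""
    return round(100.0 * (PROPOSED[metric] - BASELINE[metric]) / BASELINE[metric], 1)


def latency_ms(fps):
    """Average processing time per frame in milliseconds."""
    return round(1000.0 / fps, 1)


print("HOTA gain (points):", point_gain("HOTA"))
print("IDF1 gain (points):", point_gain("IDF1"))
print("IDSW change (%):   ", relative_change("IDSW"))
print("FPS change (%):    ", relative_change("FPS"))
print("Latency (ms):      ", latency_ms(BASELINE["FPS"]), "->", latency_ms(PROPOSED["FPS"]))
```

Expressed as latency, the optical flow step adds roughly 10 ms per frame, which puts the reported FPS reduction in more concrete terms.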
To validate the effectiveness of the proposed algorithm on the self-collected highway dataset, Figure 11 presents a comparison of tracking performance at intervals of 5 frames in a typical highway scenario. The first row shows the results of the baseline method YOLOv11 + ByteTrack, while the second row displays the results of the TCYOLO + SofByteTrack algorithm. This test scenario includes complex interfering factors such as tree shadows and road markings, which place high demands on the robustness of the tracking algorithm.
Significant differences can be observed from the comparative results. Regarding tracking stability, the baseline method exhibits unstable bounding box positioning when handling road vehicle targets. Particularly under UAV viewpoint changes or shadow interference, the bounding boxes show noticeable positional drift and size variation. In contrast, TCYOLO-SofByteTrack effectively suppresses abnormal fluctuations in the detection boxes by employing the optical flow-based motion prediction module and the camera motion compensation mechanism, maintaining more stable and accurate target localization across consecutive frames. The experimental results validate the superior performance of the proposed method in handling complex background interference and camera motion in real-world highway monitoring scenarios.