1. Introduction
Unmanned aerial vehicle (UAV) technology has revolutionized applications requiring ground target detection, spanning precision agriculture, traffic monitoring, disaster response, and security surveillance [1]. Recent reviews [2,3] have also demonstrated the excellent cross-domain applicability of UAVs. However, individual UAVs face inherent limitations, including restricted fields of view, susceptibility to occlusion, constrained computational resources, and vulnerability to environmental interference. These limitations are particularly pronounced in detecting small targets (occupying fewer than 20 pixels) within complex aerial imagery, where conventional detection methods exhibit significant performance degradation [4].
Collaborative perception through UAV swarms offers a compelling solution by leveraging distributed sensing capabilities and collective intelligence. Aggregating multi-perspective observations can enhance situational awareness, improve detection robustness, and increase coverage efficiency [5]. Nevertheless, conventional centralized coordination approaches introduce critical bottlenecks: massive raw data transmission strains communication bandwidth, centralized storage of sensitive imagery creates privacy vulnerabilities, and inherent scalability constraints arise from bandwidth limitations and single-point-of-failure risks.
To overcome these fundamental limitations of centralized paradigms, federated learning (FL) emerges as a disruptive solution tailored for UAV swarm perception. FL fundamentally reframes the collaborative learning process: instead of aggregating raw data, swarm members collaboratively train a shared model by exclusively exchanging model parameters or gradients. This approach inherently addresses the trilemma of bandwidth, privacy, and scalability. Its core lies in transmitting only compressed model updates instead of massive raw image data, reducing bandwidth consumption by several orders of magnitude—a critical advantage for bandwidth-constrained aerial networks where communication energy expenditure exceeds 90% of total consumption. Meanwhile, sensitive visual data remains entirely localized within individual UAVs, fundamentally eliminating risks of information leakage or malicious interception in security-sensitive applications such as border patrols. Furthermore, the decentralized architecture removes single points of failure associated with central servers, ensuring the UAV swarm maintains collaborative operational capabilities even under dynamic network partitioning or partial member disconnections.
A primary challenge stems from severe data heterogeneity and distribution shifts: UAV swarms operate across diverse geographical regions, weather conditions, and dynamic mission scenarios, resulting in highly non-Independent and Identically Distributed (non-IID) data distributions. Traditional FL algorithms suffer considerable performance degradation under such heterogeneity, with studies reporting accuracy drops exceeding 50% [6]. UAV mobility further exacerbates this issue by inducing temporal distribution shifts during environmental transitions. Compounding this challenge are stringent resource constraints and the need for communication efficiency. UAV platforms operate under tight computational, memory, and energy budgets. The frequent exchange of large model parameters in conventional FL imposes prohibitive communication costs, particularly problematic in bandwidth-constrained aerial environments, where communication overhead can account for up to 90% of total training energy consumption [7]. Furthermore, the inherent limitations of detecting small targets in aerial imagery present a distinct obstacle. Minimal target footprints, variations in altitude, and complex backgrounds challenge standard convolutional architectures, which lose critical spatial information during downsampling, severely compromising detection accuracy for small targets.
To overcome the interconnected challenges of data privacy, severe non-IID data distributions, and the computational constraints of UAVs in ground target detection, we propose FedGTD-UAVs. While federated learning addresses privacy, its performance plummets under non-IID data and high communication costs. Similarly, existing detection models often lose critical spatial information through downsampling or lack the contextual reasoning needed for cluttered scenes without being too heavy for UAVs. Our framework bridges these gaps through three key contributions:
Unlike existing methods that fail to resolve these issues, we introduce the first federated transfer learning (FTL) framework explicitly tailored for collaborative perception in UAV swarms, ensuring robust performance under challenging non-IID data distributions while safeguarding data privacy.
Compared to prevailing detectors (e.g., YOLO series, RT-DETR), we present a computationally efficient detection architecture that uniquely combines SPD-Conv for enhanced spatial feature preservation and GCNet for contextual attention, optimized for on-device execution.
Rigorous empirical validation on established benchmarks (VisDrone2019 and CARPK) demonstrates the framework’s state-of-the-art performance, achieving 44.2% mAP@0.5 (a 12.1% improvement over YOLOv8s) while operating at 217 FPS, with in-depth ablation studies further validating the effectiveness of each component.
The remainder of this paper is structured as follows. Section 2 reviews foundational work in FL and UAV target detection. Section 3 details the FedGTD-UAVs methodology and training protocols. Comprehensive experimental evaluations, including ablation studies and comparative analysis, are presented in Section 4. Section 5 concludes the paper, summarizing key findings and outlining future research directions.
4. Experimental Validation and Analysis
To evaluate the efficacy and generalizability of FedGTD-UAVs for collaborative perception in UAV swarms, we conduct comprehensive experiments across multiple dimensions. This section details the experimental setup, including (1) benchmark datasets and preprocessing, (2) implementation details (hardware and software), and (3) evaluation metrics. Our analysis examines three key aspects: component contributions via ablation studies, FL robustness under non-IID data distributions, and comparisons with state-of-the-art methods.
4.1. Dataset Construction and Preprocessing
We validate FedGTD-UAVs using two complementary aerial datasets with distinct characteristics.
VisDrone2019, meticulously curated and publicly released by the AISKYEYE Team (Tianjin, China), has emerged as a widely adopted authoritative benchmark in the field of UAV visual analysis. It includes 261,908 frames from 14 Chinese cities, spanning urban to rural scenes, varying object densities, and 10 categories with a long-tail distribution. It allocates 6471 images for training, 548 for validation, and 3190 for testing. As shown in Figure 6, cars dominate (62.3%), while awning-tricycles are rare (0.7%), mirroring real-world patterns and posing generalization challenges.
CARPK is a widely used benchmark for drone-view vehicle detection in large-scale parking lots. It contains 89,777 vehicle instances captured at a 40 m altitude (1280 × 720 resolution) across four parking lots. We apply an 8:3:3 split, yielding 1120 training, 420 validation, and 420 testing samples for domain adaptation evaluation.
To emulate real-world FL challenges, we apply two partitioning strategies. Stratified IID partitioning preserves class proportions per UAV via stratified sampling, with a reserved global test set.
For non-IID scenarios, we use Dirichlet partitioning after merging the training and validation sets, with the Dirichlet concentration parameter α defining the per-client sub-dataset weights. Smaller α values produce highly skewed and class-imbalanced client data, effectively mimicking challenging but realistic UAV swarm scenarios where devices operate in distinct regions or perform different tasks. By focusing on small α values, this strategy generates partitions with heterogeneity, class imbalance, and dependencies, as shown in Figure 7. The original test set remains unchanged for evaluation.
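To make the partitioning procedure concrete, the sketch below shows one common way to realize Dirichlet-based non-IID splits with NumPy. The client count, the α value, and the reduction of each image to a single label are illustrative simplifications, not the exact protocol used in our experiments.

```python
import numpy as np

def dirichlet_partition(labels, num_clients=4, alpha=0.5, seed=0):
    """Assign sample indices to clients using a per-class Dirichlet prior.

    Smaller alpha yields more skewed, class-imbalanced client shards.
    (Illustrative sketch: num_clients and alpha are assumed values, and each
    image is reduced to a single label, unlike full detection annotations.)
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]

    for c in np.unique(labels):
        idx_c = np.where(labels == c)[0]
        rng.shuffle(idx_c)
        # Mixing weights of this class over the clients, drawn from Dir(alpha).
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        # Turn the proportions into split points within this class.
        cuts = (np.cumsum(proportions)[:-1] * len(idx_c)).astype(int)
        for client_id, shard in enumerate(np.split(idx_c, cuts)):
            client_indices[client_id].extend(shard.tolist())

    return [np.array(ix) for ix in client_indices]

# Toy example: 4 UAV clients over 1000 images with 10 classes.
toy_labels = np.random.default_rng(1).integers(0, 10, size=1000)
shards = dirichlet_partition(toy_labels, num_clients=4, alpha=0.5)
print([len(s) for s in shards])
```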
This dual-dataset approach assesses performance in complex environments (VisDrone2019), task-specific generalization (CARPK), and resilience to data skew (Dirichlet partitioning), addressing key gaps in aerial federated perception.
4.2. Experimental Setup
For reproducibility and fair comparison, all experiments use the hardware and software configurations in Table 1. Dataset-specific protocols include 300 epochs for VisDrone2019 and 100 for CARPK, with mosaic augmentation disabled in the final 10 epochs to avoid feature distortion. Hyperparameters are consistent across experiments and detailed in Table 2.
This setup ensures (1) reproducibility via containerization, (2) efficient convergence through tailored epochs, and (3) stable training by disabling augmentation late-stage. The batch size of 16 balances GPU memory and gradient stability.
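As a concrete illustration of these settings, a single local training run under the Ultralytics YOLOv8 interface might look like the sketch below; the dataset YAML path is a placeholder, and the SPD-Conv/GCNet modifications of our architecture are omitted for brevity.

```python
from ultralytics import YOLO

# Illustrative local-training call (placeholder paths; the SPD-Conv and GCNet
# modifications used in FedGTD-UAVs are not shown here).
model = YOLO("yolov8s.pt")

model.train(
    data="VisDrone.yaml",  # dataset config file (placeholder)
    epochs=300,            # 300 for VisDrone2019, 100 for CARPK
    batch=16,              # balances GPU memory and gradient stability
    imgsz=640,
    close_mosaic=10,       # disable mosaic augmentation in the final 10 epochs
)
```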
4.3. Performance Evaluation Metrics
We assess detection accuracy, efficiency, and complexity. Accuracy metrics are as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$

$$AP = \int_{0}^{1} P(R)\,\mathrm{d}R, \qquad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i,$$

where TP, FP, and FN are true positives, false positives, and false negatives; N is the number of classes; and AP_i is the AP for class i.
We emphasize mAP@0.5 (IoU threshold 0.5) and mAP@0.5:0.95 (averaged over IoU 0.5 to 0.95 in 0.05 steps) for localization precision in UAV tasks.
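A minimal sketch of how these accuracy metrics are aggregated is shown below; the routine that matches predictions to ground truth and produces mAP at each IoU threshold is assumed to exist, so only the averaging steps are spelled out.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def mean_ap(per_class_ap):
    """mAP = (1/N) * sum of AP_i over the N classes."""
    return float(np.mean(per_class_ap))

def map_50_95(map_at_iou):
    """Average mAP over the 10 IoU thresholds 0.50, 0.55, ..., 0.95.

    `map_at_iou` is any callable returning mAP at a given IoU threshold;
    the underlying matching/AP computation is assumed.
    """
    thresholds = np.linspace(0.50, 0.95, 10)
    return float(np.mean([map_at_iou(t) for t in thresholds]))
```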
Efficiency is measured by frames per second (FPS) for real-time inference. Complexity is quantified by FLOPs per inference and parameter count (Params), reflecting resource demands on UAV hardware.
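The sketch below shows one straightforward way to obtain the parameter count and a rough FPS figure for a PyTorch model; FLOPs would additionally require a profiler (e.g., thop or fvcore), and the numbers depend on input resolution, batch size, and hardware, so this is illustrative only.

```python
import time
import torch

def measure_efficiency(model, input_size=(1, 3, 640, 640), warmup=10, iters=100):
    """Return (parameters in millions, frames per second) for one model."""
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    device = next(model.parameters()).device
    x = torch.randn(*input_size, device=device)
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):          # warm-up iterations are not timed
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()
    fps = iters / (time.perf_counter() - start)
    return params_m, fps
```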
This framework evaluates detection reliability (mAP variants), latency (FPS), and deployability (FLOPs/Params), addressing UAV swarm needs.
4.4. Ablation Study of Architectural Innovations
Given YOLOv8’s leading industrial performance and benchmark status in academia, we selected it as the core baseline model to rigorously evaluate the generalization capability and deployment value of our architectural innovations in complex real-world scenarios. Comprehensive ablation studies (Table 3) quantitatively validate the contribution of each architectural innovation within the FedGTD-UAVs framework to overall performance.
Firstly, the SPD-Conv module significantly enhances small-target detection capabilities—essential for UAV applications—yielding a 3.1% absolute improvement in mAP@0.5 by preserving fine-grained spatial details during downsampling. Complementing this, the GCNet attention mechanism improves precision by 0.4% through contextual reasoning that effectively mitigates false positives in cluttered environments. Most notably, the FTL variant alone brings a substantial gain of 9.5% in mAP@0.5, underscoring the paramount importance of federated knowledge transfer in overcoming individual data limitations.
Secondly, the pairwise combinations reveal that FTL serves as a strong base for integration. While combining SPD-Conv and GCNet yields an additive effect (+3.5% vs. baseline), integrating either module with FTL results in higher performance gains (+4.6% and +4.7%, respectively), indicating that FTL provides a more robust feature representation for the subsequent modules to build upon.
Most notably, the full integration of all three components achieves a 12.1% absolute improvement in mAP@0.5, reaching 44.2%. Although this gain is slightly below the naive sum of the individual gains (3.1% from SPD-Conv, 0.4% from GCNet, and 9.5% from FTL), it substantially exceeds any single or pairwise combination, indicating that the components complement rather than duplicate one another: SPD-Conv preserves critical target details, GCNet suppresses environmental noise and false positives, and FTL integrates cross-domain expertise across the UAV swarm, so the whole clearly outperforms any partial integration.
Operationally, despite the computational overhead introduced by GCNet (adding 1.6M parameters), the FedGTD-UAV maintains real-time viability at 217.3 FPS (4.6 ms latency) on embedded platforms. Crucially, FTL alone delivers 78.5% of the FedGTD-UAVs’ accuracy gain with minimal computational penalty, providing substantial deployment flexibility for resource-constrained UAV platforms where model size and inference speed are paramount.
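To make the spatial-preservation mechanism behind the SPD-Conv gain discussed above concrete, the following is a minimal PyTorch sketch of the SPD-Conv idea (space-to-depth slicing followed by a non-strided convolution); the channel widths, kernel size, and activation are illustrative assumptions rather than the exact configuration used in our network.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth downsampling followed by a non-strided convolution.

    Instead of strided convolution or pooling, the feature map is sliced into
    2x2 sub-maps and stacked along channels, so no pixel information is
    discarded before the convolution (a minimal sketch of the SPD-Conv idea;
    channel sizes and kernel size are illustrative).
    """

    def __init__(self, in_channels, out_channels):
        super().__init__()
        # Space-to-depth multiplies channels by 4 while halving H and W.
        self.conv = nn.Conv2d(4 * in_channels, out_channels,
                              kernel_size=3, stride=1, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.SiLU()

    def forward(self, x):
        # Slice into four spatial sub-maps and concatenate on the channel axis.
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.act(self.bn(self.conv(x)))

# e.g. a 64-channel 80x80 map becomes a 128-channel 40x40 map
y = SPDConv(64, 128)(torch.randn(1, 64, 80, 80))
print(y.shape)  # torch.Size([1, 128, 40, 40])
```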
4.5. Validating the Effectiveness of GCNet
To evaluate the effectiveness of the GCNet attention mechanism, comparative experiments were conducted under identical experimental conditions against other mainstream attention mechanisms such as SE, HAM, and AFF. All experiments were systematically evaluated on the VisDrone2019 dataset.
Table 4 provides a detailed summary of various performance metrics after integrating each attention mechanism into the model.
Experimental results indicate that although GCNet has a slightly deeper architecture than the SE mechanism, it significantly outperforms mainstream attention mechanisms such as SE in both detection accuracy and real-time performance, demonstrating superior overall performance. Furthermore, compared with the baseline model, the lightweight GCNet introduces only a marginal increase in parameter count, which remains within an acceptable range. These findings demonstrate that the proposed lightweight GCNet effectively captures long-range dependencies in complex backgrounds, thereby markedly enhancing robustness when detecting occluded targets.
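For reference, a minimal PyTorch sketch of the global context block underlying GCNet is given below; the reduction ratio and placement are illustrative assumptions, not the exact lightweight variant integrated into our model.

```python
import torch
import torch.nn as nn

class GlobalContextBlock(nn.Module):
    """Lightweight global context attention (a sketch of the GCNet idea).

    Context modeling: a 1x1 conv + softmax pools the whole feature map into a
    single context vector; a bottleneck transform then produces a channel-wise
    modulation that is added back to every position, capturing long-range
    dependencies at near-SE cost. The reduction ratio is an assumed value.
    """

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.context_mask = nn.Conv2d(channels, 1, kernel_size=1)
        hidden = max(channels // reduction, 1)
        self.transform = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.LayerNorm([hidden, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        # Attention weights over all H*W positions.
        mask = self.context_mask(x).view(b, 1, h * w).softmax(dim=-1)   # (B,1,HW)
        # Weighted sum of features -> one global context vector per image.
        context = torch.bmm(x.view(b, c, h * w), mask.transpose(1, 2))  # (B,C,1)
        context = context.view(b, c, 1, 1)
        # Fuse the transformed context back into every spatial position.
        return x + self.transform(context)

y = GlobalContextBlock(128)(torch.randn(2, 128, 40, 40))
print(y.shape)  # torch.Size([2, 128, 40, 40])
```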
4.6. Federated Learning Performance Under Data Heterogeneity
Under homogeneous data conditions (Table 5), our federated framework exhibits robust knowledge aggregation capabilities. The global model attains 44.2% mAP@0.5—comparable to the local models (43.6–44.5%)—while demonstrating superior performance in the more stringent mAP@0.5:0.95 metric (29.5% vs. 28.5–29.1%). This represents a 0.4–1.0% improvement in localization precision, underscoring that collaborative learning enhances model robustness even in data-homogeneous environments, where local models already exhibit strong performance.
Under heterogeneous conditions with severe data skew (Table 6), the federated framework displays remarkable resilience to distributional disparities. Local models trained on skewed distributions exhibit substantial performance fragmentation, with mAP@0.5 varying dramatically from 37.9% to 59.2%. In contrast, the global model achieves a consistent 43.6% mAP@0.5, surpassing three of the local UAV models by substantial absolute margins of 3.9–5.7%, while maintaining balanced precision (64.2%) and recall (31.7%). The 28.1% mAP@0.5:0.95 further attests to enhanced localization stability amid challenging data heterogeneity. Notably, while the proposed framework demonstrates strong performance in addressing statistical heterogeneity (non-IID data), evaluating its robustness under adversarial conditions (e.g., Byzantine attacks) will constitute a critical extension for future research, particularly for deployment in safety-critical scenarios.
To further validate the superiority of our framework, we compared FedGTD-UAVs with two classic federated learning algorithms under the same non-IID setting. As shown in Table 7, FedAvg and FedProx achieve only marginal improvements over isolated local training and perform poorly under severe data heterogeneity. This is because they aggregate the entire model, and under non-IID data such aggregation can cause model distortion due to client drift. In contrast, our method achieves a significantly higher mAP@0.5, outperforming FedAvg by 3.5% and FedProx by 3.9%. Moreover, since our method only requires the transmission of a small subset of critical parameters, it is far better suited to bandwidth-constrained UAV networks than transmitting the full model.
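To illustrate this difference, the sketch below contrasts plain FedAvg over full model state dictionaries with an aggregation step restricted to a chosen subset of keys. The selection rule shown (sharing only the last layer of a toy model) is purely hypothetical and stands in for whatever "critical parameter" criterion the framework actually applies.

```python
import copy
import torch

def fedavg(client_states, weights):
    """Weighted FedAvg over full model state_dicts (baseline for comparison)."""
    total = sum(weights)
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        avg[key] = sum(w * s[key].float() for s, w in zip(client_states, weights)) / total
    return avg

def aggregate_partial(global_state, client_updates, weights, shared_keys):
    """Aggregate only a selected subset of parameters (sketch).

    `shared_keys` stands in for whatever critical-parameter selection the
    framework uses; the remaining parameters stay local, which is what cuts
    the per-round communication cost relative to full-model exchange.
    """
    total = sum(weights)
    new_state = copy.deepcopy(global_state)
    for key in shared_keys:
        new_state[key] = sum(w * u[key].float() for u, w in zip(client_updates, weights)) / total
    return new_state

# Toy usage with two "clients" of a tiny model.
net = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.Conv2d(8, 8, 3))
clients = [copy.deepcopy(net).state_dict() for _ in range(2)]
shared = [k for k in net.state_dict() if k.startswith("1.")]  # hypothetical: share only the last layer
merged = aggregate_partial(net.state_dict(), clients, weights=[1.0, 1.0], shared_keys=shared)
```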
As illustrated in Figure 8, our framework attains significantly accelerated convergence, reaching optimal performance in 200 epochs—33% faster than centralized training. This efficiency arises from parallelized learning across UAV clients, where local updates build momentum that is consolidated during global aggregation, thereby preserving robustness against data heterogeneity.
Table 8 underscores a key advantage of our federated approach: superior localization precision, as evidenced by 29.5% mAP@0.5:0.95 under IID conditions—a 10.5% relative improvement over centralized training—while maintaining comparable classification confidence.
These results substantiate that, notwithstanding non-IID discrepancies, the framework sustains performance levels comparable to those under IID conditions, demonstrating strong robustness for perception tasks in dynamic environments.
4.7. Benchmarking Against State-of-the-Art Methods
The category-specific analysis in Table 9 demonstrates substantial improvements in safety-critical classes, including trucks (+15.9% mAP@0.5), motorcycles (+15.8%), and pedestrians (+15.1%). These enhancements stem from our architectural innovations: SPD-Conv bolsters small-target detection, GCNet mitigates occlusion-related ambiguities, and FTL facilitates effective cross-domain adaptation.
Comprehensive benchmarking in Table 10 establishes new performance standards: 44.2% mAP@0.5 on VisDrone2019 (+12.1% vs. YOLOv8s) and 73.2% mAP@0.5:0.95 on CARPK (+7.2% vs. state-of-the-art), achieved with a 54.7% reduction in computational requirements compared to YOLOv7.
Figure 9 illustrates the consistent superiority of our approach across varying confidence thresholds, maintaining over 60% precision at 70% recall—a crucial attribute for safety-sensitive UAV operations, where minimizing false negatives is paramount.
To provide an intuitive visual assessment, we present qualitative comparisons of our proposed FedGTD-UAVs model against leading detection models on representative scenes from the VisDrone2019 dataset. These encompass varying lighting conditions (e.g., bright daylight and low-light nighttime) and occlusion levels (e.g., minimal and dense). The results are illustrated in Figure 10, Figure 11 and Figure 12.
Across these comparisons, FedGTD-UAVs consistently outperforms the baselines. In bright daylight with minimal occlusion (Figure 10), our model demonstrates superior target localization; in the right half of the figure in particular, other models suffer from missed or false detections, whereas ours remains markedly more accurate. This is likely attributable to SPD-Conv's enhancement of spatial details and GCNet's modeling of global context. Under low-light nighttime conditions (Figure 11), where baselines exhibit degraded performance due to blurred features, FedGTD-UAVs maintains robust detection, benefiting from federated transfer learning's cross-domain knowledge fusion for generalized representations; for instance, on the highway in the middle of the figure, our model detects vehicles with markedly higher accuracy than the other models. In densely occluded scenes with mixed lighting (Figure 12), attention can be drawn to the areas at the edges of the figure that are prone to occlusion by trees or buildings, where our model still demonstrates clear advantages. Our approach minimizes missed and false detections through synergistic module interactions: SPD-Conv preserves critical details, GCNet resolves contextual ambiguities, and the federated framework enables efficient knowledge distillation. These results underscore the model's efficacy in addressing real-world UAV detection challenges.
These results substantiate the efficacy of the three proposed enhancements: SPD-Conv’s spatial detail preservation, GCNet’s global context modeling, and federated transfer learning’s cross-domain knowledge fusion. Consequently, the FedGTD-UAV system demonstrates exceptional adaptability in reasoning about target occlusions under varying lighting conditions and complex environments, making it an ideal solution for real-time object detection in dynamic UAV applications such as road surveillance.
5. Conclusions
This study demonstrates that the proposed FedGTD-UAV framework represents a significant advancement over pre-existing methods in federated learning for UAV swarms. By systematically addressing the key limitations of prior works—namely, their susceptibility to non-IID data degradation, inefficiency in detecting small targets, and lack of a privacy-preserving collaborative mechanism—our work provides a more robust, accurate, and practical solution for real-world deployment. The framework is specifically designed to address three key challenges in UAV swarm perception: reliable detection of small targets, robustness to occlusions, and resilience to data heterogeneity. By integrating SPD-Conv, GCNet, and federated transfer learning, our architecture delivers superior performance in aerial object detection. Specifically, SPD-Conv enhances detection accuracy by preserving critical spatial details of small targets, while GCNet strengthens robustness against occluded objects through contextual reasoning that effectively captures long-range dependencies. Furthermore, the federated transfer learning mechanism enables efficient knowledge sharing across the swarm, thereby overcoming inherent data heterogeneity issues in distributed systems.
Extensive evaluations demonstrate that FedGTD-UAVs sets new benchmarks across multiple metrics. On the VisDrone2019 dataset, it achieves 44.2% mAP@0.5—a 12.1% improvement over YOLOv8s—and retains over 98% of its performance under extreme non-IID data conditions, in stark contrast to the 56.2% performance degradation observed with isolated training. The framework also supports real-time inference at 217 FPS with 57.6 GFLOPs, yielding a 3.2× better accuracy-efficiency tradeoff. Furthermore, federated aggregation enhances localization precision to 29.5% mAP@0.5:0.95, surpassing centralized methods and acting as an effective regularizer.
Future research efforts will focus on the following key areas: First, we will explore dynamic model pruning for bandwidth-limited swarm communications, Byzantine-robust aggregation mechanisms for contested environments, and cross-modal federated learning incorporating infrared and SAR sensors to enhance the robustness and efficiency of algorithms in complex scenarios. Second, we will combine real-world data collection with synthetic data generation techniques to build multimodal datasets covering diverse weather conditions such as rain, fog, and nighttime, which will be used to train and validate more generalizable models. Third, we will conduct large-scale UAV swarm simulations and ultimately deploy and validate the technology on physical drone platforms to promote its application in mission-critical scenarios like disaster response and perimeter security—where privacy-preserving collaborative perception is essential for success in dynamic operational environments.