This section evaluates the performance of the proposed CM-BRF-ViT framework through extensive experiments conducted on the UAVIDS-2025 and Cyber-Physical datasets. The evaluation focuses on detection accuracy, robustness against Byzantine clients, cross-modal fusion effectiveness, and component-level contributions, as assessed through ablation studies.
4.1. Evaluation Setup and Metrics
The experimental evaluation follows a federated learning setup with K = 10 participating clients and up to 50 communication rounds. The UAVIDS-2025 [2] dataset represents cyber-level UAV network traffic, while the Cyber-Physical dataset includes WiFi and network-level attack scenarios. Detection performance is evaluated using accuracy, F1-score, and Area Under the Curve (AUC). To assess adversarial resilience, Byzantine behavior is simulated by injecting malicious client updates through label-flipping and gradient-noise attacks at varying client ratios (0–40%). All reported results are averaged over multiple runs to ensure stability.
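The two attack types named above can be illustrated with a minimal sketch. It assumes binary labels and a flat list of update coordinates; the function names, the noise scale `sigma`, and the client-selection rule are illustrative assumptions and are not taken from the experimental code.

```python
import random

def flip_labels(labels, num_classes=2):
    """Label-flipping attack: map each label y to (num_classes - 1 - y)."""
    return [num_classes - 1 - y for y in labels]

def add_gradient_noise(update, sigma=1.0, rng=None):
    """Gradient-noise attack: add zero-mean Gaussian noise to every coordinate."""
    rng = rng or random.Random(0)
    return [w + rng.gauss(0.0, sigma) for w in update]

def pick_byzantine_clients(num_clients, ratio, rng=None):
    """Select which client indices act maliciously for a given Byzantine ratio."""
    rng = rng or random.Random(0)
    n_bad = int(round(ratio * num_clients))
    return sorted(rng.sample(range(num_clients), n_bad))

K = 10                                   # number of clients, as in the setup above
bad = pick_byzantine_clients(K, 0.4)     # 40% of clients are malicious
flipped = flip_labels([0, 1, 1, 0])      # poisoned local labels
noisy = add_gradient_noise([0.1, -0.2, 0.3], sigma=0.5)  # poisoned update
```

Sweeping `ratio` over 0.0, 0.1, 0.2, 0.3, and 0.4 reproduces the range of Byzantine ratios considered in the evaluation.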
In the federated learning setup, both the cyber and cyber-physical datasets are partitioned across multiple UAV clients to simulate a realistic distributed environment. Each UAV is treated as an independent federated client and is assigned a local subset of the data corresponding to its operational observations. The data distribution across clients is non-IID, reflecting heterogeneous traffic patterns, sensor readings, and mission conditions encountered by different UAVs. No data samples are shared among clients, and each UAV performs local training on its assigned subset before transmitting model updates to the central server.
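One common way to obtain a non-IID, non-overlapping split is a per-class Dirichlet allocation. The sketch below is an assumption for illustration only: the text does not state which partitioning scheme was used, and the concentration parameter `alpha` and the function name are hypothetical.

```python
import random

def dirichlet_partition(labels, num_clients=10, alpha=0.5, seed=0):
    """Split sample indices across clients with class-skewed proportions."""
    rng = random.Random(seed)
    clients = [[] for _ in range(num_clients)]
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Normalized Gamma draws give one Dirichlet(alpha) sample over the clients.
        weights = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(weights)
        start, cum = 0, 0.0
        for k, w in enumerate(weights):
            cum += w
            # The last client takes the remainder so every sample is assigned once.
            end = len(idxs) if k == num_clients - 1 else int(round(cum / total * len(idxs)))
            clients[k].extend(idxs[start:end])
            start = end
    return clients

labels = [0] * 60 + [1] * 40
parts = dirichlet_partition(labels)
```

Smaller values of `alpha` produce more strongly skewed class mixtures per client; no index appears in more than one client, matching the no-sharing constraint described above.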
All experiments were implemented using Python 3.12.12 within a controlled and unified software environment to ensure full reproducibility and fair comparison. Both the proposed model and baseline methods were implemented using PyTorch 2.9.0 (CUDA 12.6). Supporting libraries, including NumPy 2.0.2, Scikit-learn 1.6.1, and Matplotlib 3.10.0, were employed for data preprocessing, performance evaluation, and result visualization. Training and evaluation were conducted on a GPU-enabled computing platform under identical software and hardware conditions.
The cyber-physical dataset used in this study is distinct from UAVIDS-2025 and consists of WiFi-level and cyber-physical telemetry features collected under both normal operation and cyberattack scenarios. This dataset is publicly available and was originally introduced in [28], which provides detailed information on data collection, feature definitions, and attack scenarios. In contrast, UAVIDS-2025 [2] is used exclusively for cyber-layer UAV network traffic analysis.
4.2. Results on UAVIDS-2025 Dataset
4.2.1. Detection Performance Evaluation
Table 1 summarizes the detection performance on the UAVIDS-2025 dataset. The proposed CM-BRF-ViT framework achieves 97.1% accuracy, significantly outperforming conventional federated baselines. The high detection accuracy indicates that the GAF-based ViT representation effectively captures discriminative network traffic patterns. Compared with FedAvg-based ViT models, CM-BRF-ViT provides an absolute improvement of approximately five percentage points, confirming that attention-based global modeling is beneficial for UAV intrusion detection. The confusion matrix analysis reveals a low false-negative rate, which is particularly important for UAV security scenarios where missed attacks can lead to severe operational consequences. The stable convergence across communication rounds further demonstrates that federated training does not degrade the ViT model’s discriminative capacity when combined with robust aggregation.
Figure 2 presents the comprehensive experimental results of CM-BRF-ViT on the UAVIDS-2025 benchmark dataset, including federated learning convergence, Byzantine-robust client filtering, final performance metrics, and binary-classification confusion matrices.
The inference pseudocode demonstrates a structured and principled approach to processing heterogeneous UAV telemetry and cyber data. By normalizing features, transforming them into GASF images, encoding them with a shared ViT, and applying a learned cross-modal fusion mechanism, the model delivers robust, fully integrated intrusion prediction. The threshold-aware decision rule further ensures applicability in safety-critical UAV environments, where calibrated outputs and transparent decision boundaries are essential.
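The GASF transformation mentioned above can be sketched as follows. This is the standard Gramian Angular Summation Field definition (min-max scaling to [-1, 1], polar encoding, and a cosine-of-sum Gram matrix); the function name is illustrative, and any resizing or channel stacking applied before the ViT is not shown because it is not specified here.

```python
import math

def gasf(series):
    """Gramian Angular Summation Field of a 1-D feature sequence."""
    lo, hi = min(series), max(series)
    span = hi - lo
    if span == 0:
        scaled = [0.0 for _ in series]          # constant input: map to the midpoint
    else:
        scaled = [2.0 * (x - lo) / span - 1.0 for x in series]
    # Clamp against floating-point drift so acos stays defined.
    scaled = [max(-1.0, min(1.0, s)) for s in scaled]
    phi = [math.acos(s) for s in scaled]        # polar-angle encoding
    # G[i][j] = cos(phi_i + phi_j)
    return [[math.cos(a + b) for b in phi] for a in phi]

img = gasf([0.0, 0.5, 1.0])
```

The resulting matrix is symmetric, with entries in [-1, 1], so each feature vector of length n yields an n × n single-channel image.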
Figure 2 presents a detailed evaluation of the proposed CM-BRF-ViT model on the UAVIDS-2025 dataset, using both raw confusion-matrix counts and normalized percentage-level classification outcomes. Together, these visualizations illustrate the model’s robustness in distinguishing between benign (Normal) and malicious (Attack) UAV activities under realistic operational conditions. The left panel reports the absolute prediction counts across the two binary classes. The model correctly classifies: 3834 Normal samples as Normal, and 14,335 Attack samples as Attack.
Only 83 Normal samples were misclassified as Attack (false positives), and 74 Attack samples were misclassified as Normal (false negatives). The minimal number of false negatives is particularly relevant in intrusion detection, where failing to detect an actual attack is significantly more costly than a false alarm. These raw counts demonstrate the model’s capacity to maintain very low error rates across a large test population. The inference procedure of the proposed CM-BRF-ViT model is summarized in Algorithm 1.
| Algorithm 1: Inference Procedure of CM-BRF-ViT
|
| Step | Description |
| Input | |
| Output | |
| 1 | Normalize cyber and cyber-physical features. |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | is provided) |
| Return | |
The right panel of Figure 2 presents normalized percentages, enabling comparisons independent of class imbalance. The model correctly recognizes 97.88% of Normal samples, with only 2.12% false alarms, and 99.49% of Attack samples, with only 0.51% missed detections. The near-perfect classification of attack behaviors highlights the effectiveness of the cross-modal fusion mechanism and the rich GASF–ViT representation, allowing the model to capture subtle deviations in both cyber and cyber-physical channels. The very low false-negative rate further supports the model’s suitability for safety-critical UAV environments.
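The normalized percentages follow directly from the raw counts reported for the left panel. A short sketch of the row-wise normalization, with rows as true classes and columns as predicted classes (the variable names are illustrative):

```python
def row_normalize(cm):
    """Convert raw confusion-matrix counts to per-true-class percentages."""
    return [[round(100.0 * v / sum(row), 2) for v in row] for row in cm]

# Rows: true Normal, true Attack. Columns: predicted Normal, predicted Attack.
counts = [
    [3834, 83],     # Normal: correctly classified, false positives
    [74, 14335],    # Attack: false negatives, correctly classified
]
pct = row_normalize(counts)
# pct == [[97.88, 2.12], [0.51, 99.49]]
```

Because each row is divided by its own class total, the percentages are unaffected by the roughly 1:3.7 imbalance between Normal and Attack samples.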
Figure 3 provides a comprehensive analysis of the attack-probability behavior of the proposed CM-BRF-ViT model across different UAVIDS-2025 attack categories. The four subfigures collectively demonstrate the model’s statistical reliability, its discriminative sharpness between benign and malicious behavior, and its robustness across heterogeneous attack types. The density plot (top left) illustrates a distinct bimodal separation between Normal and Attack samples. Normal instances cluster sharply near zero probability, while Attack instances are concentrated near one, with almost no overlap between the two distributions. The vertical threshold line at 0.5 highlights that the model’s natural probability separation aligns perfectly with the canonical decision boundary, indicating extremely low prediction ambiguity, high confidence for both classes, and strong calibration of the fused probability output.
This clear separability is a strong indicator of effective multimodal representation and successful cross-modal fusion. The box-and-whisker visualization (top right) compares predicted attack probabilities across major UAVIDS-2025 attack categories, including Blackhole, Flooding, Normal Traffic, Sybil, and Wormhole behaviors. All attack categories exhibit high median probabilities near 1.0, indicating consistent detection performance across diverse adversarial patterns. Normal Traffic samples maintain probabilities near 0, demonstrating robust avoidance of false positives. The narrow interquartile ranges for most attack types suggest low variance and high certainty, even in scenarios where signal characteristics vary widely. Occasional outliers are present but remain well above the decision threshold, indicating resilience against noisy or atypical attack signals. Overall, this plot highlights that the model generalizes effectively across multiple attack families without mode collapse or class-specific bias. The mean class probabilities panel (bottom left) summarizes the central tendency and variability of the predicted probabilities across attack categories. Each attack class maintains a high mean attack probability, and Normal instances maintain a low mean attack probability, with controlled variance in both cases, reflecting stable model behavior.
These results align with expectations for a well-calibrated classifier that maintains consistent confidence across heterogeneous temporal and tabular input patterns. The balance between high accuracy and controlled variance is critical for real-world UAV intrusion detection, where uncertainty must be minimized. The ROC curve (bottom right) demonstrates near-perfect separability between the two classes, achieving an AUC of 0.999. This exceptional performance indicates outstanding sensitivity to attack conditions (a high true-positive rate) and strong specificity against false alarms (a low false-positive rate).
The curve remains close to the upper-left corner across the entire threshold spectrum, confirming the model’s robustness to changes in operating thresholds. Such behavior is essential for deployment across UAV systems with varying tolerance levels for false positives.
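The AUC can be read as the probability that a randomly chosen attack sample receives a higher score than a randomly chosen normal sample. The sketch below computes it with this pairwise (Mann–Whitney) formulation on toy data; the scores are illustrative and are not taken from the experiments.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (attack, normal) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(toy_labels, toy_scores)   # 3 of 4 pairs ranked correctly -> 0.75
```

Since this statistic depends only on the ranking of scores, it is independent of the operating threshold, which is why a high AUC indicates robustness to threshold changes.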
Figure 4 presents the federated learning performance of the cyber-physical modality within the CM-BRF-ViT framework. The results capture the convergence behavior of validation and test accuracy (Figure 4A), the ReGCA-based client filtering behavior (Figure 4B), and the final predictive performance achieved after 10 communication rounds (Figure 4C). Together, these plots illustrate the learning dynamics, robustness, and generalization capabilities of the cyber-physical pathway. The accuracy curves (top panel) demonstrate a smooth, stable convergence trajectory across the communication rounds. Accuracy increases from approximately 67–68% in the first round to 72% by round 3, indicating that the cyber-physical modality quickly benefits from shared global knowledge even with decentralized data. Subsequent rounds continue to improve performance, reaching 77.9% validation and 78.5% test accuracy by the final round. The close alignment between the two curves suggests that the model generalizes well and does not overfit or exhibit instability across communication rounds. These trends confirm that the cyber-physical features, though inherently noisier and more variable than purely cyber signals, can be effectively learned in a federated setting using a ViT-based architecture. The ReGCA filtering heatmap (bottom left) indicates that all participating clients remain classified as reliable across all communication rounds. No clients are marked as dropped (red), demonstrating high consistency between local model updates and the trusted server-side reference distribution.
The temporary accuracy drop around communication round 8 is attributed to increased inter-client heterogeneity, particularly the higher variance of the cyber-physical modality updates, and the subsequent recovery reflects the robustness-oriented behavior of the proposed ReGCA aggregation. The fact that no client is filtered suggests that the cyber-physical data held by each client show no adversarial poisoning or severe anomalous deviations that would violate the ReGCA thresholds. The filtering mechanism serves as a safeguard, ensuring that Byzantine or noisy clients, if present, do not propagate harmful updates to the global model.
The absence of dropped clients validates both the integrity of the dataset and the robustness of the federated update process. The bar plot (bottom right) summarizes the final federated performance: 77.9% validation accuracy and 78.5% test accuracy. These values demonstrate that the cyber-physical modality alone is moderately predictive but less discriminative than the fused cross-modal model. This reinforces the paper’s central hypothesis: cyber-physical signals provide complementary information but require fusion with cyber features to achieve high-performance intrusion detection. Notably, the consistency between validation and test accuracy again highlights strong generalization and the absence of overfitting.
Figure 5 presents the binary classification performance of the proposed CM-BRF-ViT model when applied exclusively to the cyber-physical modality of the UAVIDS-2025 dataset. The two subfigures report (Left) raw confusion matrix counts and (Right) normalized percentage-level performance, providing complementary perspectives on the model’s behavior under single-modality evaluation.
Figure 5 shows that the cyber-physical modality provides reliable but not fully sufficient discrimination for intrusion detection. The model maintains low false-positive rates, which is critical for operational UAV systems where false alarms can trigger unnecessary evasive actions or flight interruptions. The increased false-negative rate relative to the cross-modal model confirms the need to integrate cyber features to capture attack signatures that are invisible to physical telemetry alone. These findings validate the design choice behind the CM-BRF-ViT architecture: cyber-physical data provide valuable yet incomplete signals, and optimal performance emerges only when combined with cyber-modality information via a learnable fusion mechanism.
4.2.2. Cross-Modal Fusion Effectiveness
Figure 6 provides an in-depth examination of the attack-probability behavior of the cyber-physical modality across multiple UAVIDS-2025 attack categories. The four subfigures collectively offer insights into separability between normal and attack behavior, inter-class variability, model calibration, and threshold-based discriminative performance.
The density plot (left) reveals a strongly bimodal distribution of predicted attack probabilities. Normal samples cluster distinctly near the lower end of the probability spectrum (close to 0), while attack samples concentrate sharply around 1.0. Notably, there is minimal overlap between the two distributions, indicating high discriminative clarity. The standard decision threshold of 0.5 (dashed line) falls within the gap between the two modes. The distribution suggests the model is well calibrated for the cyber-physical channel, despite the inherent noise and variability of sensor-driven features. This confirms that the learned representation effectively captures the differences between benign flight behavior and adversarial sensor manipulation.
The boxplot (right) compares attack probabilities across several cyber-physical classes: DoS, FDI, Replay, benign, and Evil-Twin-like behaviors. The following patterns emerge: Attack classes (DoS, FDI, Replay, Evil Twin) exhibit median probabilities near 1.0, indicating consistent model confidence across diverse adversarial patterns.
Benign samples retain low median values, with the vast majority falling below the 0.5 decision threshold. Variability in the Replay and FDI classes reflects the temporal and physical complexity of these attack patterns, yet they remain well above the attack threshold.
Outliers are present but do not threaten classification reliability, as they remain clearly separated from the benign distribution.
This plot reinforces the model’s capacity to adapt to multiple cyber-physical attack signatures without degrading performance across classes.
Figure 7 presents a detailed evaluation of the cyber-physical branch of the proposed intrusion detection framework. The mean class probability analysis (left) illustrates that attack classes—DoS, FDI, Replay, and Evil Twin—exhibit consistently higher mean predicted attack probabilities than the benign class. This separation demonstrates that the model effectively captures modality-specific behavioral signatures encoded in cyber-physical telemetry. Although certain attack types (e.g., Replay and Evil Twin) exhibit wider confidence intervals, reflecting their inherent variability, their probability distributions remain distinctly elevated relative to those of benign samples. This indicates robust decision boundaries and low confusion between benign and malicious UAV states.
The ROC curve (right) further quantifies binary detection performance, achieving an AUC of 0.975, which denotes excellent discriminative capability. The curve’s steep rise toward the upper-left region indicates high true-positive rates at very low false-positive levels. This is a critical property in UAV security scenarios, where false alarms can disrupt mission autonomy and missed detections may compromise operational safety. This high AUC demonstrates that even without multimodal fusion, the cyber-physical feature stream alone provides strong separability between benign and attack conditions.
Overall, Figure 7 confirms that the cyber-physical modality provides highly reliable, well-calibrated detection signals, reinforcing its role as an essential component of the CM-BRF-ViT intrusion detection architecture.
Figure 8 presents a comprehensive evaluation of the cross-modal fusion mechanism, demonstrating that the proposed learnable fusion strategy integrates UAV-side cyber signals with cyber-physical telemetry to achieve superior intrusion-detection performance compared with unimodal or fixed-weight baselines.
The cross-modal attack probability space (top left) clearly illustrates the complementary nature of the two modalities. While some samples exhibit high attack probability in only one branch, actual attack instances typically cluster near the top-right region, where both modalities assign high likelihood. This indicates that the UAV-only and cyber-physical-only predictors capture distinct but synergistic aspects of malicious behavior, reinforcing the motivation for learnable fusion. Conversely, benign samples are densely concentrated in the lower-left region, indicating consistent agreement between modalities in normal operational states.
The ROC comparison (top center) further quantifies this complementarity. The UAV-only and CP-only branches achieve AUC values of 0.914 and 0.874, respectively. The fixed-fusion baseline (α = 0.5) achieves an AUC of 0.996, and the proposed learnable cross-modal fusion achieves a comparable AUC of 0.993 while dynamically weighting modality contributions based on the statistical evidence present in each sample, which offers better adaptability across operating points. Both fused variants improve substantially over the unimodal branches, confirming the effectiveness of incorporating cross-modal interactions rather than treating modalities independently.
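The difference between fixed and sample-dependent weighting can be made concrete with a small sketch. It assumes that fusion is a convex combination of the two branch probabilities, p = α·p_uav + (1 − α)·p_cp; the sigmoid gate and its parameters `w` and `b` are hypothetical stand-ins for a learnable fusion layer and are not the architecture used in this work.

```python
import math

def fixed_fusion(p_uav, p_cp, alpha=0.5):
    """Fixed-weight baseline: the same alpha for every sample."""
    return alpha * p_uav + (1.0 - alpha) * p_cp

def gated_fusion(p_uav, p_cp, w=(0.0, 0.0), b=0.0):
    """Hypothetical learnable gate: alpha is a sigmoid of the branch outputs."""
    z = w[0] * p_uav + w[1] * p_cp + b
    alpha = 1.0 / (1.0 + math.exp(-z))
    return fixed_fusion(p_uav, p_cp, alpha)

def decide(p, threshold=0.5):
    """Threshold-based decision on the fused attack probability."""
    return "Attack" if p >= threshold else "Normal"

p_fixed = fixed_fusion(0.9, 0.3)                  # equal weighting of both branches
p_gated = gated_fusion(0.9, 0.3, w=(4.0, -4.0))   # gate leans toward the UAV branch
```

With zero gate parameters the gate outputs α = 0.5 and the two functions coincide; in a trained model, `w` and `b` would be fitted jointly with the rest of the network.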
The fusion method comparison bar chart (top right of subplot cluster) provides a direct visual summary of these findings, with learnable fusion outperforming all alternative approaches. This highlights the model’s ability to exploit nonlinear dependencies between cyber and cyber-physical representations—an ability that fixed or unimodal methods lack.
The precision–recall curve (bottom left) shows that the fused classifier achieves a near-perfect average precision (AP = 0.995), markedly surpassing the UAV-only branch (AP = 0.955). This is particularly important in UAV intrusion scenarios, where the imbalance between benign and malicious traffic can inflate ROC-based metrics; PR curves provide a more sensitive view of precision under such conditions. The near-vertical shape of the fused PR curve demonstrates excellent precision retention even at high recall levels.
The learnable fusion output distribution (bottom center) exhibits a sharply bimodal probability structure, with benign samples tightly clustered near zero and attack samples overwhelmingly concentrated near one. This indicates that the fused classifier produces well-calibrated decision outputs with minimal class overlap—an essential property for operational UAV intrusion detection systems, where uncertainty must be minimized.
Finally, the fused classifier’s confusion matrix (bottom right) quantitatively confirms this behavior. The model yields very low false-positive and false-negative counts, correctly identifying 1507 attack instances while misclassifying only a small fraction of normal samples. These results collectively demonstrate that the proposed learnable fusion method integrates cross-modal information effectively and produces substantially more reliable and discriminative predictions than the unimodal alternatives.
Overall, Figure 8 establishes that cross-modal learning with a trainable fusion module is critical for maximizing detection robustness and highlights the intrinsic complementarity between cyber and cyber-physical feature spaces in UAV intrusion detection.
4.2.3. Ablation and Sensitivity Analysis
Figure 9 presents a comprehensive assessment of the model’s component-wise contributions and its Byzantine-robustness characteristics across increasingly adversarial federated learning settings. Together, these results validate both the architectural design choices of the CM-BRF-ViT framework and the resilience of the proposed ReGCA aggregation strategy. The ablation results (top panel) quantify the incremental performance gains achieved by each architectural element. The baseline FedAvg + MLP configuration exhibits the lowest performance, particularly in the F1-score and AUC, underscoring the limitations of shallow classifiers for UAV intrusion features. Incorporating the Vision Transformer (FedAvg + ViT) yields substantial improvements across all metrics, confirming the importance of transformer-based temporal–spatial encoding.
Introducing cross-modal fusion (FedAvg + ViT + Fusion) provides an additional boost, demonstrating that integrating cyber and cyber-physical modalities leads to a more discriminative and robust feature space. The ReGCA + ViT (Single) configuration further improves performance, underscoring the benefits of Byzantine-robust aggregation even without cross-modal fusion. The complete CM-BRF-ViT model achieves the highest scores—96.2% accuracy, 96.0% F1-score, and 96.8% AUC—demonstrating that the synergy of ReGCA, ViT encoding, and cross-modal fusion forms the most effective architecture.
Overall, the ablation results indicate that each component contributes meaningfully, and full integration yields the strongest intrusion-detection capability.
The robustness experiment (bottom left) examines how test accuracy degrades as the proportion of malicious (Byzantine) clients increases. FedAvg suffers rapid, monotonic degradation, collapsing to 45.2% accuracy at 40% adversarial participation. Trimmed Mean and BDRFA demonstrate moderate resilience but still exhibit notable decreases at high Byzantine ratios.
In contrast, the proposed ReGCA method maintains high stability, achieving 89.6% accuracy even when 40% of clients are adversarial. This indicates that ReGCA effectively suppresses manipulated updates while preserving helpful client contributions, ensuring consistent model performance in hostile federated environments typical of UAV networks.
The divergence between ReGCA and other baselines widens as the threat level increases, confirming that standard aggregation rules are insufficient for UAV systems, which are vulnerable to coordinated poisoning attempts.
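The sensitivity of plain averaging to a single poisoned update, and the partial protection offered by coordinate-wise trimming, can be seen in a toy example. The sketch below uses the standard definitions of both rules; the update values are illustrative and unrelated to the reported experiments.

```python
def fedavg(updates):
    """Coordinate-wise mean of client updates (unweighted FedAvg)."""
    n = len(updates)
    return [sum(u[d] for u in updates) / n for d in range(len(updates[0]))]

def trimmed_mean(updates, trim=1):
    """Coordinate-wise mean after dropping the `trim` lowest and highest values."""
    result = []
    for d in range(len(updates[0])):
        column = sorted(u[d] for u in updates)
        kept = column[trim:len(column) - trim]
        result.append(sum(kept) / len(kept))
    return result

# Four honest clients near 1.0 and one Byzantine client sending 100.0.
updates = [[1.0], [1.2], [0.8], [1.0], [100.0]]
avg = fedavg(updates)              # pulled far from the honest values
robust = trimmed_mean(updates, 1)  # stays close to 1.0
```

Trimming only removes extreme coordinate values, so an adversary whose updates stay within the honest range can still bias the result, which is consistent with the residual degradation observed for Trimmed Mean.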
The degradation analysis (bottom right) quantifies the relative performance drop at a 40% Byzantine ratio. FedAvg exhibits catastrophic vulnerability with a 46.9% accuracy loss, whereas Trimmed Mean and BDRFA reduce the degradation to 19.4% and 12.8%, respectively. ReGCA demonstrates exceptional robustness, with only 6.6% degradation, making it the only method capable of maintaining high performance under severe adversarial conditions.
This result highlights the practical significance of ReGCA for real-world UAV deployments, where communication links and clients cannot be fully trusted. By isolating inconsistent or malicious updates through reliability-aware scoring, ReGCA safeguards model integrity and prevents system-wide collapse.
4.2.4. Byzantine Robustness Analysis
Figure 10 presents a unified overview of the experimental performance of the proposed Cross-Modal Byzantine-Robust Federated Vision Transformer (CM-BRF-ViT) across the UAVIDS-2025 and Cyber-Physical datasets, demonstrating its learning dynamics, classification reliability, and fused cross-modal decision behavior. Panel (A) illustrates the convergence behavior of federated training for both datasets. UAVIDS-2025 exhibits rapid improvement during early communication rounds, stabilizing above 97% accuracy by round 10. The Cyber-Physical dataset follows a smoother trajectory, reaching approximately 78% accuracy. This contrast reflects intrinsic modality differences, yet both curves demonstrate stable training without oscillations—an indication that the proposed ReGCA aggregation effectively suppresses noisy or adversarial updates. Panel (B) compares ROC curves for UAV-only, cyber-physical-only, and random baselines. The UAV modality achieves the highest discrimination capability, with an AUC of 0.994, reflecting the strong separability of attack patterns in UAV telemetry. The cyber-physical modality achieves an AUC of 0.974, confirming the model’s generalization across heterogeneous feature sources. Both far exceed the random baseline (AUC = 0.501).
The ROC curves further indicate that ViT encoders paired with ReGCA aggregation achieve near-optimal detection performance under federated constraints.
UAVIDS-2025 achieves extremely low false-negative (74) and false-positive (83) counts despite large sample sizes, confirming strong sensitivity and specificity.
Cyber-physical results show slightly higher false-negative rates, consistent with the less structured nature of physical sensor traces.
Overall, both unimodal classifiers produce reliable predictions that serve as robust inputs to the cross-modal fusion stage.
Panel (E) demonstrates the effect of learnable cross-modal fusion. The fused classifier substantially reduces misclassification relative to the unimodal systems, yielding only 3 false positives and 20 false negatives. This represents a significant improvement in both precision and recall, illustrating that cyber and cyber-physical cues are complementary and mutually reinforcing when processed jointly.
Panel (F) shows the probability density of fused attack predictions. The distribution is distinctly bimodal, with normal scores concentrated near 0.2 and attack scores near 1.0, separated by a wide margin around the decision threshold (0.5). This sharp separation indicates high confidence and low model uncertainty, confirming that the fusion layer effectively integrates multimodal evidence into a stable, well-calibrated decision boundary.
Panel (G) evaluates the robustness of the proposed ReGCA aggregation mechanism under increasing proportions of Byzantine (malicious) clients, comparing performance against the standard FedAvg baseline.
FedAvg exhibits monotonic degradation, dropping from approximately 95% accuracy to 45.2% when 40% of participating clients are adversarial. This sharp decline highlights FedAvg’s vulnerability to poisoned or inconsistent updates. ReGCA (Ours) consistently maintains performance above 89%, even at the highest Byzantine ratio tested (40%). The near-flat performance curve of ReGCA demonstrates strong resistance to gradient-poisoning attacks, effective filtering of anomalous updates, and stable global convergence under adversarial pressure.
Overall, the results confirm that ReGCA provides substantial resilience: it outperforms FedAvg by over 44 percentage points at a 40% Byzantine ratio (89.6% vs. 45.2%), making it a suitable choice for safety-critical UAVs and cyber-physical systems.
Panel (H) summarizes four core performance dimensions of the CM-BRF-ViT framework:
UAVIDS accuracy: 97.1%. This demonstrates strong detection capability in UAV telemetry data, benefiting from both GASF encoding and ViT-based feature extraction.
Cyber-physical accuracy: 78.5%. This highlights effective generalization to heterogeneous sensor-based intrusion scenarios where attack signatures are more subtle and less structured.
Fused AUC: 99.3%. The extremely high AUC supports the advantage of learnable cross-modal fusion, which leverages complementary evidence across cyber and physical modalities.
Byzantine robustness at 40%: 89.6%. This indicates that the full CM-BRF-ViT pipeline, which combines ViT encoders, fusion layers, and ReGCA aggregation, maintains high predictive quality even under severe adversarial contamination.
Collectively, these results show that CM-BRF-ViT achieves state-of-the-art multimodal intrusion-detection accuracy while demonstrating exceptional robustness against adversarial clients in federated learning settings.
Table 1 demonstrates that CM-BRF-ViT consistently outperforms all baseline federated intrusion detection models across four primary evaluation criteria: cyber-layer accuracy (UAVIDS), cyber-physical accuracy, fused AUC, and Byzantine robustness.
Cyber-layer accuracy (UAVIDS): CM-BRF-ViT achieves 97.1% accuracy, outperforming FedAvg + ViT by 5.0 percentage points. This improvement validates the contribution of GAF-based visual encoding and ViT’s long-range dependency modeling.
Cyber-physical accuracy: By achieving 78.5%, the model surpasses all averaging-based methods, demonstrating that semantic consistency constraints in ReGCA help stabilize representations, even for subtle physical-layer anomalies.
Fused AUC: With an AUC of 0.993, the fused classifier yields almost ideal separability between attack and benign samples. The +0.121 AUC gain over FedAvg + MLP highlights the strength of adaptive cross-modal fusion.
Byzantine robustness: With 89.6% accuracy at 40% malicious clients, CM-BRF-ViT outperforms FedAvg by over 44 percentage points and Trimmed Mean by 17.2 points, demonstrating the effectiveness of joint prediction–feature consistency scoring.
CM-BRF-ViT offers a comprehensive advantage, combining strong predictive performance, strong cross-modal generalization, and exceptional adversarial robustness. None of the baseline approaches delivers strong results across all criteria simultaneously.
Table 2 provides a detailed robustness analysis under increasing proportions of Byzantine adversaries. The results indicate that ReGCA outperforms both FedAvg and Trimmed Mean in all adversarial settings.
At 0% adversaries, ReGCA improves accuracy to 97.1%, confirming that robust aggregation does not harm performance even without adversarial pressure.
At 10–20% Byzantine clients, FedAvg degrades rapidly (85.4% → 78.5%), whereas ReGCA maintains over 94% accuracy, illustrating resilience to moderate poisoning.
At 40% adversarial participation, the most striking scenario, FedAvg collapses to 45.2% and is essentially unusable; Trimmed Mean degrades to 72.4%, showing partial protection; and ReGCA achieves 89.6%, maintaining high reliability.
These differences follow from the aggregation rules themselves. FedAvg aggregates updates agnostically and is easily poisoned. Trimmed Mean removes extreme gradients but fails against feature-level inconsistency attacks. ReGCA uses dual consistency metrics (prediction + embedding space), ensuring that even semantically incorrect updates are filtered, and MAD normalization provides robustness against colluding adversaries.
ReGCA demonstrates state-of-the-art Byzantine resilience, maintaining operational reliability even when 40% of the participating clients are malicious.
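The role of MAD normalization in client filtering can be sketched as follows. The example assumes that each client has already been assigned a scalar inconsistency score (for instance, a combination of prediction and embedding disagreement with a server-side reference); the score values, the 0.6745 scaling constant, and the 3.5 cutoff are conventional choices for robust z-scores and are assumptions here, not the ReGCA settings.

```python
from statistics import median

def mad_filter(scores, threshold=3.5):
    """Return indices of clients whose robust z-score does not exceed the threshold."""
    med = median(scores)
    mad = median(abs(s - med) for s in scores)
    if mad == 0:
        # All scores (nearly) identical: nothing stands out, keep every client.
        return list(range(len(scores)))
    # 0.6745 rescales the MAD so the z-score is comparable to a Gaussian standard score.
    z = [0.6745 * (s - med) / mad for s in scores]
    return [i for i, zi in enumerate(z) if zi <= threshold]

# Hypothetical inconsistency scores: client 4 deviates strongly from the rest.
scores = [0.10, 0.12, 0.09, 0.11, 0.95, 0.10]
kept = mad_filter(scores)   # client 4 is excluded from aggregation
```

Because the median and the MAD are insensitive to a minority of extreme values, a group of colluding clients cannot easily shift the reference statistics in the way they could shift a mean and standard deviation.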
4.3. Performance Comparison
Table 3 provides a structured overview of representative intrusion detection approaches developed for UAV and federated learning environments over the past five years. The studies included cover a broad methodological spectrum, ranging from centralized CNN-based classifiers and distilled lightweight models to federated multi-scale attention architectures.
Many recent UAV intrusion detection studies (e.g., [14,26]) assume access to a centralized dataset. This assumption is inherently inconsistent with realistic UAV deployments, in which data are distributed, sensitive, and often bandwidth-limited. As shown in Table 3, only a small subset of studies adopt federated paradigms [18], yet even these do not consider adversarial resilience or cross-modal feature fusion.
Absence of Byzantine robustness: None of the surveyed UAV-specific or IoT-driven IDS models integrates robust aggregation or adversarial defense mechanisms. Even FL-based IDS relies on the FedAvg paradigm, which is notoriously vulnerable to model poisoning and label flipping. As UAV swarms operate in contested environments and may be targeted by rogue nodes, the absence of Byzantine-tolerant design is a critical research gap.
Existing work considers either cyber traffic (e.g., network logs) or cyber-physical data (e.g., sensor readings), but few studies integrate both modalities, even though the two often carry complementary information in UAV systems: cyberattacks manifest as packet-level anomalies, whereas physical attacks manifest as flight-behavior deviations. Table 3 highlights that no prior work jointly models these modalities through learnable representations.
To assess robustness against Byzantine attacks, the proposed CM-BRF-ViT framework is compared with state-of-the-art federated aggregation strategies, including standard FedAvg, Trimmed Mean, and representative Byzantine-robust aggregation paradigms reported in the literature. This literature spans secure aggregation approaches such as SEAR [21], which leverages trusted execution environments to protect client model privacy while enabling Byzantine resilience, and similarity-based poisoning attacks exemplified by Sine [22], which exploits vulnerabilities in cosine similarity to amplify model poisoning. All methods are evaluated under identical adversarial settings with varying ratios of malicious clients.
The proposed CM-BRF-ViT addresses these gaps through four key innovations:
Cross-modal GAF representations unify cyber and cyber-physical signals into a consistent visual encoding space, an ability not present in earlier studies.
ReGCA-based semantic-level consistency enhances the reliability of federated training by aligning both predictive distributions and latent features across clients.
Byzantine-robust fusion and aggregation ensure stable performance even under 30–40% malicious participation, a robustness not provided by any prior UAV IDS model.
Transformer-based modeling (ViT) provides an expressive backbone for learning modality interactions at scale.
Thus, Table 3 situates this work as the first among the surveyed studies to jointly address privacy, cross-modality, and Byzantine resilience for UAV intrusion detection.
Table 4 summarizes prior research on transformer-based intrusion detection, GAF-based feature encoding, and Byzantine-robust federated learning. The comparison reveals clear methodological fragmentation across three domains (network IDS, time-series conversion, and robust FL), none of which simultaneously addresses the multi-modal and adversarial challenges inherent to UAV systems. Works such as [23] or [24] demonstrate the growing popularity of transformer backbones in intrusion detection. However, all models in this category operate under strictly centralized, single-data-source, and non-federated assumptions. Consequently, they are not directly applicable to UAV networks where decentralization and modality heterogeneity are intrinsic. Studies such as [27,29] successfully leverage GAF transformations but entirely ignore federated training and adversarial robustness. These models highlight the strength of image-based encodings but lack mechanisms to ensure trustworthiness in collaborative environments.
Byzantine-robust aggregation methods (e.g., [8,31]) provide strong theoretical guarantees but do not operate on cross-modal features, integrate semantic-level consistency, or handle the cyber-physical complexity of UAV systems.
The combination of these capabilities is essential for maintaining reliable collaborative detection across non-IID UAV clients under adversarial risk, yet it is absent from the literature. Relative to the works summarized, CM-BRF-ViT introduces the first cross-modal ViT-based GAF intrusion detection model for UAV cyber-physical environments and the first integration of a semantic-level Byzantine defense (ReGCA) into a transformer-driven multimodal IDS.
4.4. System Overhead and Deployment Considerations
This subsection provides an analytical discussion of system-level overhead and deployment-related considerations associated with the proposed CM-BRF-ViT framework. The purpose of this analysis is not to claim real-world deployment readiness, but to clarify the expected communication, storage, and scalability characteristics of the method and to delineate the scope of the current study.
4.4.1. Communication Overhead
In the proposed federated learning setup, communication overhead primarily arises from exchanging model updates between UAV clients and the central aggregator in each communication round. Each participating client transmits a serialized model update (or weight difference) to the server, while the server broadcasts the updated global model to the clients.
The total uplink communication cost per round scales linearly with the number of participating clients and the size of the local model update. Similarly, the downlink cost depends on the size of the global model distributed to clients. Since the proposed ReGCA aggregation operates on received updates without introducing additional model parameters, it does not increase the communication payload beyond standard federated aggregation mechanisms. Therefore, the communication complexity of CM-BRF-ViT remains comparable to conventional federated learning baselines under identical model configurations.
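The linear scaling described above can be illustrated with a back-of-the-envelope calculation. The sketch below assumes uncompressed 32-bit parameters and ignores serialization and protocol overhead; the parameter count in the usage line is a hypothetical placeholder, not the size of the model used in this study.

```python
def round_communication_bytes(num_clients, num_params, bytes_per_param=4):
    """Return uplink, downlink, and total bytes exchanged in one round."""
    payload = num_params * bytes_per_param  # one serialized model or update
    uplink = num_clients * payload          # each client sends one update
    downlink = num_clients * payload        # server sends the global model to each client
    return {"uplink": uplink, "downlink": downlink, "total": uplink + downlink}

# K = 10 clients as in the evaluation setup; 1,000,000 parameters is a placeholder.
cost = round_communication_bytes(num_clients=10, num_params=1_000_000)
print(cost["total"] / 1e6, "MB per round")
```

Since ReGCA adds no parameters to the transmitted update, the same estimate applies to FedAvg and Trimmed Mean under an identical model configuration.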
4.4.2. Storage Requirements
Client-side storage requirements are dominated by the local model parameters and, when applicable, lightweight optimizer states. No additional long-term storage is required at the client level for ReGCA beyond temporary buffers used during local training. On the server side, storage consists of global model parameters and a small set of aggregation-related statistics used to compute consistency-based weights.
Importantly, ReGCA does not require maintaining historical model ensembles or large client reputation histories, which helps limit server-side storage growth. As a result, both client and server storage requirements remain modest and scale primarily with the base model size rather than the number of clients.
4.4.3. Scalability with Increasing Number of Clients
From a scalability perspective, the communication cost of CM-BRF-ViT increases linearly with the number of participating clients per round, which is consistent with standard federated learning paradigms. The computational overhead introduced by ReGCA is associated with evaluating prediction-level and feature-level consistency for each received update. This computation scales linearly with the number of participating clients and depends on the dimensionality of the learned representations and the size of the trusted reference set.
Because the reference set is fixed and shared across rounds, the additional aggregation cost remains stable across training and does not grow with the total number of communication rounds. This design enables CM-BRF-ViT to scale to moderate-sized UAV swarms without introducing superlinear aggregation complexity.
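As a rough sketch of this scaling argument, the per-round aggregation cost can be written as follows. The symbols are illustrative assumptions rather than notation from the original analysis: $K$ is the number of updates received in a round, $|R|$ the size of the trusted reference set, $F$ the cost of one forward pass of a client model on one reference sample (assuming the server evaluates each received model on the reference set), $C$ the number of classes, and $d$ the embedding dimension.

```latex
T_{\mathrm{agg}}(K) \;\approx\; K \cdot |R| \cdot \bigl( F + \mathcal{O}(C) + \mathcal{O}(d) \bigr)
\;=\; \mathcal{O}(K) \quad \text{for fixed } |R|,\ F,\ C,\ d .
```

Under these assumptions the expression contains no dependence on the round index, which matches the observation that the aggregation cost stays constant throughout training.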
4.4.4. Behavior Under Packet Loss and Intermittent Participation
Although this study does not include network-level simulations or hardware-in-the-loop experiments, the expected behavior of the proposed framework under realistic network conditions can be discussed qualitatively. In the presence of packet loss or intermittent client participation, some client updates may fail to reach the server in a given communication round. In such cases, aggregation proceeds using the subset of successfully received updates, which is a standard assumption in federated learning systems.
Under reduced client participation, convergence may slow due to fewer updates contributing to the global model. However, the consistency-based design of ReGCA remains applicable, as aggregation weights are computed only for available updates. Consequently, the framework is expected to degrade gracefully in convergence speed rather than exhibit unstable or catastrophic behavior, provided that a sufficient number of benign clients remain active.
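The aggregation behavior under missing updates can be sketched as follows. This is a simplified illustration under stated assumptions: lost updates are represented as `None`, the consistency-based weights are assumed to be computed beforehand for the clients that did report, and the `min_clients` guard is a hypothetical safeguard that is not described in the original text.

```python
import numpy as np

def aggregate_received(updates, weights, min_clients=1):
    """Aggregate only the updates that reached the server in this round.

    updates: dict mapping client id -> flattened update (np.ndarray), or None if lost.
    weights: dict mapping client id -> non-negative aggregation weight.
    Returns the weighted average, or None if the round should be skipped.
    """
    received = {cid: u for cid, u in updates.items() if u is not None}
    if len(received) < min_clients:
        return None  # too few participants: keep the previous global model
    w = np.array([weights[cid] for cid in received], dtype=float)
    if w.sum() <= 0.0:
        return None  # every received update was rejected
    w = w / w.sum()  # renormalize over the available clients only
    stacked = np.stack([received[cid] for cid in received])
    return np.tensordot(w, stacked, axes=1)
```

Renormalizing over the received subset keeps the aggregate a proper weighted average regardless of how many clients drop out, so reduced participation affects how much information enters each round rather than the scale of the global update.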
4.6. Discussion
The experimental findings demonstrate that CM-BRF-ViT successfully addresses key limitations of existing UAV intrusion detection systems. First, the GAF-ViT architecture enables attention-based global feature modeling without handcrafted feature engineering. Second, the federated learning framework preserves data privacy while maintaining high detection accuracy. Third, the proposed ReGCA mechanism significantly enhances robustness against Byzantine adversaries, a capability absent in most prior UAV IDS solutions.
The performance gap between UAV network traffic and cyber-physical data suggests that further improvements may be achieved by incorporating domain-specific feature enhancement for cyber-physical signals. Nevertheless, the strong cross-modal fusion performance indicates that joint modeling already mitigates this challenge to a large extent. From a deployment perspective, CM-BRF-ViT remains computationally feasible for UAV systems because the ReGCA robustness checks are performed centrally at the server, so clients incur no additional computation beyond local ViT training and inference.
As observed in Figure 4A, a temporary decrease in accuracy occurs around communication round 8. This behavior coincides with the stage at which the federated model begins to incorporate a higher degree of inter-client heterogeneity, particularly from the cyber-physical modality. Compared to cyber-only inputs, cyber-physical data exhibit higher variance and noisier feature distributions, which initially challenge the global model’s consistency.
At the same time, the proposed ReGCA aggregation mechanism actively down-weights client updates that demonstrate lower semantic and predictive consistency with the trusted reference set. This selective filtering suppresses inconsistent or potentially malicious updates, which may temporarily reduce validation accuracy. In subsequent communication rounds, as unreliable updates are progressively filtered out and the global model parameters stabilize, the aggregation process converges toward a more robust solution. Consequently, accuracy recovers and continues to improve.
This behavior highlights the robustness-driven nature of ReGCA, where short-term accuracy fluctuations are an expected and acceptable trade-off for long-term stability and resilience under Byzantine participation.