Securing UAV Swarms with Vision Transformers: A Byzantine-Robust Federated Learning Framework for Cross-Modal Intrusion Detection

Batur Şahin, Canan

doi:10.3390/drones10020125

Open Access Article

by

Canan Batur Şahin

Faculty of Engineering and Natural Sciences, Malatya Turgut Özal University, Malatya 44900, Turkey

Drones 2026, 10(2), 125; https://doi.org/10.3390/drones10020125

Submission received: 6 December 2025 / Revised: 23 January 2026 / Accepted: 6 February 2026 / Published: 11 February 2026

(This article belongs to the Section Artificial Intelligence in Drones (AID))

Highlights

What are the main findings?

The fusion of cyber and cyber-physical modalities enables high-confidence UAV intrusion detection, providing reliable decision-making for safety-critical aerial missions.
The combination of Vision Transformers, GAF encoding, and Byzantine-robust FL offers a scalable, privacy-preserving solution suitable for real-world UAV swarms operating under adversarial conditions.

What are the implications of the main findings?

The newly introduced ReGCA aggregation method significantly improves federated robustness, maintaining 89.6% accuracy even with 40% Byzantine clients, more than 44 percentage points higher than FedAvg.

Abstract

The increasing deployment of uncrewed aerial vehicles (UAVs) in cyber-physical and safety-critical missions has amplified the need for intrusion detection systems that are accurate, privacy-preserving, and resilient to adversarial manipulation. In this paper, we propose CM-BRF-ViT, a Cross-Modal Byzantine-Robust Federated Vision Transformer framework for UAV intrusion detection that jointly addresses heterogeneous attack modeling, distributed learning security, and adaptive decision fusion. The proposed framework integrates Gramian Angular Field (GAF) transformations with Vision Transformer (ViT) architectures to effectively convert tabular network and cyber-physical features into discriminative visual representations suitable for attention-based learning. To enable privacy-preserving collaboration across distributed UAV nodes, CM-BRF-ViT operates within a federated learning paradigm and introduces Reference-GAF Consistency Aggregation (ReGCA). This novel Byzantine-robust aggregation mechanism jointly measures prediction consistency and feature-level semantic consistency using a trusted reference set and MAD-based robust weighting. Unlike conventional defenses that rely solely on parameter-space filtering, ReGCA supervises model updates at both behavioral and representation levels, significantly enhancing robustness against malicious clients. In addition, a learnable cross-modal fusion head is developed to adaptively combine attack probabilities derived from cyber and cyber-physical modalities, allowing the framework to exploit complementary threat signatures across layers. Extensive experiments conducted on the UAVIDS-2025 and Cyber-Physical datasets demonstrate that the proposed method achieves 97.1% detection accuracy for UAV network traffic and 78.5% for cyber-physical data, with a fused detection AUC of 0.993. 
Under adversarial settings, CM-BRF-ViT preserves 89.6% accuracy with up to 40% Byzantine clients, outperforming FedAvg by more than 44 percentage points. Ablation studies further confirm that ReGCA, cross-modal fusion, and ViT-based representation learning contribute complementary performance gains over baseline federated and centralized approaches. These results demonstrate that CM-BRF-ViT provides a robust, adaptive, and privacy-aware intrusion detection solution for UAV systems, making it well-suited for deployment in adversarial and resource-constrained aerial networks.

Keywords:

UAV swarm security; federated learning; robust aggregation; vision transformer; blockchain; intrusion detection; cyberattack detection; Gramian angular field; byzantine-robust

1. Introduction

Nowadays, the rapid advancement of Unmanned Aerial Vehicle (UAV) technology has transformed a wide range of application domains, including surveillance, logistics, disaster management, precision agriculture, and military operations [1]. While UAVs offer significant operational advantages, their increasing deployment density and reliance on open wireless communication protocols have substantially expanded their attack surface, making UAV networks attractive targets for cyber adversaries [2]. The distributed nature of UAV swarms, combined with heterogeneous communication links and dynamic network topologies, introduces security challenges that conventional centralized intrusion detection systems (IDSs) cannot effectively address [3].

Recent studies have emphasized the growing severity of cyber threats targeting UAV ecosystems. Attacks such as GPS spoofing, denial-of-service (DoS), flooding, Sybil, wormhole, and false data injection can severely disrupt UAV operations, leading to mission failure, data leakage, physical collisions, and unauthorized surveillance [3,4]. These risks highlight the pressing need for accurate, resilient, and UAV-specific intrusion-detection mechanisms that can operate under dynamic, adversarial conditions.

To address privacy and scalability concerns in distributed UAV environments, Federated Learning (FL) has emerged as a promising paradigm for collaborative intrusion detection [5]. FL enables multiple UAVs or ground control stations to jointly train a global detection model without exchanging raw traffic data, thereby preserving sensitive operational information and reducing communication overhead. However, the decentralized nature of FL makes it inherently vulnerable to Byzantine adversaries, which may inject poisoned or manipulated model updates to degrade global performance [6,7]. Ensuring robustness against such adversarial behavior remains a fundamental challenge for federated UAV security systems [8,9].

In parallel, recent advances in deep learning have demonstrated the effectiveness of transformer-based architectures, particularly Vision Transformers (ViTs), in capturing long-range dependencies through self-attention mechanisms [10]. Because ViTs operate on images, applying them to tabular security data first requires transforming feature vectors into two-dimensional representations. Among such transformations, the Gramian Angular Field (GAF) provides a mathematically principled method for encoding one-dimensional feature sequences into two-dimensional images, preserving temporal and structural correlations [11]. Existing studies have shown that GAF-based representations, when combined with deep learning, can significantly enhance intrusion detection performance in centralized settings [11]. Deep learning-based UAV intrusion detection systems have also demonstrated strong performance in centralized environments [12].

However, the combination of GAF-based representations and ViTs for federated UAV intrusion detection remains largely unexplored. Moreover, most existing UAV IDS solutions focus on a single data modality (either network traffic or cyber-physical signals) and neglect the complementary nature of multi-layer attack evidence [13]. Cross-modal fusion has recently been recognized as an effective strategy for improving detection robustness by jointly learning from heterogeneous information sources [13]. Nevertheless, adaptive cross-modal fusion within a Byzantine-robust federated framework for UAV security remains underexplored.

To bridge these gaps, this paper proposes CM-BRF-ViT, a cross-modal, Byzantine-robust federated vision transformer framework for UAV intrusion detection. The proposed approach jointly addresses (i) cross-modal intrusion modeling using cyber and cyber-physical data, (ii) privacy-preserving federated learning with enhanced resilience to Byzantine attacks, and (iii) adaptive fusion of heterogeneous intrusion evidence.

The main contributions of this work are summarized as follows:

We introduce a GAF–ViT-based representation framework that transforms tabular UAV network and cyber-physical features into unified 32 × 32 visual representations, enabling attention-based intrusion detection and achieving 97.1% detection accuracy on the UAVIDS-2025 dataset [2].
We propose Reference-GAF Consistency Aggregation (ReGCA), a novel Byzantine-robust federated aggregation strategy that jointly evaluates prediction consistency and feature-level semantic consistency on a trusted reference dataset. The proposed method maintains 89.6% accuracy even with 40% malicious clients, significantly outperforming classical federated aggregation schemes [6,7,9].
We develop a learnable cross-modal fusion mechanism that adaptively combines cyber and cyber-physical intrusion probabilities, achieving a near-perfect AUC of 0.993, and outperforming fixed-weight fusion baselines, consistent with observations reported in multimodal learning studies [13].
We conduct extensive experimental evaluations and ablation studies on UAVIDS-2025 and Cyber-Physical datasets, demonstrating a cumulative 7.9% performance improvement over baseline federated IDS methods.

Early work on UAV intrusion detection (2020 to 2025) primarily focused on deep learning-based centralized intrusion detection, whereas more recent studies have investigated federated learning (FL), transformer architectures, and time-series-to-image encodings. This subsection reviews (i) UAV-specific intrusion detection systems, (ii) federated and distributed IDS for UAV and related aerial networks, (iii) transformer/ViT-based and Gramian Angular Field (GAF)-based intrusion detection, and (iv) Byzantine-robust FL aggregation methods that motivate our proposed ReGCA mechanism.

Several recent studies have proposed deep-learning-based intrusion detection systems explicitly tailored to UAV networks. Whelan et al. provided a comprehensive taxonomy and comparative review of AI-enhanced intrusion detection systems for UAVs, highlighting that most existing approaches rely on centralized training over network traffic and telemetry data, with limited consideration of adversarial robustness and privacy-preserving mechanisms [14]. To address UAV resource constraints, Medhi et al. proposed UAV-DiPNID, a lightweight IDS based on network distillation and pruning, which significantly reduces computational complexity while maintaining competitive accuracy [15]. To address severe class imbalance in airborne UAV datasets, Lin et al. introduced an improved stratified sampling and ensemble learning (ISSEL) framework that enhances minority-class representation via distance-based sampling [16]. While these methods demonstrate that deep learning can effectively detect UAV intrusions, they predominantly assume centralized data access, do not consider federated learning, and offer limited robustness against poisoned or malicious updates. Representative UAV-oriented IDS studies from 2020 to 2025 are summarized in Table 1.

In parallel, federated learning has been explored as a privacy-preserving paradigm for intrusion detection in dynamic aerial and UAV networks. Ceviz et al. proposed FL-IDS, a federated IDS for Flying Ad Hoc Networks (FANETs), demonstrating that FL can achieve detection performance comparable to that of centralized models while avoiding the exchange of raw data among UAVs [17]. Banjar et al. introduced FL-DMAN, a federated dynamic multi-scale attention network designed for drone communication security, showing improved real-time detection under bandwidth and resource constraints [18]. Lu proposed a swarm anomaly-detection framework for IoT-enabled UAV networks that integrates a multimodal denoising autoencoder with federated learning, emphasizing scalability and decentralization for large UAV swarms [19]. At a broader level, survey studies on FL-enabled intrusion detection highlight both the benefits of decentralized learning and important security vulnerabilities (e.g., poisoning and adversarial risks), motivating the need for robust aggregation mechanisms in adversarial settings [20]. Although these studies establish FL as a viable solution for UAV security, they do not integrate advanced cross-modal representations, Vision Transformers, or explicit Byzantine-robust aggregation strategies.

Recent state-of-the-art Byzantine-robust federated learning methods have explored different robustness principles. Zhao et al. [21] proposed SEAR, a secure and efficient aggregation protocol that leverages trusted execution environments to protect client model privacy while enabling Byzantine resilience at the aggregation and communication layers. More recently, Kasyap and Tripathy [22] proposed Sine, which mitigates local model poisoning by measuring similarity between client updates, showing that naive similarity metrics are insufficient under adaptive attacks.

While these approaches provide strong defenses in general federated learning settings, they primarily operate in parameter space or rely on similarity-based filtering without explicitly considering semantic consistency in learned representations. In contrast, the proposed ReGCA mechanism evaluates client updates using both prediction-level consistency and feature-level semantic consistency on a trusted reference set. This dual-consistency strategy enables finer-grained detection of malicious behavior, particularly in heterogeneous and cross-modal UAV intrusion-detection scenarios.

Transformers and Vision Transformers (ViTs) have recently attracted increasing attention for intrusion detection beyond UAV-specific contexts. Manocchio et al. proposed FlowTransformer, a transformer-based framework for flow-level intrusion detection, showing that attention mechanisms effectively capture long-range dependencies in network flows and achieve state-of-the-art performance on benchmark NIDS datasets [23].

Zhou et al. introduced HiViT-IDS, which converts one-dimensional network traffic into images and applies a ViT architecture, reporting detection accuracies exceeding 99% on ToN-IoT and Edge-IIoTset datasets [24]. Ding and Wang further demonstrated that packet-sequence-to-image representations, combined with ViT classifiers, can significantly enhance intrusion-detection performance, underscoring the representational power of transformer-based models [25]. Recent cross-layer convolutional attention approaches also confirm that combining local and global context modeling improves the detection of complex and stealthy attacks in drone networks [26]. Time-series-to-image encoding techniques, such as the Gramian Angular Field (GAF), have been proposed to bridge tabular or temporal security data with image-based deep learning. Terzi introduced a GAF-based intrusion detection framework that encodes network traffic features into GAF images and classifies them using CNNs, achieving competitive performance with a relatively simple pipeline [27].

The remainder of this manuscript is structured as follows. Section 2 presents material and methods. Section 3 describes the proposed CM-BRF-ViT framework, including the cross-modal data representation, Vision Transformer architecture, and Byzantine-robust aggregation strategy. Section 4 presents and discusses the experimental results, including ablation studies and robustness analyses. Finally, Section 5 concludes the paper and outlines directions for future research.

2. Materials and Methods

2.1. Problem Definition

We consider the problem of intrusion detection in distributed UAV networks as a supervised classification task under a federated learning setting with adversarial participants. Let D = {D^UAV, D^CP} denote the cross-modal dataset, where D^UAV represents cyber-layer UAV network traffic data and D^CP represents cyber-physical data derived from WiFi and network-level interactions.

Each data sample is associated with a label y ∈ {0, 1}, where y = 0 denotes benign traffic, and y = 1 denotes malicious behavior. The learning objective is to train a classifier f(x; θ): x → ŷ that accurately detects intrusions while satisfying the following constraints:

Privacy preservation: raw data remain local to UAV clients;
Byzantine robustness: the global model must be resilient to adversarial updates;
Cross-modal generalization: detection must leverage both cyber and cyber-physical attack manifestations.

Cross-Modal Data Representation via Gramian Angular Field

Both UAV network traffic and cyber-physical features are originally represented as tabular feature vectors. Let x = [x_1, x_2, …, x_N] ∈ R^N denote a normalized feature vector for a single sample, where x_1, x_2, …, x_N are the individual feature values and R^N is the N-dimensional real space.

Here, N denotes the dimensionality of the input feature vector, i.e., the total number of features provided by the underlying dataset. In the cyber dataset, N represents the complete set of network traffic and protocol-level features as defined in the dataset specification. In the cyber-physical dataset, N denotes the combined set of cyber features and physical sensor measurements associated with UAV operations. In both cases, all available features are used without manual feature selection, and feature normalization is applied before Gramian Angular Field (GAF) transformation. For the cyber modality, each sample is represented by an N_cyber-dimensional feature vector, where N_cyber = 22 corresponds to protocol-level and traffic-based features defined in the UAVIDS-2025 dataset. For the cyber-physical modality, each sample is represented by an N_cp-dimensional feature vector, where N_cp = 37 includes WiFi-level and cyber-physical telemetry features. Unless otherwise stated, N denotes the modality-specific feature dimensionality.

To enable the use of image-based deep learning models, each feature vector is transformed into a two-dimensional representation using the Gramian Angular Summation Field (GASF). This transformation consists of three steps.

2.2. Feature Normalization

Each feature is first scaled into the interval [-1, 1]:

\tilde{x}_{i} = 2 \cdot \frac{x_{i} - \min(x)}{\max(x) - \min(x)} - 1

(1)

where \tilde{x}_{i} is the normalized feature value, x_{i} is the original feature value, and \min(x) and \max(x) are the minimum and maximum values in the feature vector.

Polar coordinate encoding: The normalized features are mapped into angular values:

ϕ_{i} = \arccos(\tilde{x}_{i}), \quad ϕ_{i} \in [0, π]

(2)

2.3. GASF Construction

For normalized features mapped to angular values ϕ_{i}, the Gramian Angular Summation Field (GASF) is defined as

GASF(i, j) = \cos(ϕ_{i} + ϕ_{j})

(3)

The resulting GASF matrix is resized to a fixed resolution of 32 \times 32, yielding a compact 2D representation I \in R^{32 \times 32 \times 1}.

This transformation preserves pairwise feature correlations and enables unified visual representations across cyber and cyber-physical modalities.
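The three steps above (scaling, polar encoding, and GASF construction) can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the feature values are hypothetical, the handling of a constant feature vector is an assumption, and the nearest-neighbour resize is also an assumption, since the text does not state which interpolation is used to reach 32 × 32.

```python
import math

def min_max_scale(x):
    """Scale a feature vector into [-1, 1], following Eq. (1)."""
    lo, hi = min(x), max(x)
    if hi == lo:
        # Constant vector: Eq. (1) is undefined, so map to 0 (an assumption).
        return [0.0] * len(x)
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in x]

def gasf(x):
    """Return the matrix with entries cos(phi_i + phi_j), following Eqs. (2)-(3)."""
    scaled = min_max_scale(x)
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    phi = [math.acos(max(-1.0, min(1.0, v))) for v in scaled]
    return [[math.cos(a + b) for b in phi] for a in phi]

def resize_nearest(matrix, size=32):
    """Nearest-neighbour resize to size x size (interpolation is an assumption)."""
    n = len(matrix)
    idx = [min(n - 1, (k * n) // size) for k in range(size)]
    return [[matrix[i][j] for j in idx] for i in idx]

features = [0.2, 1.5, 3.0, 0.7, 2.2]  # hypothetical raw feature values
g = gasf(features)                    # 5 x 5 GASF matrix
image = resize_nearest(g, 32)         # 32 x 32 single-channel image
```

Because cos(ϕ_i + ϕ_j) is symmetric in i and j, the resulting matrix is symmetric, and the diagonal entries for the minimum and maximum features both equal 1.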

2.3.1. Vision Transformer Backbone

Each GASF image is processed using a Vision Transformer (ViT) architecture designed for lightweight yet expressive representation learning.

2.3.2. Patch Embedding

Given an input GASF image x \in R^{H \times W \times C}, the image is divided into non-overlapping patches of size P \times P. This yields

N_{p} = \frac{H W}{P^{2}}

(4)

patches. Each patch p_{i} \in R^{P^{2} C} is flattened and projected into a D-dimensional embedding space:

z_{i} = p_{i} W_{e} + b_{e}, \quad z_{i} \in R^{1 \times D}

(5)

All patch embeddings are concatenated to form

Z = [z_{1}; z_{2}; \dots; z_{N_{p}}] \in R^{N_{p} \times D}

(6)

A learnable positional embedding E_{pos} \in R^{N_{p} \times D} is added:

Z_{0} = Z + E_{pos}

(7)
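As a rough illustration of Eqs. (4)-(7), the sketch below splits a 32 × 32 single-channel image into 4 × 4 patches and applies a linear projection plus positional embeddings. The embedding width D = 8, the random initialization, and the helper names `patchify` and `embed` are assumptions made for the example; in a trained model W_e, b_e, and E_pos would be learned.

```python
import random

def patchify(image, patch=4):
    """Split an H x W single-channel image into flattened non-overlapping patches."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            patches.append([image[r + i][c + j]
                            for i in range(patch) for j in range(patch)])
    return patches

def embed(patches, w_e, b_e, e_pos):
    """Apply z_i = p_i W_e + b_e (Eq. 5), then add positional embeddings (Eq. 7)."""
    d = len(b_e)
    z0 = []
    for p, pos in zip(patches, e_pos):
        z = [sum(p[m] * w_e[m][n] for m in range(len(p))) + b_e[n] for n in range(d)]
        z0.append([z[n] + pos[n] for n in range(d)])
    return z0

rng = random.Random(0)
H = W = 32
P = 4
D = 8  # hypothetical embedding width
image = [[rng.uniform(-1.0, 1.0) for _ in range(W)] for _ in range(H)]
patches = patchify(image, P)                     # N_p = H*W / P^2 = 64 patches
w_e = [[rng.gauss(0.0, 0.02) for _ in range(D)] for _ in range(P * P)]
b_e = [0.0] * D
e_pos = [[rng.gauss(0.0, 0.02) for _ in range(D)] for _ in range(len(patches))]
z0 = embed(patches, w_e, b_e, e_pos)             # N_p x D token sequence
```

With H = W = 32 and P = 4, Eq. (4) gives N_p = 1024 / 16 = 64 tokens, each built from a flattened patch of P²C = 16 values.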

2.3.3. Transformer Encoder

The encoded sequence is processed by L Transformer blocks:

Z_{l+1} = \mathrm{MLP}(\mathrm{LN}(Z_{l} + \mathrm{MSA}(\mathrm{LN}(Z_{l}))))

(8)

Within each block, the input is layer-normalized (LN) and passed through multi-head self-attention (MSA); the result is added back to Z_{l} through a residual connection, and the sum is layer-normalized again and fed to a multilayer perceptron (MLP). Each MSA operation is defined as

\mathrm{MSA}(Z) = \mathrm{Concat}(\mathrm{head}_{1}, \dots, \mathrm{head}_{h}) W^{O}

(9)

where

\mathrm{head}_{i} = \mathrm{softmax}\left(\frac{Q_{i} K_{i}^{T}}{\sqrt{d_{k}}}\right) V_{i}

(10)
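To make Eq. (10) concrete, the following sketch computes a single attention head on a toy two-token sequence. The query, key, and value matrices are hypothetical numbers chosen for readability; in the model they would be learned linear projections of Z, and several heads would be concatenated as in Eq. (9).

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Hypothetical two-token example with d_k = 2.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out, weights = attention_head(q, k, v)
```

Each row of `weights` sums to one, so every output token is a convex combination of the value vectors. This is how a token derived from one feature patch can draw on information from any other patch.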

2.3.4. Classification Head

The final embedding is aggregated via global average pooling (GAP) and passed through a softmax classifier:

ŷ = \mathrm{softmax}(W_{c} \cdot \mathrm{GAP}(Z_{L}))

(11)
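A small numeric sketch of Eq. (11) follows. The token values and the classifier matrix `w_c` are hypothetical, and the two output classes are taken to be benign and malicious as in the problem definition.

```python
import math

def global_average_pool(tokens):
    """Average the token embeddings along the sequence dimension."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]

def softmax(values):
    """Numerically stable softmax over a vector of logits."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def classify(tokens, w_c):
    """Compute softmax(W_c . GAP(Z_L)), with w_c of shape (classes x D)."""
    pooled = global_average_pool(tokens)
    logits = [sum(w * p for w, p in zip(row, pooled)) for row in w_c]
    return softmax(logits)

z_final = [[1.0, 3.0], [3.0, 5.0]]   # hypothetical encoder output, 2 tokens, D = 2
w_c = [[1.0, 0.0], [0.0, 1.0]]       # hypothetical classifier weights
pooled = global_average_pool(z_final)
probs = classify(z_final, w_c)       # [P(benign), P(malicious)]
```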

2.3.5. Federated Learning Setting

We consider a federated learning environment comprising K UAV clients:

C = \{1, 2, \dots, K\}

Each client k holds a private dataset D_{k} and locally trains parameters θ_{k}^{t} at communication round t. Local optimization is performed using stochastic gradient descent:

θ_{k}^{t} \leftarrow θ_{k}^{t-1} - η \nabla L(D_{k}; θ_{k}^{t-1})

(12)

To support robust aggregation under adversarial conditions, a small, trusted reference dataset, D_{ref}, is maintained solely on the server. This dataset comprises benign samples collected under normal operating conditions and is used exclusively on the server side for consistency evaluation. The reference dataset is constructed from trusted data sources and does not include any adversarial or poisoned samples. Importantly, D_{ref} is never shared with federated clients and is not used for model training. Instead, it serves as a stable reference for assessing the prediction-level and feature-level consistency of client updates during aggregation.

2.4. Robust-GAF Aggregation

To defend against malicious clients, we introduce Reference-GAF Consistency Aggregation (ReGCA), which evaluates client updates through dual consistency measures.

2.4.1. Prediction Consistency

The KL divergence between global and client predictive distributions is

D_{k}^{pred} = \frac{1}{|D_{ref}|} \sum_{x_{i} \in D_{ref}} \mathrm{KL}\left(p(y \mid x_{i}; θ^{t}) \,\|\, p(y \mid x_{i}; θ_{k}^{t})\right)

(13)

2.4.2. Feature Consistency

Feature-level divergence is computed using the penultimate layer embedding f(\cdot):

D_{k}^{feat} = \frac{1}{|D_{ref}|} \sum_{x_{i} \in D_{ref}} \| f(x_{i}; θ^{t}) - f(x_{i}; θ_{k}^{t}) \|_{2}^{2}

(14)
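The two divergences in Eqs. (13) and (14) can be illustrated on a toy reference set. In this sketch the predictive distributions and embeddings are hypothetical hand-written values standing in for model outputs on D_ref, and the small `eps` term is an assumption added only to avoid taking the logarithm of zero.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def prediction_divergence(global_probs, client_probs):
    """Eq. (13): mean KL(global || client) over the reference samples."""
    total = sum(kl_divergence(g, c) for g, c in zip(global_probs, client_probs))
    return total / len(global_probs)

def feature_divergence(global_feats, client_feats):
    """Eq. (14): mean squared L2 distance between embeddings."""
    total = 0.0
    for g, c in zip(global_feats, client_feats):
        total += sum((a - b) ** 2 for a, b in zip(g, c))
    return total / len(global_feats)

# Hypothetical outputs on a two-sample reference set.
global_probs = [[0.9, 0.1], [0.8, 0.2]]
honest_probs = [[0.88, 0.12], [0.79, 0.21]]
poisoned_probs = [[0.1, 0.9], [0.2, 0.8]]   # e.g. after label flipping

d_honest = prediction_divergence(global_probs, honest_probs)
d_poisoned = prediction_divergence(global_probs, poisoned_probs)
d_feat = feature_divergence([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 3.0]])
```

A client whose predictions track the global model yields a divergence near zero, whereas a client trained on flipped labels yields a much larger value. This gap is the signal the aggregation step relies on.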

2.4.3. ReGCA Score

The combined consistency score is

S_{k} = α D_{k}^{pred} + β D_{k}^{feat}

(15)

with the constraint α + β = 1.

Robust normalization is applied using the median absolute deviation (MAD):

z_{k} = \frac{S_{k} - \mathrm{median}(S)}{1.4826 \cdot \mathrm{MAD}(S)}

(16)

Client weights are then assigned as

w_{k} = \begin{cases} \exp(-\max(0, z_{k})), & \text{if } z_{k} \leq τ \\ 0, & \text{otherwise} \end{cases}

(17)

where τ is a robustness threshold.

The final ReGCA score for client k is

S_{k}^{ReGCA} = α \cdot \mathrm{KL}(p_{k}, p_{global}) + (1 - α) \| e_{k} - e_{global} \|_{2}

(18)
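The scoring and weighting steps in Eqs. (15)-(17) can be sketched as follows. The values α = 0.5 and τ = 3.0, the toy divergences, and the function name `regca_weights` are assumptions made for the example, since this section does not report the settings used. Dividing by the sum of the weights is likewise an assumed reading of "normalized weights", and the tiny fallback scale guards against a zero MAD.

```python
import math
from statistics import median

def regca_weights(d_pred, d_feat, alpha=0.5, tau=3.0):
    """Turn per-client divergences into normalized aggregation weights."""
    beta = 1.0 - alpha
    scores = [alpha * p + beta * f for p, f in zip(d_pred, d_feat)]   # Eq. (15)
    med = median(scores)
    mad = median([abs(s - med) for s in scores])
    scale = 1.4826 * mad
    if scale == 0.0:
        scale = 1e-12  # assumption: avoid division by zero when scores coincide
    z = [(s - med) / scale for s in scores]                           # Eq. (16)
    raw = [math.exp(-max(0.0, zk)) if zk <= tau else 0.0 for zk in z] # Eq. (17)
    total = sum(raw)
    return [w / total for w in raw]

# Hypothetical divergences for five clients; the last one behaves maliciously.
d_pred = [0.02, 0.03, 0.025, 0.04, 2.5]
d_feat = [0.10, 0.12, 0.11, 0.15, 9.0]
weights = regca_weights(d_pred, d_feat)
```

Clients at or below the median score keep full weight, clients above it are attenuated exponentially, and any client whose robust z-score exceeds τ is excluded. Because at least half of the clients have z_k ≤ 0, the weight sum is never zero for τ ≥ 0.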

2.4.4. Global Update

The global model is updated via weighted aggregation:

θ^{t+1} = θ^{t} + \sum_{k} \tilde{w}_{k} (θ_{k}^{t} - θ^{t})

(19)

where \tilde{w}_{k} are normalized weights.
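Eq. (19) amounts to moving the global parameters along a weighted average of the client deltas. The sketch below treats each model as a flat list of floats, which is a simplification made for illustration, and uses hypothetical parameter values and weights.

```python
def aggregate(theta_global, client_thetas, weights):
    """Apply theta^{t+1} = theta^t + sum_k w_k (theta_k^t - theta^t)."""
    new_theta = list(theta_global)
    for theta_k, w in zip(client_thetas, weights):
        for i, (tk, tg) in enumerate(zip(theta_k, theta_global)):
            new_theta[i] += w * (tk - tg)
    return new_theta

theta_global = [0.0, 1.0]
client_thetas = [
    [1.0, 1.0],       # honest client
    [3.0, 1.0],       # honest client
    [100.0, -50.0],   # hypothetical poisoned update
]
weights = [0.5, 0.5, 0.0]   # normalized weights; the poisoned client gets zero
theta_next = aggregate(theta_global, client_thetas, weights)
```

With a zero weight, the poisoned update has no influence on the result, whereas a uniform average would have been pulled far from the honest clients.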

2.5. Learnable Cross-Modal Fusion

Let p^{UAV} and p^{CP} denote the attack-probability vectors produced by the cyber and cyber-physical models. These are combined through a learnable fusion network:

p_{fused} = g(p^{UAV}, p^{CP}; θ_{f})

(20)

where g(\cdot) is a parametric MLP:

g(P) = σ(W_{2} \cdot \mathrm{ReLU}(W_{1} P))

(21)

with P = [p^{UAV} ∥ p^{CP}] the concatenated probability vector, W_{1}, W_{2} the learnable weights, and σ(\cdot) the sigmoid activation.

This adaptive fusion compensates for modality imbalance and exploits complementary cyber and physical cues. The final detection probability is

p_{final} = g(p^{UAV}, p^{CP})

(22)

This formulation enables the network to learn nonlinear interactions among modalities and adaptively weight each modality under different attack conditions.
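A forward pass through the fusion network of Eq. (21) can be sketched as below. The hidden width of three units, the hand-picked weight matrices, and the example probability vectors are all hypothetical; in practice W_1 and W_2 would be learned, and the equation as written has no bias terms, so none are included here.

```python
import math

def matvec(w, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in w]

def relu(v):
    return [max(0.0, x) for x in v]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse(p_uav, p_cp, w1, w2):
    """Compute g(P) = sigmoid(W2 . ReLU(W1 P)) with P = [p_uav || p_cp]."""
    p = list(p_uav) + list(p_cp)          # concatenation of the two vectors
    hidden = relu(matvec(w1, p))
    return [sigmoid(o) for o in matvec(w2, hidden)]

p_uav = [0.05, 0.95]   # hypothetical [benign, attack] output, cyber model
p_cp = [0.60, 0.40]    # hypothetical [benign, attack] output, cyber-physical model
w1 = [[0.0, 1.0, 0.0, 0.0],
      [0.0, 0.0, 0.0, 1.0],
      [1.0, 0.0, 1.0, 0.0]]
w2 = [[2.0, 1.0, -1.0]]
p_fused = fuse(p_uav, p_cp, w1, w2)
```

In this toy setting the cyber model is confident of an attack while the cyber-physical model leans benign. The hand-picked weights trust the cyber evidence more, so the fused probability stays above 0.5.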

In UAV intrusion scenarios, certain types of attacks may manifest clearly in only one modality (e.g., cyber anomalies vs. physical telemetry deviations). The fusion network, implemented as a lightweight multilayer perceptron, therefore learns how much each modality should contribute to the final decision under different attack conditions, providing a flexible and robust means of integrating cyber and cyber-physical attack probabilities into a unified prediction.

Cyber and cyber-physical UAV features are independently transformed into Gramian Angular Field (GAF) images and processed by modality-specific Vision Transformer (ViT) backbones. Model training is conducted in a federated learning setting, where a central server performs Byzantine-robust aggregation using the proposed ReGCA mechanism. The resulting attack probabilities are adaptively combined through a learnable cross-modal fusion head to produce the final intrusion decision. The proposed model offers several key strengths over existing UAV intrusion detection approaches:

Cross-Modal Awareness: By jointly modeling cyber and cyber-physical information, CM-BRF-ViT captures complementary attack signatures that single-modality detectors miss.
Transformer-Based Global Context Modeling: The ViT backbone enables long-range dependency learning across feature dimensions, leading to superior discrimination of complex and stealthy UAV attacks.
Byzantine-Robust Federated Learning: The proposed ReGCA mechanism significantly enhances robustness against poisoned or malicious client updates, maintaining high accuracy even with up to 40% Byzantine participation.
Privacy Preservation: Raw UAV traffic and cyber-physical data never leave local devices, making the framework suitable for sensitive and mission-critical deployments.
Adaptive Decision Fusion: The learnable fusion head dynamically balances modality importance, avoiding rigid fusion rules and improving reliability under heterogeneous attack scenarios.
Deployment Efficiency: All Byzantine robustness computations are performed on the server, allowing UAV endpoints to remain lightweight and resource-efficient.

In summary, CM-BRF-ViT introduces a holistic UAV intrusion detection architecture that unifies cross-modal representation learning, Vision Transformer modeling, and Byzantine-robust federated optimization.

3. Proposed Model

This section presents the proposed CM-BRF-ViT (Cross-Modal Byzantine-Robust Federated Vision Transformer) framework, designed to provide accurate, privacy-preserving, and adversary-resilient intrusion detection for UAV networks. The model integrates cross-modal feature learning, Vision Transformer-based representation, Byzantine-robust federated aggregation, and adaptive decision fusion within a unified end-to-end framework.

3.1. Architectural Overview

Figure 1 provides a structured overview of the proposed CM-BRF-ViT architecture, highlighting the interaction between client-side cross-modal processing and server-side Byzantine-robust federated aggregation. The diagram clearly emphasizes three conceptual stages—local multimodal encoding, learnable fusion, and federated robustness—which collectively enable accurate and secure intrusion detection in UAV networks. Each UAV client processes its own cyber and/or cyber-physical data streams locally. At the same time, model aggregation and consistency evaluation are performed centrally at the federated server without sharing raw data. Each UAV device independently processes two heterogeneous data streams: cyber modality (e.g., network traffic or protocol-level statistics) and cyber-physical modality (e.g., telemetry, flight dynamics, sensor signals).

Both modalities first undergo tabular preprocessing, followed by GASF transformation, which converts temporal feature sequences into three compact matrices. Specifically, the three matrices are (i) the feature similarity matrix derived from latent representations, (ii) the prediction consistency matrix computed from model outputs on the reference dataset, and (iii) the aggregation weight matrix used to scale client updates during the ReGCA aggregation process. This transformation unifies the modalities into a common 2-D representation, enabling the use of scalable vision architectures.

A shared-weight Vision Transformer (ViT) is applied to each GASF image. Sharing parameters across modalities enforces a consistent representational structure, eliminates redundant learned filters, and improves generalization while reducing communication overhead. The ViT outputs modality-specific attack-probability vectors: p^{UAV} for the cyber modality and p^{CP} for the cyber-physical modality.

This symmetric processing pipeline ensures that both modalities are encoded with equal expressiveness while capturing complementary anomaly signatures. The two probability vectors are merged using a parametric fusion network, g(\cdot), illustrated at the center of the figure. By modeling nonlinear relationships between modalities, this layer adaptively weights cyber versus cyber-physical information based on attack characteristics or noise conditions. The output, p_{fused}, represents the final local intrusion decision and is also used to update the local model parameters before communicating with the server.

This design explicitly captures cross-modal complementarity, enabling the model to remain effective even when one modality becomes unreliable due to adversarial spoofing or partial observability. The lower block of the figure depicts the federated server, which receives only model updates, never raw data, thus preserving privacy. The server performs four key operations: (i) prediction consistency evaluation via KL divergence; (ii) feature consistency evaluation using reference-set embedding distances; (iii) MAD-based robust normalization, which mitigates the influence of malicious or degenerate clients; and (iv) weighted aggregation using ReGCA scores.

The dashed arrow connecting the server back to the client modules symbolizes the broadcast of the updated global model after each aggregation cycle.

Together, these processes provide strong Byzantine resilience, preventing anomalous model updates from dominating the global parameter trajectory while maintaining high accuracy in benign settings.

3.2. Methodology and Data Flow

3.2.1. Cross-Modal Input Streams

CM-BRF-ViT operates on two heterogeneous but complementary modalities:

Cyber modality: UAV network traffic features extracted from UAVIDS-2025.
Cyber-physical modality: WiFi and network-level features representing physical-layer and protocol-level behavior.

These modalities capture different attack manifestations: network-level anomalies (e.g., flooding, Sybil, wormhole) and cyber-physical inconsistencies (e.g., replay and false data injection).

3.2.2. GAF-Based Visual Representation

Each tabular feature vector is independently transformed into a Gramian Angular Field (GAF) image of size 32 × 32. This step maps one-dimensional feature correlations into a structured image domain while preserving temporal and relational dependencies.

By projecting both cyber and cyber-physical data into a shared visual space, the framework enables a unified representation that can be processed by the same ViT backbone, thereby avoiding handcrafted feature engineering and modality-specific network designs.

3.2.3. Vision Transformer Feature Learning

Each modality is processed by an identical Vision Transformer backbone, consisting of non-overlapping 4 × 4 patch embeddings, learnable positional encodings, multi-head self-attention layers, and feed-forward MLP blocks.

The ViT architecture enables global dependency modeling, allowing the detector to capture long-range interactions across features that are difficult to learn using conventional convolutional or recurrent models. This is particularly beneficial in UAV intrusion detection, where attack signatures are often distributed and non-local. Each ViT produces a modality-specific attack probability: p^{UAV} and p^{CP}.

3.2.4. Federated Training with Byzantine-Robust Aggregation

To preserve data privacy, model training is performed via federated learning, in which each UAV client updates its local copy of the CM-BRF-ViT model using its private data. Only model updates are transmitted to the server.

Unlike standard federated approaches, CM-BRF-ViT integrates Reference-GAF Consistency Aggregation (ReGCA) at the server side. ReGCA evaluates each client update using prediction consistency on a trusted reference dataset and feature-level semantic consistency from ViT embeddings. By combining these two criteria with MAD-based robust normalization, ReGCA effectively suppresses malicious or abnormal client updates while preserving honest but heterogeneous contributions. This design enables CM-BRF-ViT to maintain high detection accuracy even when a significant fraction of participating clients behaves in a Byzantine manner.
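
The exact ReGCA scoring rule is not reproduced in this subsection, so the following is only a sketch of how two consistency scores could be combined with MAD-based robust normalization. The mixing weight `beta`, the cutoff `z_cut`, the constant 1.4826 (the usual MAD-to-standard-deviation factor for Gaussian data), and both function names are assumptions made for illustration.

```python
import numpy as np

def mad_robust_weights(pred_score, feat_score, beta=0.5, z_cut=2.5):
    """Turn per-client consistency scores into aggregation weights (hypothetical rule)."""
    s = beta * np.asarray(pred_score, float) + (1.0 - beta) * np.asarray(feat_score, float)
    med = np.median(s)
    mad = np.median(np.abs(s - med))
    # Robust z-score: distance from the median in MAD units.
    z = (s - med) / (1.4826 * mad + 1e-12)
    # Clients scoring far below the median are treated as unreliable and get zero weight.
    w = np.where(z > -z_cut, np.clip(s, 0.0, None), 0.0)
    total = w.sum()
    return w / total if total > 0 else np.full_like(w, 1.0 / len(w))

def aggregate(updates, weights):
    """Weighted average of client update vectors."""
    return np.tensordot(weights, np.stack(updates), axes=1)

scores = [0.92, 0.90, 0.91, 0.93, 0.89, 0.90, 0.92, 0.91, 0.20, 0.15]
weights = mad_robust_weights(scores, scores)
updates = [np.ones(3)] * 8 + [np.full(3, -100.0)] * 2
print(aggregate(updates, weights))  # close to [1, 1, 1]; the two outliers get zero weight
```

Using the median and MAD rather than the mean and standard deviation keeps the cutoff stable when a minority of scores is extreme, which is the property the text attributes to the robust normalization step.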

4. Experimental Results and Discussion

This section evaluates the performance of the proposed CM-BRF-ViT framework through extensive experiments conducted on the UAVIDS-2025 and Cyber-Physical datasets. The evaluation focuses on detection accuracy, robustness against Byzantine clients, cross-modal fusion effectiveness, and component-level contributions, as assessed through ablation studies.

4.1. Evaluation Setup and Metrics

The experimental evaluation follows a federated learning setup with K = 10 participating clients and up to 50 communication rounds. The UAVIDS-2025 [2] dataset represents cyber-level UAV network traffic, while the Cyber-Physical dataset includes WiFi and network-level attack scenarios. Detection performance is evaluated using accuracy, F1-score, and Area Under the Curve (AUC). To assess adversarial resilience, Byzantine behavior is simulated by injecting malicious client updates through label-flipping and gradient-noise attacks at varying client ratios (0–40%). All reported results are averaged over multiple runs to ensure stability.
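
A simple way to picture the Byzantine simulation is sketched below: a fraction of the K = 10 clients is marked malicious, and those clients either flip their binary labels or add noise to their updates. The noise scale `sigma`, the use of Gaussian noise, and all function names are illustrative assumptions; the section does not specify the attack parameters.

```python
import numpy as np

def pick_byzantine(num_clients, ratio, rng):
    """Choose which clients behave maliciously for a given Byzantine ratio."""
    n_bad = int(round(ratio * num_clients))
    return set(rng.choice(num_clients, size=n_bad, replace=False).tolist())

def flip_labels(y):
    """Label-flipping attack for binary labels: Normal (0) <-> Attack (1)."""
    return 1 - np.asarray(y)

def noisy_update(update, sigma, rng):
    """Gradient-noise attack: perturb an update with zero-mean Gaussian noise (assumed form)."""
    return update + rng.normal(0.0, sigma, size=update.shape)

rng = np.random.default_rng(0)
bad_clients = pick_byzantine(num_clients=10, ratio=0.4, rng=rng)
print(len(bad_clients))          # 4
print(flip_labels([0, 1, 1]))    # [1 0 0]
```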

In the federated learning setup, both the cyber and cyber-physical datasets are partitioned across multiple UAV clients to simulate a realistic distributed environment. Each UAV is treated as an independent federated client and is assigned a local subset of the data corresponding to its operational observations. The data distribution across clients is non-IID, reflecting heterogeneous traffic patterns, sensor readings, and mission conditions encountered by different UAVs. No data samples are shared among clients, and each UAV performs local training on its assigned subset before transmitting model updates to the central server.

All experiments were implemented using Python 3.12.12 within a controlled and unified software environment to ensure full reproducibility and fair comparison. Both the proposed model and baseline methods were implemented using PyTorch 2.9.0 (CUDA 12.6). Supporting libraries, including NumPy 2.0.2, Scikit-learn 1.6.1, and Matplotlib 3.10.0, were employed for data preprocessing, performance evaluation, and result visualization. Training and evaluation were conducted on a GPU-enabled computing platform under identical software and hardware conditions.

The cyber-physical dataset used in this study is distinct from UAVIDS-2025 and consists of WiFi-level and cyber-physical telemetry features collected under both normal operation and cyberattack scenarios. This dataset is publicly available and was originally introduced in [28], which provides detailed information on data collection, feature definitions, and attack scenarios. In contrast, UAVIDS-2025 [2] is used exclusively for cyber-layer UAV network traffic analysis.

4.2. Results on UAVIDS-2025 Dataset

4.2.1. Detection Performance Evaluation

Table 1 summarizes the detection performance on the UAVIDS-2025 dataset. The proposed CM-BRF-ViT framework achieves 97.1% accuracy, significantly outperforming conventional federated baselines. The high detection accuracy indicates that the GAF-based ViT representation effectively captures discriminative network traffic patterns. Compared with FedAvg-based ViT models, CM-BRF-ViT provides an absolute improvement of approximately five percentage points, confirming that attention-based global modeling is beneficial for UAV intrusion detection. The confusion matrix analysis reveals a low false-negative rate, which is particularly important for UAV security scenarios where missed attacks can lead to severe operational consequences. The stable convergence across communication rounds further demonstrates that federated training does not degrade the ViT model’s discriminative capacity when combined with robust aggregation.

Figure 2 presents the comprehensive experimental results of CM-BRF-ViT on the UAVIDS-2025 benchmark dataset, including federated learning convergence, Byzantine-robust client filtering, final performance metrics, and binary-classification confusion matrices.

The inference pseudocode demonstrates a structured and principled approach to processing heterogeneous UAV telemetry and cyber data. By normalizing features, transforming them into GASF images, encoding them with a shared ViT, and applying a learned cross-modal fusion mechanism, the model delivers robust, fully integrated intrusion prediction. The threshold-aware decision rule further ensures applicability in safety-critical UAV environments, where calibrated outputs and transparent decision boundaries are essential.

Figure 2 presents a detailed evaluation of the proposed CM-BRF-ViT model on the UAVIDS-2025 dataset, using both raw confusion-matrix counts and normalized percentage-level classification outcomes. Together, these visualizations illustrate the model’s robustness in distinguishing between benign (Normal) and malicious (Attack) UAV activities under realistic operational conditions. The left panel reports the absolute prediction counts across the two binary classes. The model correctly classifies: 3834 Normal samples as Normal, and 14,335 Attack samples as Attack.

Only 83 Normal samples were misclassified as Attack (false positives), and 74 Attack samples were misclassified as Normal (false negatives). The minimal number of false negatives is particularly relevant in intrusion detection, where failing to detect an actual attack is significantly more costly than a false alarm. These raw counts demonstrate the model’s capacity to maintain very low error rates across a large test population. The inference procedure of the proposed CM-BRF-ViT model is summarized in Algorithm 1.

Algorithm 1: Inference Procedure of CM-BRF-ViT
Step	Description
Input	Cyber feature vector $x_{cyber}$; cyber-physical feature vector $x_{phys}$; trained model parameters $\theta^{*}$; decision threshold $\tau_{dec}$
Output	Predicted label $\hat{y}$; fused attack probability $p_{fused}$
1	Normalize cyber and cyber-physical features.
2	Transform normalized features into GASF images $I_{cyber}$, $I_{phys}$.
3	Encode each image using the shared ViT to obtain $(p_{UAV}, e_{UAV})$ and $(p_{CP}, e_{CP})$.
4	Concatenate probability vectors: $z = [p_{UAV} \,\|\, p_{CP}]$.
5	Compute fused probability: $p_{fused} = \mathrm{FusionMLP}(z)$.
6	Apply decision rule: $\hat{y} = \arg\max(p_{fused})$ (or a threshold-based decision if $\tau_{dec}$ is provided).
Return	$\hat{y}$, $p_{fused}$
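
The steps of Algorithm 1 can be traced in a short Python sketch. The trained ViT and FusionMLP are not available here, so `toy_encoder` and `toy_fusion` are hypothetical stand-ins that only mimic the interfaces; treating index 1 of each probability vector as the Attack class is also an assumption.

```python
import numpy as np

def minmax(x):
    """Step 1: rescale features to [-1, 1]."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    return np.zeros_like(x) if hi == lo else 2.0 * (x - lo) / (hi - lo) - 1.0

def gasf(x):
    """Step 2: Gramian Angular Summation Field of a rescaled vector."""
    phi = np.arccos(np.clip(x, -1.0, 1.0))
    return np.cos(phi[:, None] + phi[None, :])

def infer(x_cyber, x_phys, encoder, fusion, tau_dec=None):
    i_cyber, i_phys = gasf(minmax(x_cyber)), gasf(minmax(x_phys))
    p_uav, _e_uav = encoder(i_cyber)          # Step 3: shared encoder on each image
    p_cp, _e_cp = encoder(i_phys)
    z = np.concatenate([p_uav, p_cp])         # Step 4: concatenate probability vectors
    p_fused = fusion(z)                       # Step 5: fused probability
    if tau_dec is not None:                   # Step 6: threshold rule if provided
        y_hat = int(p_fused[1] >= tau_dec)
    else:
        y_hat = int(np.argmax(p_fused))
    return y_hat, p_fused

def toy_encoder(img):
    """Placeholder for the trained ViT: returns ([p_normal, p_attack], embedding)."""
    p = 1.0 / (1.0 + np.exp(-img.mean()))
    return np.array([1.0 - p, p]), img.ravel()

def toy_fusion(z):
    """Placeholder for FusionMLP: averages the two attack probabilities."""
    p_attack = 0.5 * (z[1] + z[3])
    return np.array([1.0 - p_attack, p_attack])

y_hat, p_fused = infer(np.arange(8.0), np.arange(5.0), toy_encoder, toy_fusion)
```

In this sketch the GASF image size equals the number of input features; a fixed 32 × 32 size would require an additional resampling step.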

The right panel of Figure 2 presents normalized percentages, enabling comparisons independent of class imbalance. The model correctly recognizes 97.88% of Normal samples, with only 2.12% false alarms, and 99.49% of Attack samples, with only 0.51% missed detections. The near-perfect classification of attack behaviors highlights the effectiveness of the cross-modal fusion mechanism and the rich GASF–ViT representation, allowing the model to capture subtle deviations in both cyber and cyber-physical channels. The very low false-negative rate further supports the model’s suitability for safety-critical UAV environments.
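
The normalized percentages follow directly from the raw counts reported for the left panel. The short check below recomputes them, treating Attack as the positive class (a labeling convention assumed here).

```python
# Raw confusion-matrix counts reported for UAVIDS-2025 (Attack = positive class).
tn, fp = 3834, 83      # Normal -> Normal, Normal -> Attack
fn, tp = 74, 14335     # Attack -> Normal, Attack -> Attack

rates = {
    "normal_correct_pct": 100.0 * tn / (tn + fp),
    "false_alarm_pct": 100.0 * fp / (tn + fp),
    "attack_detected_pct": 100.0 * tp / (tp + fn),
    "missed_attack_pct": 100.0 * fn / (tp + fn),
}
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
# normal_correct_pct: 97.88
# false_alarm_pct: 2.12
# attack_detected_pct: 99.49
# missed_attack_pct: 0.51
```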

Figure 3 provides a comprehensive analysis of the attack-probability behavior of the proposed CM-BRF-ViT model across different UAVIDS-2025 attack categories. The four subfigures collectively demonstrate the model’s statistical reliability, its discriminative sharpness between benign and malicious behavior, and its robustness across heterogeneous attack types. The density plot (top left) illustrates a distinct bimodal separation between Normal and Attack samples. Normal instances cluster sharply near zero probability, while Attack instances are concentrated near one, with almost no overlap between the two distributions. The vertical threshold line at 0.5 highlights that the model’s natural probability separation aligns perfectly with the canonical decision boundary, indicating extremely low prediction ambiguity, high confidence for both classes, and strong calibration of the fused probability output.

This clear separability is a strong indicator of effective multimodal representation and successful cross-modal fusion. The box-and-whisker visualization (top right) compares predicted attack probabilities across major UAVIDS-2025 traffic categories, including Blackhole, Flooding, Normal Traffic, Sybil, and Wormhole behaviors. All attack categories exhibit high median probabilities near 1.0, indicating consistent detection performance across diverse adversarial patterns. Normal Traffic samples maintain probabilities near 0, demonstrating robust avoidance of false positives. The narrow interquartile ranges for most attack types suggest low variance and high certainty, even in scenarios where signal characteristics vary widely. Occasional outliers are present but remain well above the decision threshold, indicating resilience against noisy or atypical attack signals. Overall, this plot highlights that the model generalizes effectively across multiple attack families without mode collapse or class-specific bias. The mean class probabilities panel (bottom left) summarizes the central tendency and variability of the predicted probabilities across categories. Each attack class maintains a high mean attack probability and Normal instances a low one, with controlled variance, reflecting stable model behavior.

These results align with expectations for a well-calibrated classifier that maintains consistent confidence across heterogeneous temporal and tabular input patterns. The balance between high accuracy and controlled variance is critical for real-world UAV intrusion detection, where uncertainty must be minimized. The ROC curve (bottom right) demonstrates near-perfect separability between the two classes, achieving an AUC of 0.999. This exceptional performance indicates outstanding sensitivity to attack conditions (a high true-positive rate) and strong specificity against false alarms (a low false-positive rate).

The curve remains close to the upper-left corner across the entire threshold spectrum, confirming the model’s robustness to changes in operating thresholds. Such behavior is essential for deployment across UAV systems with varying tolerance levels for false positives.

Figure 4 presents the federated learning performance of the cyber-physical modality within the CM-BRF-ViT framework. The results capture (Figure 4A) the convergence behavior of validation and test accuracy, (Figure 4B) the ReGCA-based client filtering behavior, and (Figure 4C) the final predictive performance achieved after 10 communication rounds. Together, these plots illustrate the learning dynamics, robustness, and generalization capabilities of the cyber-physical pathway. The accuracy curves (top panel) demonstrate a smooth, stable convergence trajectory across the communication rounds. Accuracy increases from approximately 67–68% in the first round to 72% by round 3, indicating that the cyber-physical modality quickly benefits from shared global knowledge even with decentralized data. Subsequent rounds continue to improve performance, reaching 77.9% validation and 78.5% test accuracy by the final round. The close alignment between the two curves suggests that the model generalizes well and does not overfit or exhibit instability across communication rounds. These trends confirm that the cyber-physical features, though inherently noisier and more variable than purely cyber signals, can be effectively learned in a federated setting using a ViT-based architecture. The ReGCA filtering heatmap (bottom left) indicates that all participating clients remain classified as reliable across all communication rounds. No clients are marked as dropped (red), demonstrating high consistency between local model updates and the trusted server-side reference distribution.

The consistent retention of all clients suggests that the cyber-physical data held by each client contain no adversarial poisoning or severe anomalous deviations that would violate the ReGCA thresholds. The filtering mechanism nevertheless serves as a safeguard, ensuring that Byzantine or noisy clients, if present, do not propagate harmful updates to the global model. The temporary accuracy drop around communication round 8 is attributed to increased inter-client heterogeneity, particularly the higher variance of cyber-physical modality updates; this fluctuation reflects the robustness-oriented filtering behavior of the proposed ReGCA aggregation.

The absence of dropped clients validates both the integrity of the dataset and the robustness of the federated update process. The bar plot (bottom right) summarizes the final federated performance: 77.9% validation accuracy and 78.5% test accuracy. These values demonstrate that the cyber-physical modality alone is moderately predictive but less discriminative than the fused cross-modal model. This reinforces the paper’s central hypothesis: cyber-physical signals provide complementary information but require fusion with cyber features to achieve high-performance intrusion detection. Notably, the consistency between validation and test accuracy again highlights strong generalization and the absence of overfitting.

Figure 5 presents the binary classification performance of the proposed CM-BRF-ViT model when applied exclusively to the cyber-physical modality (the Cyber-Physical dataset, which is distinct from UAVIDS-2025). The two subfigures report (left) raw confusion matrix counts and (right) normalized percentage-level performance, providing complementary perspectives on the model’s behavior under single-modality evaluation.

Figure 5 shows that the cyber-physical modality provides reliable but not fully sufficient discrimination for intrusion detection. The model maintains low false-positive rates, which is critical for operational UAV systems where false alarms can trigger unnecessary evasive actions or flight interruptions. The increased false-negative rate relative to the cross-modal model confirms the need to integrate cyber features to capture attack signatures that are invisible to physical telemetry alone. These findings validate the design choice behind the CM-BRF-ViT architecture: cyber-physical data provide valuable yet incomplete signals, and optimal performance emerges only when combined with cyber-modality information via a learnable fusion mechanism.

4.2.2. Cross-Modal Fusion Effectiveness

Figure 6 provides an in-depth examination of the attack-probability behavior of the cyber-physical modality across multiple cyber-physical attack categories. The subfigures collectively offer insights into separability between normal and attack behavior, inter-class variability, model calibration, and threshold-based discriminative performance.

The density plot (left) reveals a strongly bimodal distribution of predicted attack probabilities. Benign samples cluster distinctly near the lower end of the probability spectrum (close to 0), while attack samples concentrate sharply around 1.0. Notably, there is minimal overlap between the two distributions, indicating high discriminative clarity. The standard decision threshold of 0.5 (dashed line) aligns precisely with the gap between the two modes. The distribution suggests the model is well calibrated for the cyber-physical channel, despite the inherent noise and variability of sensor-driven features. This confirms that the learned representation effectively captures the differences between benign flight behavior and adversarial sensor manipulation.

The boxplot (right) compares attack probabilities across several cyber-physical classes: DoS, FDI, Replay, benign, and Evil-Twin-like behaviors. The following patterns emerge: Attack classes (DoS, FDI, Replay, Evil Twin) exhibit median probabilities near 1.0, indicating consistent model confidence across diverse adversarial patterns.

Benign samples retain low median values, with the vast majority falling below the 0.5 decision threshold. Variability in the Replay and FDI classes reflects the temporal and physical complexity of these attack patterns, yet they remain well above the attack threshold.

Outliers are present but do not threaten classification reliability, as they remain clearly separated from the benign distribution.

This plot reinforces the model’s capacity to adapt to multiple cyber-physical attack signatures without degrading performance across classes.

Figure 7 presents a detailed evaluation of the cyber-physical branch of the proposed intrusion detection framework. The mean class probability analysis (left) illustrates that attack classes—DoS, FDI, Replay, and Evil Twin—exhibit consistently higher mean predicted attack probabilities than the benign class. This separation demonstrates that the model effectively captures modality-specific behavioral signatures encoded in cyber-physical telemetry. Although certain attack types (e.g., Replay and Evil Twin) exhibit wider confidence intervals, reflecting their inherent variability, their probability distributions remain distinctly elevated relative to those of benign samples. This indicates robust decision boundaries and low confusion between benign and malicious UAV states.

The ROC curve (right) further quantifies binary detection performance, achieving an AUC of 0.975, which denotes excellent discriminative capability. The curve’s steep rise near the upper-left region indicates high true-positive rates at very low false-positive levels. This is a critical property in UAV security scenarios, where false alarms can disrupt mission autonomy and missed detections may compromise operational safety. The near-perfect AUC demonstrates that even without multimodal fusion, the cyber-physical feature stream alone provides strong separability between benign and attack conditions.

Overall, Figure 7 confirms that the cyber-physical modality provides highly reliable, well-calibrated detection signals, reinforcing its role as an essential component of the CM-BRF-ViT intrusion detection architecture.

Figure 8 presents a comprehensive evaluation of the cross-modal fusion mechanism, demonstrating that the proposed learnable fusion strategy integrates UAV-side cyber signals with cyber-physical telemetry to achieve superior intrusion-detection performance compared with unimodal or fixed-weight baselines.

The cross-modal attack probability space (top left) clearly illustrates the complementary nature of the two modalities. While some samples exhibit high attack probability in only one branch, actual attack instances typically cluster near the top-right region, where both modalities assign high likelihood. This indicates that the UAV-only and cyber-physical-only predictors capture distinct but synergistic aspects of malicious behavior, reinforcing the motivation for learnable fusion. Conversely, benign samples are densely concentrated in the lower-left region, indicating consistent agreement between modalities in normal operational states.

The ROC comparison (top center) further quantifies this complementarity. The UAV-only and CP-only branches achieve AUC values of 0.914 and 0.874, respectively. The fixed-fusion baseline (α = 0.5) achieves an AUC of 0.996, and the proposed learnable cross-modal fusion achieves a comparable AUC of 0.993 while dynamically weighting modality contributions based on the statistical evidence present in each sample, which offers better adaptability across operating points. The substantial gain of the fused classifiers over both unimodal branches confirms the effectiveness of incorporating cross-modal interactions rather than treating modalities independently.

The fusion method comparison bar chart (top right of subplot cluster) provides a direct visual summary of these findings, with learnable fusion outperforming all alternative approaches. This highlights the model’s ability to exploit nonlinear dependencies between cyber and cyber-physical representations—an ability that fixed or unimodal methods lack.
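
The difference between the two fusion strategies can be sketched as follows. The fixed baseline is a convex combination with α = 0.5, while the learnable head is shown as a small one-hidden-layer MLP with untrained random weights; the hidden size, initialization, and class name `FusionMLP` as written here are illustrative assumptions rather than the architecture used in the experiments.

```python
import numpy as np

def fixed_fusion(p_uav, p_cp, alpha=0.5):
    """Fixed-weight baseline: same mixing weight for every sample."""
    return alpha * p_uav + (1.0 - alpha) * p_cp

class FusionMLP:
    """Tiny learnable fusion head (forward pass only, weights untrained)."""

    def __init__(self, hidden=8, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.5, (2, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.5, (hidden, 1))
        self.b2 = np.zeros(1)

    def __call__(self, p_uav, p_cp):
        z = np.stack([p_uav, p_cp], axis=-1)           # (N, 2) attack probabilities
        h = np.maximum(z @ self.w1 + self.b1, 0.0)     # ReLU hidden layer
        logit = (h @ self.w2 + self.b2)[..., 0]
        return 1.0 / (1.0 + np.exp(-logit))            # fused attack probability

p_uav = np.array([0.9, 0.1])
p_cp = np.array([0.3, 0.2])
print(fixed_fusion(p_uav, p_cp))   # [0.6  0.15]
print(FusionMLP()(p_uav, p_cp))    # values in (0, 1); meaningful only after training
```

Because the MLP is nonlinear in both inputs, its effective weighting of the two modalities can vary from sample to sample, which a single α cannot do.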

The precision–recall curve (bottom left) shows that the fused classifier achieves a near-perfect average precision (AP = 0.995), markedly surpassing the UAV-only branch (AP = 0.955). This is particularly important in UAV intrusion scenarios, where the imbalance between benign and malicious traffic can inflate ROC-based metrics; PR curves provide a more sensitive view of precision under such conditions. The near-vertical shape of the fused PR curve demonstrates excellent precision retention even at high recall levels.

The learnable fusion output distribution (bottom center) exhibits a sharply bimodal probability structure, with benign samples tightly clustered near zero and attack samples overwhelmingly concentrated near one. This indicates that the fused classifier produces well-calibrated decision outputs with minimal class overlap—an essential property for operational UAV intrusion detection systems, where uncertainty must be minimized.

Finally, the fused classifier’s confusion matrix (bottom right) quantitatively confirms this behavior. The model yields very low false-positive and false-negative counts, correctly identifying 1507 attack instances while misclassifying only a small fraction of Normal samples. These results collectively demonstrate that the proposed learnable fusion method not only integrates cross-modal information effectively but also produces substantially more reliable and discriminative predictions than the unimodal alternatives.

Overall, Figure 8 establishes that cross-modal learning with a trainable fusion module is critical for maximizing detection robustness and highlights the intrinsic complementarity between cyber and cyber-physical feature spaces in UAV intrusion detection.

4.2.3. Ablation and Sensitivity Analysis

Figure 9 presents a comprehensive assessment of the model’s component-wise contributions and its Byzantine-robustness characteristics across increasingly adversarial federated learning settings. Together, these results validate both the architectural design choices of the CM-BRF-ViT framework and the resilience of the proposed ReGCA aggregation strategy. The ablation results (top panel) quantify the incremental performance gains achieved by each architectural element. The baseline FedAvg + MLP configuration exhibits the lowest performance, particularly in the F1-score and AUC, underscoring the limitations of shallow classifiers for UAV intrusion features. Incorporating the Vision Transformer (FedAvg + ViT) yields substantial improvements across all metrics, confirming the importance of transformer-based temporal–spatial encoding.

Introducing cross-modal fusion (FedAvg + ViT + Fusion) provides an additional boost, demonstrating that integrating cyber and cyber-physical modalities leads to a more discriminative and robust feature space. The ReGCA + ViT (Single) configuration further improves performance, underscoring the benefits of Byzantine-robust aggregation even without cross-modal fusion. The complete CM-BRF-ViT model achieves the highest scores—96.2% accuracy, 96.0% F1-score, and 96.8% AUC—demonstrating that the synergy of ReGCA, ViT encoding, and cross-modal fusion forms the most effective architecture.

Overall, the ablation results indicate that each component contributes meaningfully, and full integration yields the strongest intrusion-detection capability.

The robustness experiment (bottom left) examines how test accuracy degrades as the proportion of malicious (Byzantine) clients increases. FedAvg suffers rapid, monotonic degradation, collapsing to 45.2% accuracy at 40% adversarial participation. Trimmed Mean and BDRFA demonstrate moderate resilience but still exhibit notable decreases at high Byzantine ratios.

In contrast, the proposed ReGCA method maintains high stability, achieving 89.6% accuracy even when 40% of clients are adversarial. This indicates that ReGCA effectively suppresses manipulated updates while preserving helpful client contributions, ensuring consistent model performance in hostile federated environments typical of UAV networks.

The divergence between ReGCA and other baselines widens as the threat level increases, confirming that standard aggregation rules are insufficient for UAV systems, which are vulnerable to coordinated poisoning attempts.

The degradation analysis (bottom right) quantifies the relative performance drop at a 40% Byzantine ratio. FedAvg exhibits catastrophic vulnerability with a 46.9% accuracy loss, whereas Trimmed Mean and BDRFA reduce the degradation to 19.4% and 12.8%, respectively. ReGCA demonstrates exceptional robustness, with only 6.6% degradation, making it the only method capable of maintaining high performance under severe adversarial conditions.

This result highlights the practical significance of ReGCA for real-world UAV deployments, where communication links and clients cannot be fully trusted. By isolating inconsistent or malicious updates through reliability-aware scoring, ReGCA safeguards model integrity and prevents system-wide collapse.
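
The reported figures can be cross-checked with simple arithmetic. The sketch below assumes that the degradation values are absolute drops in percentage points (an interpretation, since the text calls them relative), in which case the clean-setting accuracy implied for each method is its accuracy at 40% plus its drop.

```python
# Accuracy (%) at a 40% Byzantine ratio and reported degradation, taken from the text.
acc_at_40 = {"FedAvg": 45.2, "ReGCA": 89.6}
degradation = {"FedAvg": 46.9, "ReGCA": 6.6}

# Implied accuracy without adversaries, assuming percentage-point degradation.
implied_clean = {m: round(acc_at_40[m] + degradation[m], 1) for m in acc_at_40}
gap = round(acc_at_40["ReGCA"] - acc_at_40["FedAvg"], 1)

print(implied_clean)  # {'FedAvg': 92.1, 'ReGCA': 96.2}
print(gap)            # 44.4
```

The 44.4-point gap matches the "more than 44 percentage points" advantage stated for ReGCA over FedAvg.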

4.2.4. Byzantine Robustness Analysis

Figure 10 presents a unified overview of the experimental performance of the proposed Cross-Modal Byzantine-Robust Federated Vision Transformer (CM-BRF-ViT) across the UAVIDS-2025 and Cyber-Physical datasets, demonstrating its learning dynamics, classification reliability, and fused cross-modal decision behavior. Panel (A) illustrates the convergence behavior of federated training for both datasets. UAVIDS-2025 exhibits rapid improvement during early communication rounds, stabilizing above 97% accuracy by round 10. The Cyber-Physical dataset follows a smoother trajectory, reaching approximately 78% accuracy. This contrast reflects intrinsic modality differences, yet both curves demonstrate stable training without oscillations—an indication that the proposed ReGCA aggregation effectively suppresses noisy or adversarial updates. Panel (B) compares ROC curves for UAV-only, cyber-physical-only, and random baselines. The UAV modality achieves the highest discrimination capability, with an AUC of 0.994, reflecting the strong separability of attack patterns in UAV telemetry. The cyber-physical modality achieves an AUC of 0.974, confirming the model’s generalization across heterogeneous feature sources. Both far exceed the random baseline (AUC = 0.501).

The ROC curves further indicate that ViT encoders paired with ReGCA aggregation achieve near-optimal detection performance under federated constraints.

UAVIDS-2025 achieves extremely low false-negative (74) and false-positive (83) counts despite the large sample size, confirming strong sensitivity and specificity.

Cyber-physical results show slightly higher false-negative rates, consistent with the less structured nature of physical sensor traces.

Overall, both unimodal classifiers produce reliable predictions that serve as robust inputs to the cross-modal fusion stage.

Panel (E) demonstrates the effect of learnable cross-modal fusion. The fused classifier substantially reduces misclassification relative to the unimodal systems, yielding only 3 false positives and 20 false negatives. This represents a significant improvement in both precision and recall, illustrating that cyber and cyber-physical cues are complementary and mutually reinforcing when processed jointly.

Panel (F) shows the probability density of fused attack predictions. The distribution is distinctly bimodal, with normal scores concentrated near 0.2 and attack scores near 1.0, separated by a wide margin around the decision threshold (0.5). This sharp separation indicates high confidence and low model uncertainty, confirming that the fusion layer effectively integrates multimodal evidence into a stable, well-calibrated decision boundary.

Panel (G) evaluates the robustness of the proposed ReGCA aggregation mechanism under increasing proportions of Byzantine (malicious) clients, comparing performance against the standard FedAvg baseline.

FedAvg exhibits monotonic degradation, dropping from approximately 95% accuracy to 45.2% when 40% of participating clients are adversarial. This sharp decline highlights FedAvg’s vulnerability to poisoned or inconsistent updates. ReGCA (Ours) consistently maintains performance above 89%, even at the highest Byzantine ratio tested (40%). The near-flat performance curve of ReGCA demonstrates strong resistance to gradient-poisoning attacks, effective filtering of anomalous updates, and stable global convergence under adversarial pressure.

Overall, the results confirm that ReGCA provides substantial resilience: it outperforms FedAvg by over 44 percentage points at a 40% Byzantine ratio, making it a suitable choice for safety-critical UAVs and cyber-physical systems.

Panel (H) summarizes four core performance dimensions of the CM-BRF-ViT framework:

UAVIDS Accuracy: 97.1%. This demonstrates strong detection capability in UAV telemetry data, benefiting from both GASF encoding and ViT-based feature extraction.

Cyber-Physical Accuracy: 78.5%. This highlights effective generalization to heterogeneous sensor-based intrusion scenarios where attack signatures are more subtle and less structured.

Fused AUC: 99.3%. The extremely high AUC supports the advantage of learnable cross-modal fusion, which leverages complementary evidence across cyber and physical modalities.

Byzantine robustness at 40%: 89.6%. This indicates that the full CM-BRF-ViT pipeline (combining ViT encoders, fusion layers, and ReGCA aggregation) maintains high predictive quality even under severe adversarial contamination.

Collectively, these results show that CM-BRF-ViT achieves state-of-the-art multimodal intrusion-detection accuracy while demonstrating exceptional robustness against adversarial clients in federated learning settings.

Table 1 demonstrates that CM-BRF-ViT consistently outperforms all baseline federated intrusion detection models across four primary evaluation criteria: cyber-layer accuracy (UAVIDS), cyber-physical accuracy, fused AUC, and Byzantine robustness.

Cyber-layer accuracy (UAVIDS): CM-BRF-ViT achieves 97.1% accuracy, outperforming FedAvg + ViT by 5.0 percentage points. This improvement validates the contribution of GAF-based visual encoding and ViT’s long-range dependency modeling.

Cyber-physical accuracy: By achieving 78.5%, the model surpasses all averaging-based methods, demonstrating that semantic consistency constraints in ReGCA help stabilize representations, even for subtle physical-layer anomalies.

Near-perfect fusion AUC = 0.993: The fused classifier yields almost ideal separability between attack and benign samples. The +0.121 AUC gain over FedAvg + MLP highlights the strength of adaptive cross-modal fusion.

Byzantine robustness: With 89.6% accuracy at 40% malicious clients, CM-BRF-ViT outperforms FedAvg by over 44 percentage points and Trimmed Mean by 17.2 points, demonstrating the effectiveness of joint prediction–feature consistency scoring.

CM-BRF-ViT offers a comprehensive advantage, combining strong predictive performance, strong cross-modal generalization, and exceptional adversarial robustness. None of the baseline approaches delivers strong results across all criteria simultaneously.

Table 2 provides a detailed robustness analysis under increasing proportions of Byzantine adversaries. The results indicate that ReGCA outperforms both FedAvg and Trimmed Mean in all adversarial settings.

At 0% adversaries: ReGCA reaches 97.1% accuracy, confirming that robust aggregation does not harm performance even without adversarial pressure.

10–20% Byzantine clients: While FedAvg degrades steadily (85.4% → 78.5%), ReGCA maintains over 94% accuracy, illustrating resilience to moderate poisoning.

40% Byzantine clients: This is the most striking scenario. FedAvg collapses to 45.2%, essentially unusable; Trimmed Mean degrades to 72.4%, showing partial protection; ReGCA achieves 89.6%, maintaining high reliability.

These differences follow from how each rule treats client updates. FedAvg aggregates updates agnostically and is easily poisoned. Trimmed Mean removes extreme gradients but fails against feature-level inconsistency attacks. ReGCA uses dual consistency metrics (prediction + embedding space), so that semantically incorrect updates are also filtered. MAD normalization provides robustness against colluding adversaries.

ReGCA demonstrates state-of-the-art Byzantine resilience, maintaining operational reliability even when nearly half of the participating clients are malicious.
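To make the contrast between the aggregation rules concrete, the following toy sketch compares a coordinate-wise trimmed mean with a MAD-based weighting of per-client consistency scores. It is not the ReGCA implementation: the function names, the cutoff of 3 robust z-units, the trim ratio, and all numeric values are assumptions chosen for illustration, and the consistency scores are simply given as inputs. The factor 1.4826 is the usual constant that makes the MAD comparable to a standard deviation under normality.

```python
import numpy as np


def trimmed_mean(updates, trim_ratio=0.2):
    """Coordinate-wise trimmed mean: drop the k smallest and k largest values."""
    updates = np.asarray(updates, dtype=float)
    n = updates.shape[0]
    k = int(trim_ratio * n)
    kept = np.sort(updates, axis=0)[k:n - k]
    return kept.mean(axis=0)


def mad_weights(scores, threshold=3.0, eps=1e-12):
    """Turn non-negative consistency scores into aggregation weights.

    Clients whose score lies more than `threshold` robust z-units below
    the median receive weight 0 (hypothetical cutoff, not from the paper).
    """
    scores = np.asarray(scores, dtype=float)
    median = np.median(scores)
    mad = np.median(np.abs(scores - median))
    robust_z = (scores - median) / (1.4826 * mad + eps)
    weights = np.where(robust_z < -threshold, 0.0, scores)
    return weights / weights.sum()


# Toy round: 3 benign clients and 2 colluding clients (40% Byzantine).
updates = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
                    [-10.0, -10.0], [-10.0, -10.0]])
scores = np.array([0.92, 0.90, 0.91, 0.20, 0.15])  # assumed consistency scores

weights = mad_weights(scores)
robust = np.average(updates, axis=0, weights=weights)
trimmed = trimmed_mean(updates, trim_ratio=0.2)
print(weights, robust, trimmed)
```

In this toy setting the trimmed mean discards only one of the two colluding updates per coordinate and is pulled toward them, whereas the score-based weighting assigns both colluders zero weight. The example illustrates the mechanism only and says nothing about the accuracy figures in Table 2.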

4.3. Performance Comparison

Table 3 provides a structured overview of representative intrusion detection approaches developed for UAV and federated learning environments over the past five years. The studies included cover a broad methodological spectrum, ranging from centralized CNN-based classifiers and distilled lightweight models to federated multi-scale attention architectures.

Many recent UAV intrusion detection studies (e.g., [14,26]) assume access to a centralized dataset. This assumption is inconsistent with realistic UAV deployments, in which data are distributed, sensitive, and often bandwidth-limited. As shown in Table 3, only a small subset of studies adopt federated paradigms [18], yet even these do not consider adversarial resilience or cross-modal feature fusion.

Absence of Byzantine Robustness: None of the surveyed UAV-specific or IoT-driven IDS models integrates robust aggregation or adversarial defense mechanisms. Even FL-based IDS relies on the FedAvg paradigm, which is notoriously vulnerable to model poisoning and label flipping. As UAV swarms operate in contested environments and may be targeted by rogue nodes, the absence of Byzantine-tolerant design is a critical research gap.

Existing work considers either cyber traffic (e.g., network logs) or cyber-physical data (e.g., sensor readings), but few studies integrate both modalities, even though the modalities in UAV systems often carry complementary information: cyberattacks manifest as packet-level anomalies, whereas physical attacks manifest as flight-behavior deviations. Table 3 highlights that no prior work jointly models these modalities through learnable representations.

To assess robustness against Byzantine attacks, the proposed CM-BRF-ViT framework is compared with state-of-the-art federated aggregation strategies, including standard FedAvg, Trimmed Mean, and representative Byzantine-robust aggregation paradigms reported in the literature. These include secure aggregation approaches such as SEAR [21], which leverages trusted execution environments to protect client model privacy while enabling Byzantine resilience, and similarity-based poisoning attacks exemplified by Sine [22], which exploits vulnerabilities in cosine similarity to amplify model poisoning. All methods are evaluated under identical adversarial settings with varying ratios of malicious clients.

The proposed CM-BRF-ViT fills these omissions through four key innovations:

Cross-modal GAF representations unify cyber and cyber-physical signals into a consistent visual encoding space, an ability not present in earlier studies.

ReGCA-based semantic-level consistency enhances the reliability of federated training by aligning both predictive distributions and latent features across clients.

Byzantine-robust fusion + aggregation ensures stable performance even under 30–40% malicious participation, a robustness not provided by any prior UAV IDS model.

Transformer-based modeling (ViT) provides an expressive backbone for learning modality interactions at scale.

Thus, Table 3 situates this work as the first to jointly address privacy, cross-modality, and Byzantine resilience for UAV intrusion detection.
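For readers unfamiliar with the GAF representation listed among the innovations above, the sketch below shows the standard Gramian Angular Summation Field construction of Wang and Oates [11] applied to a single feature vector: values are rescaled to [-1, 1], mapped to angles via arccos, and combined pairwise as cos(φ_i + φ_j). The min-max rescaling and the function name `gasf` are assumptions made here for illustration; the normalization actually used in this paper is defined in Section 2.2 and may differ.

```python
import numpy as np


def gasf(features):
    """Gramian Angular Summation Field of a 1-D feature vector.

    Assumes min-max rescaling to [-1, 1]; a constant vector maps to zeros
    before the angular transform.
    """
    x = np.asarray(features, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        scaled = np.zeros_like(x)
    else:
        scaled = 2.0 * (x - x.min()) / span - 1.0
    scaled = np.clip(scaled, -1.0, 1.0)  # guard against rounding outside [-1, 1]
    phi = np.arccos(scaled)              # angular encoding of each feature
    return np.cos(phi[:, None] + phi[None, :])


image = gasf([0.0, 0.5, 1.0])
print(image.shape)      # (3, 3)
print(np.diag(image))   # diagonal equals 2 * scaled**2 - 1
```

An n-dimensional tabular record therefore becomes an n × n symmetric image, which is the form a ViT patch embedding can consume after resizing.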

Table 4 summarizes prior research on transformer-based intrusion detection, GAF-based feature encoding, and Byzantine-robust federated learning. The comparison reveals clear methodological fragmentation across three domains (network IDS, time-series conversion, and robust FL), none of which simultaneously addresses the multi-modal and adversarial challenges inherent to UAV systems.

Works such as [23] or [24] demonstrate the growing popularity of transformer backbones in intrusion detection. However, the models in this category operate under centralized, single-data-source, and non-federated assumptions. Consequently, they are not directly applicable to UAV networks where decentralization and modality heterogeneity are intrinsic.

Studies such as [27,29] successfully leverage GAF transformations but do not address federated training or adversarial robustness. These models highlight the strength of image-based encodings but lack mechanisms to ensure trustworthiness in collaborative environments.

Byzantine-robust aggregation methods (e.g., [8,31]) provide strong theoretical guarantees but do not operate on cross-modal features, integrate semantic-level consistency, or handle the cyber-physical complexity of UAV systems.

The combination of cross-modal representation learning, semantic-level consistency, and Byzantine-robust aggregation is essential for maintaining reliable collaborative detection across non-IID UAV clients under adversarial risk, yet it is absent from the literature. Relative to the works summarized, CM-BRF-ViT introduces:

The first cross-modal ViT-based GAF intrusion detection model for UAV cyber-physical environments.

The first integration of semantic-level Byzantine defense (ReGCA) into a transformer-driven multimodal IDS.

4.4. System Overhead and Deployment Considerations

This subsection provides an analytical discussion of system-level overhead and deployment-related considerations associated with the proposed CM-BRF-ViT framework. The purpose of this analysis is not to claim real-world deployment readiness, but to clarify the expected communication, storage, and scalability characteristics of the method and to delineate the scope of the current study.

4.4.1. Communication Overhead

In the proposed federated learning setup, communication overhead primarily arises from exchanging model updates between UAV clients and the central aggregator in each communication round. Each participating client transmits a serialized model update (or weight difference) to the server, while the server broadcasts the updated global model to the clients.

The total uplink communication cost per round scales linearly with the number of participating clients and the size of the local model update. Similarly, the downlink cost depends on the size of the global model distributed to clients. Since the proposed ReGCA aggregation operates on received updates without introducing additional model parameters, it does not increase the communication payload beyond standard federated aggregation mechanisms. Therefore, the communication complexity of CM-BRF-ViT remains comparable to conventional federated learning baselines under identical model configurations.
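The linear scaling described above can be written as a back-of-the-envelope cost model. The sketch below assumes dense, uncompressed updates stored as 32-bit floats and a unicast broadcast of the global model to each client; the function name, the parameter count, and the client count are hypothetical values, not measurements from this study.

```python
def round_traffic_bytes(num_clients, num_params, bytes_per_param=4):
    """Per-round uplink and downlink volume for dense model exchange.

    Assumes every client uploads one full update and receives one full
    copy of the global model (no compression, no multicast).
    """
    update_bytes = num_params * bytes_per_param
    uplink = num_clients * update_bytes
    downlink = num_clients * update_bytes
    return uplink, downlink


# Hypothetical example: 10 UAV clients, a 1M-parameter model.
uplink, downlink = round_traffic_bytes(num_clients=10, num_params=1_000_000)
print(uplink / 1e6, downlink / 1e6)  # megabytes per round in each direction
```

Because ReGCA only reweights the updates it receives, the same model applies to it and to FedAvg under identical model configurations; any difference would come from server-side computation rather than payload size.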

4.4.2. Storage Requirements

Client-side storage requirements are dominated by the local model parameters and, when applicable, lightweight optimizer states. No additional long-term storage is required at the client level for ReGCA beyond temporary buffers used during local training. On the server side, storage consists of global model parameters and a small set of aggregation-related statistics used to compute consistency-based weights.

Importantly, ReGCA does not require maintaining historical model ensembles or large client reputation histories, which helps limit server-side storage growth. As a result, both client and server storage requirements remain modest and scale primarily with the base model size rather than the number of clients.

4.4.3. Scalability with Increasing Number of Clients

From a scalability perspective, the communication cost of CM-BRF-ViT increases linearly with the number of participating clients per round, which is consistent with standard federated learning paradigms. The computational overhead introduced by ReGCA is associated with evaluating prediction-level and feature-level consistency for each received update. This computation scales linearly with the number of participating clients and depends on the dimensionality of the learned representations and the size of the trusted reference set.

Because the reference set is fixed and shared across rounds, the additional aggregation cost remains stable across training and does not grow with the total number of communication rounds. This design enables CM-BRF-ViT to scale to moderate-sized UAV swarms without introducing superlinear aggregation complexity.
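The scaling argument can be illustrated with a simplified scoring routine. The exact prediction-level and feature-level consistency measures are defined in Section 2.4 and are not reproduced here; the sketch below substitutes two stand-ins (label agreement on the reference set and mean cosine similarity to reference embeddings) purely to show that the work grows with the number of clients K, the reference set size R, and the embedding dimension D. All names and array shapes are assumptions.

```python
import numpy as np


def consistency_scores(client_probs, client_feats, ref_labels, ref_feats):
    """Stand-in consistency scores for K clients on a fixed reference set.

    client_probs: (K, R, C) class probabilities per client and reference sample
    client_feats: (K, R, D) embeddings per client and reference sample
    ref_labels:   (R,)      trusted labels
    ref_feats:    (R, D)    trusted reference embeddings
    Cost is O(K * R * (C + D)); it does not depend on the round index.
    """
    predictions = client_probs.argmax(axis=2)                       # (K, R)
    pred_consistency = (predictions == ref_labels[None, :]).mean(axis=1)
    dots = (client_feats * ref_feats[None, :, :]).sum(axis=2)       # (K, R)
    norms = (np.linalg.norm(client_feats, axis=2)
             * np.linalg.norm(ref_feats, axis=1)[None, :] + 1e-12)
    feat_consistency = (dots / norms).mean(axis=1)
    return pred_consistency, feat_consistency


# Toy check: client 0 agrees with the reference, client 1 is inverted.
probs = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.1, 0.9], [0.8, 0.2]]])
feats = np.array([[[1.0, 0.0], [0.0, 1.0]],
                  [[-1.0, 0.0], [0.0, -1.0]]])
ref_labels = np.array([0, 1])
ref_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
pred_c, feat_c = consistency_scores(probs, feats, ref_labels, ref_feats)
print(pred_c, feat_c)
```

In practice the dominant cost is likely to be the forward passes that produce `client_probs` and `client_feats`, which also scale as K × R per round.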

4.4.4. Behavior Under Packet Loss and Intermittent Participation

Although this study does not include network-level simulations or hardware-in-the-loop experiments, the expected behavior of the proposed framework under realistic network conditions can be discussed qualitatively. In the presence of packet loss or intermittent client participation, some client updates may fail to reach the server in a given communication round. In such cases, aggregation proceeds using the subset of successfully received updates, which is a standard assumption in federated learning systems.

Under reduced client participation, convergence may slow due to fewer updates contributing to the global model. However, the consistency-based design of ReGCA remains applicable, as aggregation weights are computed only for available updates. Consequently, the framework is expected to degrade gracefully in convergence speed rather than exhibit unstable or catastrophic behavior, provided that a sufficient number of benign clients remain active.
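The qualitative behavior described above, aggregating over whichever updates arrive, can be sketched as follows. This is a generic partial-participation routine rather than the evaluated implementation; the function name, the `min_clients` guard, and the representation of updates as flat vectors are assumptions.

```python
import numpy as np


def aggregate_received(global_model, received, weights=None, min_clients=1):
    """Apply a weighted average of the updates that reached the server.

    received: dict mapping client id -> update vector (delta from the
              global model); clients lost to packet loss are simply absent.
    weights:  optional dict of per-client weights (e.g. consistency-based);
              uniform weights are used when omitted.
    If fewer than `min_clients` updates arrived, or all weights are zero,
    the previous global model is kept unchanged for this round.
    """
    global_model = np.asarray(global_model, dtype=float)
    if len(received) < min_clients:
        return global_model
    ids = sorted(received)
    updates = np.stack([np.asarray(received[i], dtype=float) for i in ids])
    if weights is None:
        w = np.ones(len(ids))
    else:
        w = np.array([weights[i] for i in ids], dtype=float)
    if w.sum() == 0:
        return global_model
    w = w / w.sum()
    return global_model + (w[:, None] * updates).sum(axis=0)


# Clients 0 and 2 delivered updates; client 1 was lost this round.
new_model = aggregate_received(np.zeros(2), {0: [1.0, 1.0], 2: [3.0, 3.0]})
print(new_model)  # [2. 2.]
```

Normalizing the weights over the received subset keeps the step size comparable across rounds with different participation levels, which is one reason reduced participation mainly affects convergence speed.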

4.5. Ablation Study

An ablation study was conducted to assess the contribution of each significant component. Removing the ReGCA aggregation leads to a performance drop of approximately 2.9%, confirming its central role in maintaining global model integrity. Turning off the cross-modal fusion reduces AUC by 1.6%, highlighting the benefit of adaptive modality weighting. Replacing the ViT backbone with a CNN-based classifier results in an additional 3.2% reduction in detection accuracy. Overall, relative to the weakest federated baseline (FedAvg + MLP), the complete CM-BRF-ViT framework provides a cumulative performance gain of +7.9% in detection accuracy, validating the synergistic effect of its design components.

4.6. Discussion

The experimental findings demonstrate that CM-BRF-ViT successfully addresses key limitations of existing UAV intrusion detection systems. First, the GAF-ViT architecture enables attention-based global feature modeling without handcrafted feature engineering. Second, the federated learning framework preserves data privacy while maintaining high detection accuracy. Third, the proposed ReGCA mechanism significantly enhances robustness against Byzantine adversaries, a capability absent in most prior UAV IDS solutions.

The performance gap between UAV network traffic and cyber-physical data suggests that further improvements may be achieved by incorporating domain-specific feature enhancement for cyber-physical signals. Nevertheless, the strong cross-modal fusion performance indicates that joint modeling already mitigates this challenge to a large extent. From a deployment perspective, CM-BRF-ViT remains computationally feasible for UAV systems because robustness checks are performed centrally, while local clients execute only lightweight ViT inference.
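As an illustration of the difference between fixed and learnable fusion of the two modality outputs, the sketch below contrasts a fixed convex combination with a small trainable head. The form of the fusion head is an assumption: a two-input logistic regression trained by plain gradient descent is used here as a stand-in, and the probabilities and labels are toy values, not data from the experiments.

```python
import numpy as np


def fixed_fusion(p_cyber, p_cp, alpha=0.5):
    """Fixed convex combination of the two attack probabilities."""
    return alpha * np.asarray(p_cyber) + (1.0 - alpha) * np.asarray(p_cp)


def fit_fusion_head(p_cyber, p_cp, labels, lr=0.5, epochs=2000):
    """Fit [w_cyber, w_cp, bias] of a logistic fusion head by gradient descent."""
    X = np.column_stack([p_cyber, p_cp, np.ones(len(labels))])
    y = np.asarray(labels, dtype=float)
    w = np.zeros(3)
    for _ in range(epochs):
        fused = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (fused - y) / len(y)   # gradient of mean cross-entropy
    return w


def learnable_fusion(p_cyber, p_cp, w):
    """Fused attack probability from the trained head."""
    z = w[0] * np.asarray(p_cyber) + w[1] * np.asarray(p_cp) + w[2]
    return 1.0 / (1.0 + np.exp(-z))


# Toy data: the cyber stream is informative, the cyber-physical stream is not.
p_cyber = np.array([0.9, 0.8, 0.2, 0.1])
p_cp = np.array([0.6, 0.4, 0.5, 0.5])
labels = np.array([1, 1, 0, 0])

w = fit_fusion_head(p_cyber, p_cp, labels)
fused = learnable_fusion(p_cyber, p_cp, w)
print(fixed_fusion(p_cyber, p_cp), fused)
```

A trained head can shift weight toward whichever modality is more reliable, whereas a fixed α treats both streams equally regardless of their individual quality.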

As observed in Figure 4A, a temporary decrease in accuracy occurs around communication round 8. This behavior coincides with the stage at which the federated model begins to incorporate a higher degree of inter-client heterogeneity, particularly from the cyber-physical modality. Compared to cyber-only inputs, cyber-physical data exhibit higher variance and noisier feature distributions, which initially challenge the global model’s consistency.

At the same time, the proposed ReGCA aggregation mechanism actively down-weights client updates that demonstrate lower semantic and predictive consistency with the trusted reference set. This selective filtering suppresses inconsistent or potentially malicious updates, which may temporarily reduce validation accuracy. In subsequent communication rounds, as unreliable updates are progressively filtered out and the global model parameters stabilize, the aggregation process converges toward a more robust solution. Consequently, accuracy recovers and continues to improve.

This behavior highlights the robustness-driven nature of ReGCA, where short-term accuracy fluctuations are an expected and acceptable trade-off for long-term stability and resilience under Byzantine participation.

5. Conclusions

This paper introduced CM-BRF-ViT, a Byzantine-robust and cross-modal federated learning framework for UAV intrusion detection. The proposed approach integrates GAF-based visual encoding, Vision Transformer-driven global context modeling, and a novel ReGCA aggregation strategy designed to ensure privacy preservation, resilience to adversarial client behavior, and reliable model convergence under heterogeneous data conditions.

Comprehensive experiments demonstrate that CM-BRF-ViT achieves 97.1% accuracy on the UAVIDS-2025 cyber modality and 78.5% on cyber-physical data, and maintains 89.6% accuracy under 40% Byzantine clients, significantly outperforming conventional federated baselines such as FedAvg, Trimmed Mean, and BDRL. Furthermore, the learnable cross-modal fusion mechanism yields a near-perfect AUC of 0.993, underscoring the effectiveness of jointly modeling complementary intrusion signals across modalities. While this work focuses on algorithmic robustness and cross-modal federated learning under Byzantine attacks, full system-level validation in real UAV networks is identified as future work.

Despite these strong results, the reliance on fixed datasets and offline batch-based evaluation may limit real-time adaptivity in dynamic UAV environments. Future research will focus on extending the framework to support online learning, concept-drift adaptation, and scalable UAV swarm deployments. Promising directions include lightweight transformer variants for resource-constrained aerial platforms, the incorporation of differential privacy mechanisms to enhance confidentiality, and the evaluation of the framework in large-scale, latency-sensitive UAV networks operating under realistic communication constraints.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are publicly available online: https://www.kaggle.com/datasets/qinglizeng1997/uavids-2025/data (accessed on 7 November 2025) and https://github.com/uamughal/UAVs-Dataset-Under-Normal-and-Cyberattacks/blob/main/Dataset_T-ITS.csv (accessed on 3 November 2025).

Acknowledgments

This research utilized the UAVIDS-2025 dataset and the CyberPhysical dataset for evaluation purposes. We acknowledge the publicly available benchmark datasets that contributed to the advancement of UAV network security research.

Conflicts of Interest

The author declares no conflicts of interest.

References

1. Shakhatreh, H.; Khreishah, A.; Alsarhan, A.; Khalil, I.; Yousef, N.; Shakhatreh, M. Unmanned aerial vehicles: A survey on civil applications and key research challenges. IEEE Access 2019, 7, 48572–48634.
2. Zeng, Q.; Bashir, A.; Nait-Abdesselam, F. UAVIDS-2025: A Benchmark Dataset for Intrusion Detection in UAV Networks Using Machine Learning Techniques. In Proceedings of the IEEE Conference on Communications and Network Security (CNS), Avignon, France, 8–11 September 2025; IEEE: Paris, France, 2025.
3. Choudhary, G.; Sharma, V.; You, I. Intrusion detection systems for networked UAVs: A survey. In Proceedings of the IEEE International Wireless Communications and Mobile Computing Conference (IWCMC), Limassol, Cyprus, 25–29 June 2018; IEEE: Nicosia, Cyprus, 2018; pp. 560–565.
4. Javaid, A.Y.; Sun, W.; Devabhaktuni, V.K.; Alam, M. Cyber Security Threat Analysis and Modeling of an Unmanned Aerial Vehicle System. In Proceedings of the 2012 IEEE Conference on Technologies for Homeland Security (HST), Waltham, MA, USA, 13–15 November 2012; pp. 585–590.
5. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; IEEE: New York, NY, USA, 2017; pp. 1273–1282.
6. Blanchard, P.; El Mhamdi, E.M.; Guerraoui, R.; Stainer, J. Machine learning with adversaries: Byzantine tolerant gradient descent. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 119–129.
7. Yin, D.; Chen, Y.; Kannan, R.; Bartlett, P.L. Byzantine-robust distributed learning: Towards optimal statistical rates. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 5650–5659.
8. Fang, M.; Cao, X.; Jia, J.; Gong, N. Local model poisoning attacks to Byzantine-robust federated learning. In Proceedings of the 29th USENIX Security Symposium, Boston, MA, USA, 12–14 August 2020; pp. 1605–1622.
9. Xie, C.; Koyejo, O.; Gupta, I. Generalized Byzantine-Tolerant SGD. arXiv 2018, arXiv:1802.10116.
10. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations, Online, 3–7 May 2021.
11. Wang, Z.; Oates, T. Imaging time-series to improve classification and imputation. In Proceedings of the 24th International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July 2015; pp. 3939–3945.
12. Ahmad, W.; Almaiah, M.A.; Ali, A.; Al-Shareeda, M.A. Deep learning-based network intrusion detection for unmanned aerial vehicle (UAV). In Proceedings of the 7th World Conference on Computing and Communication Technologies (WCCCT), Chengdu, China, 12–14 April 2024; pp. 31–36.
13. Zhao, F.; Zhang, C.; Geng, B. Deep multimodal data fusion. ACM Comput. Surv. 2024, 56, 216.
14. Whelan, J.; Almehmadi, A.; El-Khatib, K. Artificial intelligence for intrusion detection systems in unmanned aerial vehicles. Comput. Electr. Eng. 2022, 99, 107784.
15. Medhi, J.; Liu, R.; Wang, Q.; Chen, X. A lightweight and efficient intrusion detection system (IDS) for unmanned aerial vehicles. Neural Comput. Appl. 2025, 37, 15819–15836.
16. Lin, L.; Ge, H.; Zhou, Y.; Shangguan, R. UAV Airborne Network Intrusion Detection Method Based on Improved Stratified Sampling and Ensemble Learning. Drones 2025, 9, 604.
17. Ceviz, O.; Sadioglu, P.; Sen, S.; Vassilakis, V.G. A Novel Federated Learning-Based IDS for Enhancing UAVs Privacy and Security. Internet Things 2025.
18. Banjar, A.; Alshdadi, A.A. Federated learning based dynamic multi-scale attention network for secure drone and base station communication. Alex. Eng. J. 2025, 132, 74–94.
19. Lu, Y.; Yang, T.; Zhao, C.; Chen, W.; Zeng, R. A Swarm Anomaly Detection Model for IoT UAVs Based on Multi-Modal Denoising Autoencoder and Federated Learning. Comput. Ind. Eng. 2024, 196, 110454.
20. Ferrag, M.A.; Friha, O.; Maglaras, L.; Janicke, H.; Shu, L. Federated Deep Learning for Cyber Security in the Internet of Things: Concepts, Applications, and Experimental Analysis. IEEE Access 2021, 9, 138509–138542.
21. Zhao, L.; Jiang, J.; Feng, B.; Wang, Q.; Shen, C.; Li, Q. SEAR: Secure and Efficient Aggregation for Byzantine-Robust Federated Learning. IEEE Trans. Dependable Secur. Comput. 2022, 19, 3329–3342.
22. Kasyap, R.; Tripathy, B.K. Sine: A Similarity-Based Defense Against Model Poisoning in Federated Learning. IEEE Trans. Dependable Secure Comput. 2024, 21, 4481–4494.
23. Manocchio, L.D.; Layeghy, S.; Lo, W.W.; Kulatilleke, G.K.; Sarhan, M.; Portmann, M. Flow Transformer: A transformer framework for flow-based network intrusion detection systems. Expert Syst. Appl. 2024, 241, 122564.
24. Zhou, H.; Zou, H.; Li, W.; Li, D.; Kuang, Y. HiViT-IDS: An Efficient Network Intrusion Detection Method Based on Vision Transformer. Sensors 2025, 25, 1752.
25. Ding, Y.; Wang, H. Packet-sequence image representation with Vision Transformers for network intrusion detection. J. Xidian Univ. 2023, 50, 112–120.
26. Aldossary, M.; Alzamil, I.; Almutairi, J. Enhanced Intrusion Detection in Drone Networks: A Cross-Layer Convolutional Attention Approach for Drone-to-Drone and Drone-to-Base Station Communications. Drones 2025, 9, 46.
27. Terzi, R. Gramian angular field transformation-based intrusion detection. Comput. Sci. 2022, 23, 571–585.
28. Hassler, A.; Mughal, M.; Ismail, M. Cyber–Physical Dataset for Federated Intrusion Detection in UAV Networks. IEEE Dataport. 2023. Available online: https://ieee-dataport.org/documents/cyber-physical-dataset-uavs-under-normal-operations-and-cyber-attacks (accessed on 5 February 2026).
29. Ma, H.; Li, J.; Peng, Y.; Zhou, Z. Multivariate time-series anomaly detection via Gramian Angular Field encoding. Pattern Recognit. Lett. 2023, 170, 86–94.
30. Hassan, M.; Mahmoud, Q.H. Time-series-to-image encoding for intrusion detection in vehicular and WiFi networks. IEEE Access 2023, 11, 99122–99136.
31. Guerraoui, R.; Gupta, N.; Pinot, R. Byzantine Machine Learning: A Primer. ACM Comput. Surv. 2024, 56, 169.
32. Kritharakis, E.; Makris, A.; Jakovetic, D.; Tserpes, K. FedGreed: A Byzantine-Robust Loss-Based Aggregation Method for Federated Learning. arXiv 2025, arXiv:2508.18060.

Figure 1. Architecture of the proposed CM-BRF-ViT framework.

Figure 2. UAVIDS-2025 experimental results.

Figure 3. UAVIDS-2025 attack probability analysis. The first panel shows the attack probability distribution, illustrating clear bimodal separation between normal and attack samples. The dashed vertical red line indicates the classification decision threshold (0.5). The second panel presents attack probability distributions across Blackhole, Flooding, Normal Traffic, Sybil, and Wormhole categories, with the dashed horizontal red line denoting the decision threshold. The third panel displays mean class probabilities with variance indicators. The final panel shows the binary attack detection ROC curve (AUC = 0.999), where the dashed diagonal line represents the random classifier baseline. Colors are used to distinguish normal and attack samples and to differentiate individual attack categories.

Figure 4. Cyber-Physical dataset federated learning results: (A) Training progress showing validation and test accuracy convergence over 10 communication rounds, (B) ReGCA client filtering demonstrating Byzantine-robust aggregation with all clients classified as reliable, (C) final performance achieving 77.9% validation and 78.5% test accuracy.

Figure 5. Cyber-physical binary classification confusion matrices: (Left) Raw counts showing 2008 true negatives, 50 false positives, 544 false negatives, and 5469 true positives, (Right) normalized percentages demonstrating 97.57% benign classification accuracy and 90.95% attack detection rate.
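The normalized percentages in Figure 5 follow from the raw counts. The short check below recomputes them; the variable names are illustrative and the counts are those stated in the caption.

```python
# Raw counts from the Figure 5 caption (binary cyber-physical classification).
tn, fp, fn, tp = 2008, 50, 544, 5469

benign_rate = tn / (tn + fp)   # share of benign samples classified as benign
attack_rate = tp / (tp + fn)   # share of attack samples detected (recall)

print(f"{100 * benign_rate:.2f}% benign, {100 * attack_rate:.2f}% attack")
# prints: 97.57% benign, 90.95% attack
```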

Figure 6. Cyber-physical attack probability analysis. The left panel presents the distribution of predicted attack probabilities for normal and attack samples. The dashed vertical line indicates the decision threshold (0.5) used for classification. The right panel shows the attack probability distribution across DoS attack, FDI, replay, benign, and Evil Twin categories using boxplots. Colors are used to distinguish normal and attack samples as well as individual class categories.

Figure 7. Cyber-physical attack probability analysis. The left panel presents the mean class probabilities with confidence intervals across DoS attack, FDI, replay, benign, and evil-twin categories. The right panel shows the binary attack detection ROC curve, achieving an AUC of 0.975.

Figure 8. Cross-modal fusion analysis. The first panel presents the cross-modal attack probability space, illustrating complementary detection behavior between UAV and CyberPhysical modalities. The second panel shows the ROC curve comparison, including UAV only (AUC = 0.914), CP only (AUC = 0.874), fixed fusion at α = 0.5 (AUC = 0.996), and learnable fusion (AUC = 0.993), where the dashed diagonal line represents the random classifier baseline. The third panel compares fusion methods in terms of AUC. The fourth panel presents the precision–recall curve. The fifth panel shows the learnable fusion output distribution with the decision threshold. The final panel displays the fused classifier confusion matrix.

Figure 9. Byzantine robustness and performance summary of the proposed framework. The upper panel presents the ablation study, illustrating the contribution of individual components to overall model performance in terms of accuracy, F1-score, and AUC. The lower-left panel compares Byzantine robustness across aggregation strategies, showing that ReGCA maintains 89.6% accuracy at a 40% malicious client ratio, while FedAvg experiences significant degradation. The lower-right panel analyzes performance degradation under increasing Byzantine ratios, highlighting the robustness advantage of the proposed method.

Figure 10. CM-BRF-ViT: cross-modal Byzantine-robust federated vision transformer comprehensive experimental results. (A) Federated learning accuracy over communication rounds for UAVIDS-2025 and CyberPhysical datasets. (B) ROC comparison of modality-specific and fused models. (C,D) Dataset-specific confusion matrices. (E) Cross-modal fused confusion matrix. (F) Attack probability distributions with decision threshold. The dashed vertical line denotes the decision threshold for attack classification. (G) Byzantine robustness analysis comparing FedAvg and ReGCA; the shaded region indicates the performance margin under increasing malicious client ratios. (H) Overall performance summary including accuracy, fused AUC, and robustness at 40% Byzantine ratio.

Table 1. Performance comparison of intrusion detection methods across UAVIDS-2025 accuracy, Cyber-Physical accuracy, fused AUC, and Byzantine robustness at 40% malicious client ratio.

Method	UAVIDS Acc. (%)	CP Acc. (%)	Fused AUC	Byz.@40% (%)
FedAvg + MLP	88.3	71.2	0.872	38.4
FedAvg + ViT	92.1	74.8	0.912	45.2
Trimmed Mean + ViT	91.8	73.5	0.905	72.4
CM-BRF-ViT (Ours)	97.1	78.5	0.993	89.6

Table 2. Byzantine robustness comparison at varying malicious client ratios. ReGCA maintains superior performance even at 40% adversarial participation.

Byzantine Ratio	FedAvg Acc. (%)	Trimmed Mean Acc. (%)	ReGCA (Ours) Acc. (%)
0%	92.1	91.8	97.1
10%	85.4	89.6	95.2
20%	78.5	88.2	94.5
40%	45.2	72.4	89.6

Table 3. Representative UAV and FL-based intrusion detection methods (2020–2025).

Year	Study	Scenario/Dataset	Methodology	Key Characteristics/Limitations
2022	Whelan et al. [14]	Multiple UAV datasets (survey)	Taxonomy and review of AI-based UAV IDS	Centralized focus; limited privacy and robustness considerations
2025	Medhi et al. [15]	UAV-IDS 2020	Distilled and pruned CNN-based IDS (UAV-DiPNID)	Lightweight and efficient; no federated learning
2025	Lin et al. [16]	Imbalanced airborne UAV datasets	ISSEL: stratified sampling + ensemble learning	Addresses class imbalance; centralized setting; no federated or Byzantine robustness
2025	Ceviz et al. [17]	FANETs (Flying Ad Hoc Networks)	FL-based IDS (FL-IDS)	Privacy-preserving; relies on FedAvg
2025	Banjar et al. [18]	Drone communication networks	FL-DMAN: federated multi-scale attention network	Improves real-time detection; no Byzantine defense
2024	Lu et al. [19]	IoT-enabled UAV swarms	Multi-modal autoencoder + federated learning	Anomaly-focused; no ViT or cross-modal fusion
2021	Ferrag et al. [20]	Distributed IoT/UAV (survey)	Survey of FL-enabled IDS	Highlights the lack of Byzantine robustness
2022	Zhao et al. [21]	General FL	Secure aggregation via TEE (Intel SGX) with sampling-based Byzantine detection	Aggregation-level privacy only; no cross-modal or UAV-specific design
2024	Kasyap & Tripathy [22] (Sine)	General FL	Model poisoning attack framework exploiting cosine similarity	Attack-only; does not propose a defense; not UAV-specific
2025	Proposed CM-BRF-ViT (This Work)	UAVIDS-2025 + Cyber-Physical	GAF + ViT + Federated Learning + ReGCA + Cross-Modal Fusion	Privacy-preserving, cross-modal, and Byzantine-robust UAV IDS

Table 4. Transformer-based, GAF-based, and Byzantine-robust FL-related methods (2020–2025).

Year	Study	Domain	Core Idea	Relation/Gap w.r.t. CM-BRF-ViT
2020	Fang et al. [8]	Robust FL	Analysis of poisoning attacks	Shows the vulnerability of classic defenses
2024	Manocchio et al. [23]	Network IDS	FlowTransformer (transformer-based NIDS)	Centralized; no UAV or FL setting
2025	Zhou et al. [24]	IoT/IIoT IDS	HiViT-IDS: traffic-to-image + ViT	Single-modality; no federated learning
2023	Ding and Wang [25]	Network IDS	Packet-sequence image + ViT	Centralized; no UAV focus
2025	Aldossary et al. [26]	Drone IDS	Multi-scale CNN + attention (CLCAN)	Centralized; no FL or Byzantine defense
2022	Terzi [27]	Network IDS	GAF encoding + CNN	No ViT, no FL, single modality
2023	Ma et al. [29]	Sensor anomaly detection	GAF-inspired image encoding	Not applied to intrusion detection in UAVs
2023	Hassan and Mahmoud [30]	Vehicular/WiFi IDS	Time-series-to-image IDS	Centralized deployment
2024	Guerraoui et al. [31]	Federated learning (survey)	Byzantine-tolerant ML survey	Parameter-space focus
2025	Kritharakis et al. (FedGreed) [32]	Robust FL	Robust FL using side information	Not applied to UAV IDS
	Comparison with Proposed CM-BRF-ViT	UAV IDS + FL + ViT	GAF → ViT encoding + ReGCA semantic-level Byzantine defense + learnable cross-modal fusion	First to unify cross-modal ViT learning, GAF representation, federated training, and prediction/feature consistency aggregation.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Batur Şahin, C. Securing UAV Swarms with Vision Transformers: A Byzantine-Robust Federated Learning Framework for Cross-Modal Intrusion Detection. Drones 2026, 10, 125. https://doi.org/10.3390/drones10020125
