Article

UAVThreatBench: A UAV Cybersecurity Risk Assessment Dataset and Empirical Benchmarking of LLMs for Threat Identification

Innotec GmbH–TÜV Austria Group, Hornbergstrasse 45, 70794 Filderstadt, Germany
Drones 2025, 9(9), 657; https://doi.org/10.3390/drones9090657
Submission received: 31 July 2025 / Revised: 13 September 2025 / Accepted: 15 September 2025 / Published: 18 September 2025

Abstract

UAVThreatBench introduces the first structured benchmark for evaluating large language models in cybersecurity threat identification for unmanned aerial vehicles operating within industrial indoor settings, aligned with the European Radio Equipment Directive. The benchmark consists of 924 expert-curated industrial scenarios, each annotated with five cybersecurity threats, yielding a total of 4620 threats mapped to directive articles on network and device integrity, personal data and privacy protection, and prevention of fraud and economic harm. Seven state-of-the-art models from the OpenAI GPT family and the LLaMA family were systematically assessed on a representative subset of 100 scenarios from the UAVThreatBench dataset. The evaluation applied a fuzzy matching threshold of 70 to compare model-generated threats against expert-defined ground truth. The strongest model identified nearly nine out of ten threats correctly, with close to half of the scenarios achieving perfect alignment, while other models achieved lower but still substantial alignment. Semantic error analysis revealed systematic weaknesses, particularly in identifying availability-related threats, backend-layer vulnerabilities, and clause-level regulatory mappings. UAVThreatBench therefore establishes a reproducible foundation for regulatory-compliant cybersecurity threat identification in safety-critical unmanned aerial vehicle environments. The complete benchmark dataset and evaluation results are openly released under the MIT license through a dedicated online repository.

1. Introduction

Unmanned Aerial Vehicles (UAVs) are increasingly deployed in industrial and operational technology (OT) environments for tasks such as logistics, inspection, monitoring, and automation. These systems are frequently conceptualized as flying IoT devices [1], offering airborne sensing, connectivity, and autonomy across critical infrastructure domains. Equipped with sensors, transceivers, and embedded computing, UAVs act as mobile cyber–physical platforms with direct interfaces to industrial systems. Consequently, UAVs are becoming prominent targets for cyberattacks that impact confidentiality, integrity, and availability.
The European regulatory framework for UAV cybersecurity is multifaceted. At its core, the Radio Equipment Directive (RED) [2] specifically addresses radio-connected devices such as UAVs and is supported by harmonized standards, notably the EN 18031 series [3,4,5]. Additionally, standards such as ISO 21384-3 [6] provide guidelines for secure UAS operations, while IEC 62443 [7] offers a framework for securing industrial control systems, applicable to UAVs in OT environments. Furthermore, the recent Cyber Resilience Act (CRA) [8] introduces mandatory cybersecurity requirements for digital products, including connected UAVs, while the NIS2 Directive [9] expands cybersecurity obligations for operators of essential and critical infrastructure across the European Union. These regulations collectively aim to enhance the cybersecurity resilience of UAVs across their lifecycle and operational contexts.
UAV deployments in OT environments introduce risks of unauthorized access, data breaches, and operational disruption. To mitigate such risks, the European Union has reinforced cybersecurity regulations through the Radio Equipment Directive (RED) 2014/53/EU [2], operationalized via Delegated Regulation (EU) 2022/30 [10]. This regulation mandates cybersecurity requirements for all radio-connected equipment, including UAVs, under the following categories:
  • Article 3.3(d): protection of network and device integrity;
  • Article 3.3(e): protection of personal data and privacy;
  • Article 3.3(f): protection against fraud and economic harm.
The harmonized standard ETSI EN 18031-1:2024 [3] formalizes threat modeling and mitigation strategies necessary to demonstrate RED compliance. Risk assessment thus becomes a central requirement in UAV cybersecurity, beginning with the identification of operational scenarios, followed by the enumeration of plausible threats. These threats must then be mapped to the appropriate RED Article(s), after which technical and organizational measures can be defined according to the harmonized standard. The structured workflow is illustrated in Figure 1.
However, the execution of this process remains largely manual and expert-driven. It requires substantial domain knowledge to accurately interpret scenario semantics, identify threat vectors, and ensure regulatory alignment. In this context, Large Language Models (LLMs) have demonstrated growing capabilities in structured information extraction and regulatory reasoning across Natural Language Processing (NLP) tasks. Recent studies suggest they can parse semi-structured inputs, extract cybersecurity-relevant entities, and generate plausible threat hypotheses.
Studies in cyber threat intelligence (CTI) domains report that LLMs often fail to perform reliably on real-world security reports, exhibiting inconsistent behavior and overconfident predictions—even when fine-tuned or prompted with few-shot examples [11]. Complementary research in structured reasoning has shown that LLM outputs are highly sensitive to minor input perturbations, raising concerns about their robustness in high-stakes contexts [12,13]. These findings suggest that while LLMs are promising for knowledge extraction and security reasoning, their performance must be rigorously validated in structured, safety-critical tasks.
At the same time, regulatory frameworks such as RED and EN 18031 impose structured threat identification requirements that are currently addressed through manual expert analysis. Yet the literature on UAV cybersecurity has largely remained focused on general threat taxonomies [14], network-level attacks [15], and architectural vulnerabilities [16]. While some recent studies propose AI-based methods, including machine learning and reinforcement learning, to bolster UAV cybersecurity [17,18,19], these approaches often lack regulatory alignment and do not generalize to natural-language-based threat reasoning. Moreover, emerging work on LLMs in cybersecurity highlights their potential, but empirical studies applying them to RED-aligned UAV–OT scenarios remain absent. This reveals a critical gap: despite the maturity of regulations and the rapid evolution of foundation models, no benchmark currently evaluates LLMs’ ability to perform threat identification under real-world regulatory constraints. Accordingly, this study operationalizes the above gap into three research questions that formalize RED-aligned UAV–OT threat identification and evaluation.

1.1. Research Motivation

This section formalizes the problem into targeted research questions and specifies how subsequent sections address each one. Despite the rapid progress of foundation models, their application in structured cybersecurity risk assessment, especially within regulated domains such as UAV RED compliance, remains underexplored. While LLMs offer promising capabilities for scenario interpretation, threat inference, and regulatory alignment, their reliability in safety-critical contexts is not well understood.
This motivates a need to rigorously assess whether LLMs can support structured UAV cybersecurity tasks aligned with RED. Accordingly, the following research questions are formulated:
  • RQ1: Can LLMs generate cybersecurity threats that align with RED Articles (d), (e), and (f)?
  • RQ2: How do different families of LLMs (e.g., GPT-4, LLaMA) perform in RED-compliant threat identification across UAV–OT interaction scenarios?
  • RQ3: What types of systematic errors or omissions occur in LLM-based threat generation for UAV–OT scenarios, and how do they affect RED article alignment?
To address these questions, there is a need for structured, domain-specific benchmarks that can differentiate genuine reasoning capabilities from pattern-matching artifacts.

1.2. Contributions

To address the above research questions, we present UAVThreatBench, a structured framework for evaluating LLMs in the context of RED-aligned cybersecurity threat identification for UAV deployments in OT environments. The contributions in this study are twofold:
1. Curated Benchmark Dataset: A novel dataset comprising 924 plausible UAV–OT cyber-physical scenarios was constructed via constrained combinatorial synthesis. Each scenario is defined by structured parameters (e.g., UAV role, communication protocol, cybersecurity origin) and subsequently annotated with expert-curated threats aligned to RED Articles (d), (e), and (f), based on ETSI EN 18031-1 [3].
2. Empirical Evaluation of LLMs: A systematic evaluation was conducted on seven leading LLMs, including OpenAI’s GPT-4 variants and open-source LLaMA-3 models, assessing their ability to perform structured threat generation. The analysis includes quantitative metrics (e.g., RED Article classification accuracy, match rate) and qualitative insights (e.g., semantic plausibility).
Together, these contributions establish the first RED-compliant benchmark for UAV cybersecurity threat identification and provide empirical evidence on the feasibility and limitations of applying LLMs as a tool for cybersecurity threat assessment in safety-critical operational technology contexts. This open-access benchmark further provides a standardized foundation for structured cybersecurity risk assessment in UAV-OT domains, supporting both reproducibility and regulatory alignment.
UAVThreatBench is a structured benchmark dataset designed to evaluate the ability of LLMs to identify cybersecurity threats in UAV–OT interaction scenarios. Each of the 924 benchmark cases includes a machine-generated scenario describing a plausible UAV–OT operational context (e.g., industrial inventory flight, telemetry upload via 5G), annotated with metadata fields such as data flow, wireless channel, interface type, and attacker origin. For each scenario, five expert-curated threats are provided, each labeled with one of the RED Article categories (3.3(d)–(f)), forming the reference ground truth. LLMs are prompted with the scenario and tasked with predicting two threats. Their predictions are evaluated against this ground truth to compute perfect match, partial match, and RED category alignment. This benchmark thus enables standardized, RED-aligned evaluation of LLMs in safety-critical, real-world cybersecurity contexts. Rather than reproducing regulatory text in full, this study distills the core requirements into a structured benchmark (UAVThreatBench), providing a transparent and usable framework for evaluating LLMs in UAV–OT cybersecurity under RED compliance.
Further, it is important to clarify that UAVThreatBench focuses on threats that fall within the scope of the EU Radio Equipment Directive (Articles 3.3(d)–(f)), namely device and network integrity, personal data and privacy, and protection against fraud or economic harm. These categories primarily correspond to communication-layer, data-handling, and backend logic threats. Other levels of threats, such as physical drone capture, outdoor GNSS interference, or organizational vulnerabilities, are beyond the current scope. This deliberate focus ensures alignment with regulatory obligations under RED, which emphasize communication and software-driven risks for radio-connected UAVs, rather than physical or organizational threat domains.
The remainder of this paper is organized as follows. Section 2 positions the work within the related literature. Section 3 defines the benchmark ground truth (supporting RQ1 and RQ3). Section 4 details the evaluation setup and controls (enabling RQ2). Section 5 reports findings structured around model-family performance (RQ2), RED article coverage (RQ1), and failure modes (RQ3). Section 6 concludes with direct answers to RQ1–RQ3 and an outlook.

2. Related Work

This section provides an overview of existing research on the cybersecurity of UAVs in recent years. While a significant body of work has addressed UAV security from various angles, including comprehensive surveys on vulnerabilities, threats, countermeasures, and communication network security, a notable gap remains in the domain of automated, scenario-driven threat identification, particularly when aligned with real-world regulatory frameworks. This review categorizes prior efforts, highlighting their contributions and limitations, and sets the stage for the proposed work, which specifically addresses the need for a standardized benchmark and dataset for evaluating LLMs in regulatory-aligned UAV cybersecurity threat identification.

2.1. Surveys on Cybersecurity of UAVs

Several comprehensive surveys have addressed the cybersecurity of UAVs from various perspectives.
Mekdad et al. [14] provide a broad review of UAV security and privacy issues, systematically classifying vulnerabilities, threats, and countermeasures across four levels (hardware, software, communication, and sensor). Their survey dissects common attacks and mitigation techniques at each level, highlighting the many weaknesses in commercial drones and the corresponding defenses. However, while thorough in scope, ref. [14] does not explore the use of advanced AI techniques for automated threat identification in UAV systems.
Tsao et al. in [15] focus on UAV communication networks and the concept of flying ad hoc networks (FANETs). They categorize security threats in FANETs and the Internet of Drones (IoD) infrastructure, examining attacks on links between UAVs, ground stations, and user devices. Their work classifies major threats (e.g., jamming, eavesdropping, spoofing) and defense mechanisms according to network protocol layers, and surveys routing protocol vulnerabilities and countermeasures. This network-centric survey sheds light on UAV-specific network vulnerabilities and privacy issues, but it is largely confined to communication-layer threats and does not consider leveraging language models or AI for threat reasoning.
Bai et al. in [16] offer a similarly comprehensive review of UAS (Unmanned Aerial Systems) cybersecurity, with an emphasis on system architecture and full lifecycle security considerations. They analyze UAV architectures and communication mechanisms, compare UAV systems with industrial control systems to identify unique security challenges, and summarize known vulnerabilities and threats in both cyber and physical domains. Like the above surveys, their work is foundational and identifies many threat vectors, yet it does not involve the creation of a dedicated dataset or the evaluation of Artificial Intelligence (AI) models on UAV-specific threat identification tasks. Building on these broad surveys, some recent works examine how AI methods can bolster UAV security.
Zolfaghari et al. in [17] discuss the security challenges faced by drones, including signal jamming, GPS spoofing, radio link eavesdropping, and other cyber–physical attacks, and outline the great promise of AI in addressing these challenges. They highlight various machine learning techniques (e.g., anomaly detection, computer vision) that could enhance UAV cybersecurity and resilience. However, the survey in [17] remains a high-level overview; it underscores the potential of AI but does not provide empirical analysis of AI systems on regulatory-aligned UAV threat tasks. In a more specialized vein, Sarıkaya and Bahtiyar in [18] survey deep reinforcement learning (RL) applications for UAV security. They review how RL algorithms can be used to improve UAV threat response and autonomous defense, emphasizing the need for learning-based security solutions for drones. This indicates growing interest in advanced AI (like multi-agent RL) to safeguard UAV networks, yet their focus is on control and decision-making aspects rather than textual threat analysis.

2.2. AI-Driven Techniques and Frameworks for UAV Threat Mitigation

Several other studies concentrate on AI-driven techniques (Pre-LLM era) and frameworks for UAV threat mitigation.
Tlili et al. in [19] conduct a comprehensive survey of AI techniques applied to UAV security, covering machine learning and deep learning methods for intrusion detection, communication security, and operational anomaly detection. They provide a taxonomy of AI-based defense mechanisms and identify future research directions in integrating AI into UAV security infrastructure. Likewise, Adil et al. in [20] examine cybersecurity threats in UAV-assisted IoT applications and propose AI-enabled solution approaches. They outline open challenges for secure UAV-IoT integration, such as the need for real-time threat detection algorithms and robust data sharing protocols. While these works recognize AI’s value, they generally discuss algorithmic approaches or conceptual frameworks rather than providing concrete benchmark evaluations.
Research has also progressed toward practical threat detection frameworks tailored to UAV systems. Alturki et al. in [21] propose an intelligent multi-layer framework for UAV threat detection that combines data from satellites, ground IoT sensors, and the UAVs themselves. Their system employs machine learning to detect anomalies across the integrated cyber–physical environment, illustrating a domain-specific application of AI for UAV security (in scenarios like smart grid inspections with drones). Similarly, Miao and Pan in [22] develop a game-theoretic risk assessment model for UAV cybersecurity. They construct attack–defense graphs for representative UAV scenarios and apply a Bayesian–Nash equilibrium analysis to evaluate risk under adversarial conditions. This approach quantitatively assesses how various attack strategies can be countered by defense strategies, providing insights into UAV system vulnerabilities. However, such analytical frameworks and simulation-based studies do not leverage large-scale language understanding; they are focused on specific attack tree models or hardware-in-the-loop simulations, rather than natural-language threat inference.

2.3. Emergence of LLMs in UAV Cybersecurity and Threat Detection

Finally, the emergence of LLMs has started to draw attention in the cybersecurity domain, including for UAV-related security.
Yang et al. in [23] present an extensive overview of UAV safety and security issues and specifically discuss integrating LLMs into UAV security workflows. They review traditional AI (machine learning and deep learning) applications for drone safety (such as computer vision for collision avoidance and intrusion detection systems for networks) and highlight their limitations (e.g., brittleness to adversarial examples, limited adaptability). Yang et al. then identify opportunities where cutting-edge LLMs could contribute, envisioning that the reasoning and knowledge capabilities of LLMs might assist in tasks like scenario-based threat analysis or real-time decision support for UAV operations.
The work in [24] was among the first to apply LLMs for risk assessment in the context of machinery functional safety, a domain closely related to cybersecurity. UAVThreatBench complements this line of research by addressing UAV–OT scenarios with explicit alignment to the RED, extending the evaluation toward UAV-specific threat elicitation. Further, Iyenghar et al. [25] explored Chain-of-Thought (CoT) prompting for LLM-based OT cybersecurity risk assessment using the MITRE EMB3D framework, demonstrating promising alignment with expert evaluations under IEC 62443. UAVThreatBench complements this by targeting UAV–OT scenarios with RED-compliant threat identification, addressing regulatory alignment at the threat elicitation stage rather than post hoc risk scoring.
Most recently, Iyenghar [26] presented a comprehensive empirical evaluation of reasoning-capable LLMs in machinery functional safety, highlighting the limits of anthropomorphized reasoning for deterministic safety-critical classification. The dataset used in that study was first introduced in [27], where a structured benchmark for data-driven risk assessment in industrial functional safety was developed. Although situated in the functional safety domain, these works are closely related in their methodological alignment and reinforce the importance of structured, standards-driven evaluation datasets.
Even prior to the application of LLMs, native and custom-developed chatbot assistants were employed for functional safety risk assessment. Early work introduced AI-based assistants for determining the required Performance Level of safety functions [28] and later demonstrated chatbot-driven support for reducing design-related risks in machinery development [29]. While these approaches showed potential in supporting engineering decision-making, they were constrained by the absence of structured datasets for reproducible benchmarking. This limitation underscores a key motivation behind the development of benchmark resources such as UAVThreatBench: whether in safety or security domains, reproducibility and standards-based evaluation datasets are essential for systematic progress.
In the broader context, Chen et al. in [30] conduct a survey of LLM applications in cyber threat detection. They find that LLMs have been explored for tasks such as analyzing threat intelligence reports, classifying malware or phishing attempts, and predicting cyber-attacks, demonstrating LLMs’ strong capability to understand and generate security-relevant information. Notably, Chen et al. emphasize that prior to their work there was no systematic review of LLMs in threat detection, indicating how new this research area is. Their survey categorizes recent efforts and concludes that LLMs, with their ability to interpret complex contexts, are promising for various cybersecurity applications. However, none of the above studies specifically target regulatory-aligned UAV threat identification using LLMs.
Recent efforts such as SECURE [31] and CyberSecEval 2 [32] further highlight the growing need for domain-specific LLM evaluation in cybersecurity. SECURE focuses on industrial control system (ICS) security, providing six curated datasets for tasks such as vulnerability comprehension, and reasoning over security advisories. While it shares UAVThreatBench’s emphasis on expert curation and safety-critical infrastructure, it does not address UAV–OT deployments or regulatory mapping under the RED. In contrast, CyberSecEval 2 benchmarks LLMs from the perspective of their own security posture, testing resilience to prompt injection, exploit synthesis, and interpreter abuse. This complements UAVThreatBench by evaluating model vulnerabilities rather than their capability to act as security analysts. Taken together, these benchmarks underscore the need for specialized evaluations: UAVThreatBench is unique in aligning threat identification tasks with RED Articles 3.3(d–f) and UAV–OT scenarios, while future work could incorporate SECURE-style factual QA and CyberSecEval-style robustness testing to extend coverage.
In summary, prior work has laid a solid foundation in identifying UAV cyber threats and even suggested AI-based defenses, but a gap remains in automated, scenario-driven threat reasoning aligned with real-world regulations. No existing study provides a structured dataset of UAV operational scenarios annotated with cybersecurity threats, nor do they empirically benchmark state-of-the-art LLMs on understanding and identifying those threats. The related work so far either surveys general UAV security challenges or proposes AI methods in isolation.
This work addresses this gap by introducing UAVThreatBench, the first dataset and benchmark focused on evaluating LLMs for UAV cybersecurity threat identification under regulatory criteria (e.g., EU Radio Equipment Directive compliance). In doing so, we build upon the surveyed research but go further by providing an open, standardized way to assess how well modern LLMs can infer security threats from UAV scenario descriptions, a task that the previous literature has not yet examined. This contribution of a regulation-aligned threat identification benchmark is unique and responds to the open challenge of applying advanced AI (LLMs) to enhance cybersecurity risk assessment in UAV systems, as identified but not realized in the existing related work.

3. UAVThreatBench Dataset

UAVThreatBench is a structured dataset primarily developed to evaluate LLMs in the context of cybersecurity threat identification within industrial UAV and OT interaction scenarios. It synthesizes realistic cyber-physical use cases reflecting UAV deployments across industrial environments, including indoor warehouses and production zones, where drones perform tasks such as inventory scanning and automated parts delivery over wireless networks. These operational roles were selected as they represent some of the most prevalent and security-relevant applications of UAVs in OT-integrated industrial settings. The dataset creation pipeline consists of two clearly delineated stages: (i) automated generation of scenarios based on a predefined combinatorial template, and (ii) expert-driven manual curation of threats aligned with regulatory frameworks.
To contextualize the structured scenario generation, Figure 2 illustrates the UAV–OT system architecture underlying UAVThreatBench. A UAV acts as an edge device at the OT perimeter, exchanging control, telemetry, and process data with backend systems via wireless or dock-based communication. The architecture is organized into three layers: (i) the UAV subsystem with sensors, flight control, telemetry, and firmware management; (ii) the communication layer supporting Wi-Fi, Long-Term Evolution (LTE)/fifth-generation (5G) cellular, proprietary RF, and wired docking/diagnostic interfaces; and (iii) backend OT enterprise systems, including Manufacturing Execution Systems (MESs), Programmable Logic Controllers (PLCs), Automated Guided Vehicle (AGV) controllers, and cloud services. The modeled attack surfaces correspond to Articles 3.3(d)–(f) of the EU RED, covering wireless interfaces, maintenance ports, and backend access control. This abstraction directly aligns with the structured scenario dimensions summarized in Table 1, ensuring consistency between the architectural model and the generated benchmark scenarios.

3.1. Stage I: Scenario Generation

In the first phase, a corpus of plausible UAV–OT cyber-physical interaction scenarios is generated via combinatorial synthesis using an automated script. A structured scenario template is populated by systematically combining values from six key dimensions that reflect realistic operational contexts in indoor industrial environments: (i) UAV operational roles, (ii) interacting OT systems, (iii) communication protocols, (iv) data flow functions, (v) cybersecurity origin classes, and (vi) attack consequences. Each dimension is intentionally selected based on observed deployment configurations in warehouse logistics and production systems, and was informed by security categories and threat classes reflected in standards such as ISO 21384-3 [6], IEC 62443 [7], and ETSI EN 303 645 [33].
Prior research on UAV deployments in logistics and inspection confirms the prevalence of these roles and communication patterns in practice [14,15,17,34,35,36,37]. For example, drones are frequently used for inventory scanning and parts delivery in warehouse and factory environments [36,37], and research on UAV swarms in indoor industrial settings supports the plausibility of communication patterns and data flows represented in the dataset [37]. These example sources validate that the resulting scenarios are both technically feasible and representative of real-world UAV use in OT-integrated industrial contexts. While the present dataset emphasizes UAV use in indoor industrial contexts, reflecting both observed prevalence and the author’s expertise, broader UAV application domains (e.g., outdoor Global Navigation Satellite System (GNSS)-based navigation, critical infrastructure inspection) are noted as potential threats to the validity of the findings (Section 5.6).
To maintain readability and coherence, the following structured dimensions are presented as bullet points, but each directly reflects how UAV–OT operational contexts were parameterized in the dataset synthesis.
  • UAV Roles:
    Inventory Management Drone—used for real-time stocktaking, inventory scanning, and warehouse mapping.
    Automated Parts Delivery Drone—facilitates point-to-point transport of components across production zones.
  • Interacting OT Components:
    Manufacturing Execution System (MES) Server—orchestrates production operations and synchronizes UAV data.
    AGV Fleet Control System—enables coordination between ground-based mobile robots and UAVs.
    Industrial Wi-Fi Access Point—primary wireless gateway for telemetry and real-time data exchange.
    Automated Storage and Retrieval System (AS/RS)—interacts with UAVs for automated materials handling.
    Industrial Control System (ICS) PLC—governs low-level interfaces such as docking stations or firmware uploads.
  • Communication Protocols:
    Wi-Fi Protected Access 2 (WPA2)-Enterprise Wi-Fi—standard secure wireless network for UAV telemetry and control.
    Private LTE/5G Network—offers higher bandwidth and low latency, typically for mission-critical data flows.
    Proprietary RF Link—used in custom deployments where standard protocols are unsuitable.
    Wired Ethernet (via docking station)—employed for high-integrity operations such as firmware upgrades.
    Bluetooth Low Energy (LE)—used for proximity-based diagnostics or field pairing.
  • Data Flow Functions:
    Real-time inventory data upload;
    Flight path updates and control signals;
    Automated firmware updates;
    Diagnostic logs transfer;
    Sensor readings (temperature, pressure);
    Assembly instructions download.
  • Cybersecurity Origins and Attack Vectors: Each scenario is linked to one of five high-level categories that represent common attack surfaces in UAV–OT systems:
    Network Interface (e.g., Unauthorized Wi-Fi Access, Insecure Ethernet Port);
    Communication Link (e.g., Unencrypted LTE/5G Link, GNSS Spoofing);
    Software/Firmware (e.g., Insecure Firmware Update Mechanism, Weak Credentials);
    Physical Access Point (e.g., Unsecured USB Port, Exposed Diagnostic Interfaces);
    Backend System Interaction (e.g., Insecure APIs to MES/WMS, Cloud Service Vulnerabilities).
Each origin category includes a constrained set of 2–3 attack vectors, each mapped to specific consequences such as data exfiltration or mission disruption. These mappings are informed by known vulnerability patterns in OT systems and harmonized regulatory frameworks (e.g., RED [2], ETSI EN 303 645 [33]).
  • Potential Consequences: Each attack vector is associated with 2–3 plausible outcomes, grounded in the cybersecurity literature and standards (e.g., ETSI EN 303 645, IEC 62443). Examples include the following:
    Data Exfiltration, Remote Code Execution, Command Injection;
    Mission Disruption, Unauthorized System Control, Inventory Manipulation.
To ensure semantic validity and domain realism, a plausibility mapping is applied to constrain scenario generation to only technically feasible combinations across four key dimensions, namely, UAV roles, OT systems, communication protocols, and data functions. For instance, Firmware Updates are permitted only over Wired Ethernet, Private LTE/5G, or WPA2-Enterprise, and explicitly disallowed over Bluetooth LE. Likewise, specific UAV roles are associated with context-appropriate tasks, for example, the “Inventory Management Drone” is limited to data uploads and sensor reading functions.
The complete mapping logic is summarized in Table 1, which enumerates all valid cross-dimensional combinations. This constraint-driven synthesis ensures that the generated interactions are technically sound and reflective of real-world industrial deployments in warehouse and production environments.
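To make this constraint-driven synthesis concrete, the following minimal Python sketch illustrates the Stage I generation loop over a reduced subset of the six dimensions (three are shown for brevity). The dimension values and plausibility rules are taken from the descriptions above, but the function and field names are illustrative assumptions and do not correspond to the released generation script.

from itertools import product

# Illustrative excerpts of the scenario dimensions (Section 3.1).
UAV_ROLES = ["Inventory Management Drone", "Automated Parts Delivery Drone"]
PROTOCOLS = ["WPA2-Enterprise Wi-Fi", "Private LTE/5G Network",
             "Proprietary RF Link", "Wired Ethernet (docking station)",
             "Bluetooth LE"]
DATA_FUNCTIONS = ["Real-time inventory data upload",
                  "Automated firmware updates",
                  "Sensor readings (temperature, pressure)"]

def is_plausible(role: str, protocol: str, function: str) -> bool:
    # Firmware updates only over high-integrity channels, never Bluetooth LE.
    if function == "Automated firmware updates":
        return protocol in {"Wired Ethernet (docking station)",
                            "Private LTE/5G Network", "WPA2-Enterprise Wi-Fi"}
    # The inventory drone is limited to data uploads and sensor readings.
    if role == "Inventory Management Drone":
        return function in {"Real-time inventory data upload",
                            "Sensor readings (temperature, pressure)"}
    return True

scenarios = []
for i, (role, protocol, function) in enumerate(
        product(UAV_ROLES, PROTOCOLS, DATA_FUNCTIONS)):
    if not is_plausible(role, protocol, function):
        continue  # discard technically infeasible combinations (cf. Table 1)
    scenarios.append({
        "scenario_id": f"CS-OT-UAV-{i:04d}",
        "uav_role": role,
        "communication_protocol": protocol,
        "data_function": function,
        "expected_threats": [],  # left empty; populated by experts in Stage II
    })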
As a result, 924 structured UAV–OT cyber-physical interaction scenarios are generated in Stage I. Each scenario conforms to a predefined JSON schema, including metadata fields for environment context, UAV role, OT component, communication protocol, data function, attack origin, and its associated technical consequences. The "Expected Threats" field is included but deliberately left unpopulated at this stage to allow for subsequent expert annotation.
In Stage II, domain experts manually populate this field with five credible threats mapped to RED Articles (d), (e), or (f). This upper bound of five threats was established to maintain dataset manageability and focus, acknowledging that a scenario might encompass a broader range of potential threats. An illustrative JSON example of a curated, fully annotated scenario entry in the UAVThreatBench dataset is shown in Listing 1. The example demonstrates the structure of each field and shows the mapping of five expert-defined threats to RED Articles 3.3(d)–(f) in the “Expected Threats” field.
Listing 1. Example of a curated UAV–OT cybersecurity scenario with corresponding expected threats and mapping to RED Articles.
[Listing 1 appears as an image in the published article.]
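Because the listing appears only as an image, the following JSON reconstruction, populated with scenario CS-OT-UAV-0027 and its five expert-defined threats as reported in Section 5.2.1, illustrates the entry structure; the field names and the environment/consequence values are illustrative assumptions rather than the verbatim released schema.

{
  "scenario_id": "CS-OT-UAV-0027",
  "environment": "Indoor industrial warehouse",
  "uav_role": "Inventory Management Drone",
  "ot_component": "Manufacturing Execution System (MES) Server",
  "communication_protocol": "Private LTE/5G Network",
  "data_function": "Real-time inventory data upload",
  "attack_origin": "Backend System Interaction (Insecure API to MES/WMS)",
  "consequences": ["Data Exfiltration", "Unauthorized System Control"],
  "expected_threats": [
    {"threat": "Unauthorized access via insecure API", "red_article": "3.3(d)"},
    {"threat": "Interception of real-time inventory data", "red_article": "3.3(e)"},
    {"threat": "Backend MES/WMS manipulation", "red_article": "3.3(f)"},
    {"threat": "Denial-of-service (DoS) on the MES server", "red_article": "3.3(d)"},
    {"threat": "Theft of sensitive data from insecure backend interactions", "red_article": "3.3(e)"}
  ]
}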

3.2. Stage II: Human-in-the-Loop Threat Curation

While Stage I generates structured UAV–OT interaction scenarios with defined metadata fields and plausible cybersecurity origins, the “Expected Threats” field (cf. Listing 1) remains deliberately unpopulated during synthesis. In Stage II, each scenario undergoes expert review to transform this syntactic template into a semantically grounded dataset.
  • For each scenario, five distinct threats are identified. These threats are tailored to the attack vector and operational context of the scenario.
  • Each threat entry consists of a concise textual description of the expected attack consequence (e.g., “replay attacks using captured LTE/5G data”) and a mapping to the corresponding RED Article, namely 3.3(d) (network protection), 3.3(e) (data/privacy protection), or 3.3(f) (fraud prevention). The manual annotation process ensures alignment with harmonized cybersecurity requirements for radio-connected industrial devices [2,33].
The curated Expected Threats field thus serves as the ground truth for benchmarking, enabling evaluation of model-generated threats. Thus, UAVThreatBench fills a critical gap in the benchmarking ecosystem, offering a semantically rich dataset for cybersecurity evaluation in safety-critical, real-world deployments.
The five threats per scenario were selected according to expert criteria: (i) Typicality and Validity: each threat had to be representative of known UAV–OT attack patterns and technically feasible given the scenario context and documented vulnerabilities; (ii) Comprehensiveness: the set of threats collectively covered distinct aspects of the attack surface (confidentiality, integrity, availability, and control); and (iii) Relevance: each threat aligned with real-world attacker goals and was mappable to one of the RED Articles 3.3(d), (e), or (f).
Threats were initially curated by a cybersecurity and OT systems expert from innotec GmbH–TÜV Austria Group [38] (also the author of this work). A second expert reviewed a subset of annotations to identify semantic mismatches or ambiguities. Final entries were consolidated through discussion, ensuring clarity, regulatory alignment, and minimal redundancy.
The curated threats are not intended to imply equal likelihood or severity; rather, each set of five was selected to collectively span distinct aspects of the scenario’s attack surface. Each entry is expressed in a concise one-sentence form to serve as a canonical threat handle for benchmarking, ensuring evaluation possibility across models. In practice, compliance documentation would elaborate these threats into more detailed narratives and evidentiary statements. Thus, the curated set provides a representative but non-exhaustive ground truth for structured benchmarking, not a verbatim template for regulatory submissions, and should be interpreted as a representative baseline rather than a definitive or exhaustive enumeration.
UAVThreatBench delivers a high-fidelity, expert-curated dataset for robust cybersecurity evaluation in industrial UAV-OT environments. Beyond LLM benchmarking, it facilitates comprehensive model evaluation, including threat inference, robustness, and generalization analysis. The dataset also serves as a crucial resource for pretraining specialized cybersecurity LLMs and generating regulatory-aligned test scenarios for compliance.
To ensure transparency, reproducibility, and broad community adoption, the complete UAVThreatBench dataset, comprising annotated ground truth scenarios, LLM-generated outputs, and evaluation logs, is made publicly available under the permissive MIT License via a GitHub repository [39]. This open-access distribution facilitates reuse, extension, and integration into future research on UAV cybersecurity and RED-compliant risk assessment.

4. Experimental Design

This section presents the experimental design, model configurations, and evaluation methodology for benchmarking LLMs in cybersecurity threat identification using the UAVThreatBench dataset.

4.1. Model Selection and Configuration

To evaluate the capabilities of LLMs in structured cybersecurity threat identification for UAV–OT scenarios, we selected seven models across two prominent model families: OpenAI’s GPT series and Meta’s LLaMA-3 instruction-tuned variants. These models reflect a spectrum of architectural scales, instruction tuning strategies, and deployment modalities.
  • OpenAI Models: GPT-3.5 Turbo 16k, GPT-4.1, GPT-4.1 Mini, GPT-4.1 Nano, and GPT-4o [40,41,42]. These represent the latest production-grade models from OpenAI with varying inference costs and capabilities. While GPT-3.5 Turbo 16k serves as a strong low-cost baseline, GPT-4.1 and GPT-4o are state-of-the-art models with high reasoning performance. The GPT-4.1 Mini and GPT-4.1 Nano variants offer reduced latency and lower compute profiles, making them relevant for resource-constrained industrial deployments. All OpenAI models were accessed via the official OpenAI API under a commercial license with controlled temperature and token settings.
  • LLaMA-3 Variants: LLaMA-3.1-SauerkrautLM-70B-Instruct and LLaMA-3.3-70B-Instruct [43,44]. These open-source instruction-tuned variants of Meta’s LLaMA-3 70B model family are adapted for multilingual and domain-specific alignment, optimized for complex task-following capabilities. The SauerkrautLM variant incorporates fine-tuning for longer-form, structured outputs and security-relevant instructions. These models were accessed through the Academic Cloud OpenChat API, supporting reproducible benchmarking outside closed commercial ecosystems.
The rationale behind model selection is threefold:
1. Architectural Coverage: By including models ranging from ∼20B parameters (estimated for the OpenAI mini/nano variants) up to 70B parameters (LLaMA-3), we ensure the evaluation spans both lightweight deployment targets and large-scale reasoning systems.
2. Deployment Diversity: OpenAI models represent high-availability, production-ready cloud APIs, while LLaMA-3 variants represent community-accessible open-source options, enabling deployment flexibility in air-gapped or regulated OT environments.
3. Instruction-Tuning Relevance: All selected models are instruction-tuned and capable of structured output generation, which is critical for consistent JSON-formatted threat responses as required by the benchmark prompt format.
All models were evaluated via standardized chat-completion endpoints, with a fixed temperature of 0.7, a maximum token limit of 400, and uniform response formatting enforced through prompt instructions (see Section 4.2). This consistent configuration isolates the effect of model architecture and instruction-following fidelity, minimizing the influence of decoding hyperparameters. The inclusion of multiple GPT-4 variants (standard, mini, nano, omni) allows assessment of performance scaling within a single architectural family, while LLaMA-3 variants provide a meaningful open-source comparison point. It is worth noting that a temperature of 0.7 introduces moderate stochasticity in decoding, which may account for observed run-to-run variation in LLM benchmarks. Provider documentation confirms that even at temperature = 0.0, outputs remain only “mostly deterministic” due to backend and inference-level effects [45,46,47]. Recent analyses show that 3–5% variability can persist even under such deterministic settings [48], and it is therefore reasonable to expect similar levels of variability in the results reported here. While we fixed temperature at 0.7 in the present evaluation, future work will replicate selected runs with temperature = 0.0 to directly assess the impact of decoding non-determinism on UAVThreatBench results. This configuration isolates model-family and instruction-tuning effects required to answer RQ2.

4.2. Prompting Strategy

The LLM benchmarking pipeline uses a system–user role format based on the standard chat-completion schema [49], wherein the “system” message defines the LLM’s functional role and behavioral constraints, and the “user” message provides the concrete task input. This modular structure aligns with best practices for reproducible LLM evaluation.
For the experiments, the system prompt conditions the model as a cybersecurity expert specializing in UAV and OT threats, while the user prompt injects a natural-language UAV–OT scenario drawn from the UAVThreatBench dataset. The complete structured prompt shown below was used consistently for all models: the scenario_description entry of each UAVThreatBench scenario was presented as input, and the model was instructed to return exactly two cybersecurity threats relevant to the scenario context. Additional prompt instructions required RED Article compliance, concise phrasing, and realistic threat representation. Control parameters were fixed.
This study employs a zero-shot prompting strategy, described above, with strict JSON output formatting and a fixed threat count (two per scenario). This approach directly assesses the LLMs’ inherent ability to infer and categorize cybersecurity threats from novel scenarios without in-context examples, while simultaneously ensuring the structured output necessary for robust automated evaluation and reproducibility. The constraint on threat quantity specifically evaluates the models’ capacity for salient threat identification within the given context.
While alternative prompting strategies, such as few-shot learning, CoT, and Retrieval-Augmented Generation (RAG), hold potential for further enhancing LLM performance, the primary objective of this study was to establish a baseline evaluation of the UAVThreatBench dataset’s efficacy for threat inference in a zero-shot context. Subsequent research will systematically investigate the impact of these sophisticated prompting techniques on model performance within this domain.
[The master prompt appears as an image in the published article.]
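For illustration, a minimal Python sketch of the benchmarking call is given below, assuming the official OpenAI Python SDK; the system prompt shown paraphrases the behavioral constraints described above and is not the verbatim master prompt, and the model name is an example.

from openai import OpenAI  # assumes the official OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a cybersecurity expert specializing in UAV and OT threats. "
    "Return exactly two realistic, concisely phrased cybersecurity threats "
    "as a JSON list; label each with the applicable RED category: "
    "(d) network/device integrity, (e) data/privacy protection, "
    "or (f) fraud prevention."
)

def generate_threats(scenario_description: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": scenario_description},
        ],
        temperature=0.7,  # fixed decoding settings from Section 4.1
        max_tokens=400,
    )
    return response.choices[0].message.content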
Note that the master prompt does not reproduce the full legal text of RED Articles 3.3(d)–(f). Instead, simplified operational labels such as (d) network/device integrity, (e) data/privacy protection, and (f) fraud prevention are used, as seen above, to guide model outputs. This design choice was made to reduce verbosity, avoid confusion from legal phrasing, and ensure consistent JSON formatting across models. While this abstraction improves reproducibility, future work may explore prompts including the verbatim RED definitions to assess potential gains in legal alignment.

4.3. Evaluation Metrics

Model-generated outputs were post-processed to extract discrete cybersecurity threats per scenario. These were evaluated against expert-defined "Expected Threats", which serve as the ground truth reference for each scenario.
This study evaluated the ability of seven LLMs to generate cybersecurity threats for 100 randomly selected UAV–OT scenarios drawn from the UAVThreatBench dataset. To rigorously assess the quality of the LLM-generated threats, their outputs were compared against an expert-defined ground truth for each scenario in the dataset [39]. This comparison employed a fuzzy matching algorithm, specifically the token_set_ratio function [50], which was implemented as a Python 3.10 script. Note that a sample size of 100 was chosen as a tractable yet diverse subset, ensuring coverage of UAV roles, OT components, communication protocols, and attack origins, while maintaining computational feasibility for multi-model, multi-run benchmarking.
The token_set_ratio algorithm is particularly suitable here as it measures the similarity between two text strings by considering the proportion of shared words (tokens), regardless of their order. This approach is crucial for evaluating LLM outputs, as models can express the same threat concept using varied phrasing while still conveying the correct meaning. A similarity threshold of 70 (on a scale of 0 to 100) was set; any LLM-generated threat achieving this score or higher with a ground truth threat was considered a match.
Note that the 70% threshold for the token_set_ratio was empirically determined and refined through qualitative assessment. This threshold balances precision and recall, ensuring that LLM-generated threats were considered a match only when they demonstrated substantial semantic and lexical overlap with the expert-curated ground truth. This value was found to reliably capture conceptual equivalence while accommodating the linguistic variability inherent in LLM outputs, thus preventing an overly strict evaluation that might penalize valid rephrasing.
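As an illustration of this order-insensitive matching, consider the following snippet, which uses the token_set_ratio implementation from the rapidfuzz library (the fuzzywuzzy API cited above is equivalent); the threat strings are illustrative.

from rapidfuzz import fuzz

THRESHOLD = 70  # similarity threshold on the 0-100 scale (Section 4.3)

generated = "Attackers replay captured LTE/5G data"
ground_truth = "Replay attacks using captured LTE/5G data"

# token_set_ratio compares token sets, so a rephrased threat sharing the
# key terms (replay, captured, LTE/5G, data) scores well above the
# threshold despite the different word order.
score = fuzz.token_set_ratio(generated.lower(), ground_truth.lower())
print(score, score >= THRESHOLD)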
Based on this comparison, the threat generation performance was assessed per scenario using three distinct match categories:
  • Perfect Match: Defined as case- and punctuation-insensitive equivalence between an LLM-generated threat and a ground truth threat.
  • Partial Match: Based on high lexical and semantic similarity, where an LLM-generated threat achieved a token_set_ratio score of ≥70% with a ground truth threat. This category captures instances where the core meaning was conveyed despite variations in phrasing.
  • No Match: Indicates that an LLM-generated threat did not achieve the minimum similarity threshold with any of the expert-defined ground truth threats for a given scenario. A detailed breakdown of No Match cases into non-generation versus mismatch outcomes is provided in Appendix A.
Each scenario was evaluated based on the number of model-generated threats that correctly matched expert-defined threats (via fuzzy matching threshold ≥ 70). Scenarios were classified into three categories: Perfect Match (≥2 threats matched), Partial Match (1 threat matched), and No Match (0 threats matched). Additionally, the semantic correctness of RED Article attribution was assessed for each threat.
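A minimal sketch of this per-scenario classification is shown below, again assuming the rapidfuzz implementation; the function name is illustrative, the released evaluation script may differ, and the separate RED-tag correctness check is omitted for brevity.

from rapidfuzz import fuzz

THRESHOLD = 70

def classify_scenario(generated_threats: list[str],
                      ground_truth_threats: list[str]) -> str:
    """Classify one scenario from the number of generated threats that
    fuzzily match any expert-defined threat (cf. Section 4.3)."""
    matched = sum(
        1 for g in generated_threats
        if any(fuzz.token_set_ratio(g.lower(), t.lower()) >= THRESHOLD
               for t in ground_truth_threats)
    )
    if matched >= 2:
        return "Perfect Match"
    if matched == 1:
        return "Partial Match"
    return "No Match"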
To compare LLM performance, the following evaluation metrics were computed:
  • Time Efficiency: Mean generation time per scenario, per model, which is computed from execution logs to assess latency and scalability.
  • Threat Match Rate: Proportion of model-generated threats that matched ground truth threats across all scenarios (either fully or partially).
  • RED Article Compliance: Fraction of correctly matched threats that also aligned with the corresponding RED Article label in the ground truth.
  • Scenario Classification Distribution: Percentage of scenarios falling into each of the three evaluation classes: Perfect, Partial, or No Match.
This combined evaluation framework enables both fine-grained (per-threat) and coarse-grained (per-scenario) performance analysis, supporting comparative benchmarking across model variants.

5. Results and Discussion

This section presents the empirical results of the evaluation of seven LLMs on the UAVThreatBench dataset, focusing on time efficiency, semantic threat accuracy, and compliance with RED Articles (d), (e), and (f). Both quantitative metrics and scenario-level qualitative analyses are used to assess each model’s ability to generate precise, regulation-aligned cybersecurity threats for UAV–OT deployments. The findings are organized to directly address the research questions outlined in Section 1.1, with results presented in terms of RED Article coverage (RQ1), model-family comparisons (RQ2), and systematic errors and failure modes (RQ3).

5.1. Time Efficiency

Figure 3 illustrates the average threat generation time per scenario across all evaluated models. Notably, the LLaMA-3.1-SauerkrautLM-70B-Instruct model exhibited the highest latency, averaging over 21 s per scenario. This was significantly slower than all other models, including its successor LLaMA-3.3-70B-Instruct, which achieved a substantially lower average inference time of approximately 6.5 s. The drastic discrepancy suggests possible implementation inefficiencies, suboptimal backend optimization, or system-level throttling specific to the LLaMA-3.1-SauerkrautLM-70B-Instruct deployment.
Among the OpenAI models, the GPT-4.1-mini, GPT-4o, and GPT-4.1 variants demonstrated comparable runtimes, averaging between 2.5 and 3 s. The lightweight GPT-4.1-nano and GPT-3.5-turbo-16k models outperformed all others in terms of generation speed, completing the task in approximately 1.5 to 2 s per scenario on average. These results confirm that smaller model variants provide relative advantages in latency compared to larger models, although this often comes at the cost of reduced accuracy.
Overall, the time efficiency analysis reveals a trade-off between model size and response latency. Reported latencies reflect both model decoding speed and provider infrastructure/network effects (e.g., commercial versus academic APIs). They are therefore environment-dependent rather than purely model-intrinsic. Importantly, these latencies are acceptable for analyst-assisted or batch-style UAV–OT risk assessments, which is the intended use case of UAVThreatBench. They are not designed for strict real-time UAV control, and this limitation is explicitly acknowledged in Section 5.6.

5.2. Model Threat Evaluation Summary

This subsection addresses RQ2 by comparing model families using Perfect, Partial, and No-Match rates in conjunction with RED-tag consistency. The evaluation assessed the ability of seven LLMs to generate cybersecurity threats for 100 randomly selected UAV–OT scenarios from the UAVThreatBench dataset. Model outputs were compared against expert-defined ground truth threats using a fuzzy string matching algorithm (token_set_ratio, threshold = 70). Scenario-level performance was then categorized into three outcome classes: Perfect Match, Partial Match, and No Match.
  • Perfect Match: At least two model-generated threats match expert-defined threats for a scenario with semantic similarity score ≥70 and the correct RED Article tag (d, e, or f).
  • Partial Match: Only one threat meets the semantic similarity threshold (≥70) or matches semantically but has a missing or incorrect RED tag.
  • No Match: None of the model-generated threats have a semantic similarity score ≥ 70 with any ground truth threat, or all plausible threats lack the correct RED Article.
Figure 4 summarizes model performance across all scenarios. GPT-4o achieves the highest Perfect Match (46%) and Total Match (87%), followed by GPT-3.5 Turbo 16k and GPT-4.1 Mini (both with 79% Total Match, though lower Perfect Match). Open-weight LLaMA models such as LLaMA-3.1 SauerkrautLM 70B and LLaMA-3.3 70B Instruct demonstrate lower alignment (Perfect Match 15–19%), while GPT-4.1 Nano performs the weakest with 56% of scenarios unmatched.
While No Match rates varied across models, it is important to note that these outcomes arise from two different failure modes: either the model produced threats that did not overlap with the expert-defined ground truth (mismatch), or the model failed to generate any threats at all (non-generation). A detailed breakdown of these cases is provided in Appendix A, which shows that high-performing models (e.g., GPT-4o) consistently generated outputs, whereas weaker models (e.g., GPT-4.1 Nano) frequently failed to generate threats despite the prompt requirement.
To better understand model behavior, we present two representative scenario evaluations below. These illustrate the semantic reasoning quality of model outputs and their alignment with RED Articles (d), (e), and (f).

5.2.1. Semantic Threat Evaluation: Scenario CS-OT-UAV-0027

Scenario CS-OT-UAV-0027 involves an Inventory Management Drone communicating with a Manufacturing Execution System (MES) server over a private LTE/5G network. The attack surface arises from an insecure API between the UAV and the MES/WMS, exposing backend interaction risks. The expert-defined ground truth specifies five threats:
Threat 1: Unauthorized access via insecure API (RED (d));
Threat 2: Interception of real-time inventory data (RED (e));
Threat 3: Backend MES/WMS manipulation (RED (f));
Threat 4: Denial-of-service (DoS) on the MES server (RED (d));
Threat 5: Theft of sensitive data from insecure backend interactions (RED (e)).
Table 2 reports the semantic alignment of all seven evaluated models against these ground truth threats. Six of seven models (GPT-4o, GPT-4.1, GPT-4.1-mini, GPT-3.5-turbo-16k, LLaMA-3.1, and LLaMA-3.3) consistently produced correct matches for Threats 1 and 2. Several also produced partial overlap with Threat 3 (backend manipulation). None identified Threat 4 (availability/DoS) or Threat 5 (backend data theft). GPT-4.1-nano failed to generate any valid threats.
The evaluation highlights the following systematic patterns:
  • High recall for access and confidentiality threats: Six of seven models correctly identified unauthorized API exploitation (Threat 1, RED (d)) and data interception (Threat 2, RED (e)), reflecting robust recognition of surface-level API and communication risks.
  • Systematic omission of availability risks: None of the models captured the DoS attack on the MES server (Threat 4), underscoring a common blind spot in reasoning about availability and resilience within UAV–OT contexts.
  • Partial backend logic awareness: Several models (GPT-4.1, GPT-4.1-mini, GPT-3.5, LLaMA-3.1, LLaMA-3.3) included API exploitation narratives that partially overlap with backend manipulation (Threat 3), but these lacked explicit reference to backend logic alteration or process state integrity.
  • Neglect of backend data theft: No model produced explicit references to backend data theft (Threat 5), suggesting underrepresentation of confidentiality risks beyond communication-level interception.
  • Explicit RED compliance: Except for GPT-4.1-nano, all models explicitly mapped identified threats to the correct RED Articles (d, e, f), demonstrating their ability to align threat reasoning with regulatory categories when semantic matches were found.
Overall, GPT-4o and GPT-4.1 provided the most precise semantic matches, LLaMA-3.3 demonstrated broader but less specific coverage, and GPT-4.1-nano failed entirely on this scenario. The systematic failure across models to recognize availability (DoS) and backend-oriented data theft threats highlights a domain-wide limitation: current LLMs tend to underrepresent resilience-related and backend-centric vulnerabilities compared to access-control and confidentiality threats.

5.2.2. Semantic Threat Evaluation: Scenario CS-OT-UAV-0020

Scenario CS-OT-UAV-0020 involves an Inventory Management Drone communicating with a Manufacturing Execution System (MES) server over a private LTE/5G network. The attack surface arises from RF jamming of the communication link, which disrupts data exchange and potentially exposes backend vulnerabilities. The expert-defined ground truth for this scenario specifies five threats:
Threat 1: RF jamming of the private LTE/5G network disrupting drone–MES communication (RED (d));
Threat 2: Unauthorized interception of real-time inventory data (RED (e));
Threat 3: Communication spoofing leading to incorrect MES updates (RED (f));
Threat 4: Malware injection via compromised network (RED (d));
Threat 5: Unauthorized command injection affecting inventory integrity (RED (f)).
Table 3 presents the semantic alignment of all seven evaluated models against these ground truth threats. All seven models consistently identified T1 (RF jamming) and T2 (data interception), but none produced outputs matching T3–T5.
The evaluation highlights a consistent pattern across all models:
  • Strong recognition of connectivity threats: All models successfully identified the RF jamming attack (T1) and interception of inventory data (T2), confirming robust alignment with surface-level communication risks.
  • Failure to capture deeper exploitation pathways: None of the models identified spoofing, malware injection, or command injection (T3–T5), indicating systematic underperformance in reasoning about backend or multi-stage attacks.
  • Explicit RED compliance: All valid matches were explicitly mapped to the correct RED Articles, demonstrating that RED-aware alignment works reliably when lexical and semantic overlap exists.
Overall, the scenario represents a domain-wide limitation: models exhibit strong detection of connectivity-based risks (availability and confidentiality) but systematically overlook integrity-related backend threats.
Together, the scenario-level evaluations of CS-OT-UAV-0027 (backend/API exploitation) and CS-OT-UAV-0020 (RF jamming and connectivity disruption) illustrate complementary aspects of model behavior: strong recognition of surface-level access and communication threats, but systematic omissions in availability and backend integrity domains. These qualitative analyses provide a deeper understanding of where models succeed and fail in RED-aligned reasoning. A quantitative comparison across all 100 benchmark scenarios is presented in the next section, highlighting broader architectural trends and aggregate performance differences.

5.3. Model Performance: Quantitative Summary

This evaluation provides a comparative assessment of seven LLMs in generating cybersecurity threats for 100 UAV–OT scenarios, benchmarked against expert-defined ground truth. Figure 4 summarizes the results in terms of Perfect Match, Partial Match, and No Match rates, derived from the fuzzy matching threshold (token_set_ratio ≥ 70).
  • GPT-4o demonstrated the strongest overall performance, achieving the highest Perfect Match rate (46%) and Total Match rate (87%), reflecting both syntactic precision and semantic completeness in RED-compliant threat generation.
  • GPT-3.5-turbo-16k followed with a Total Match rate of 79% and a respectable Perfect Match rate (35%), showing consistent performance despite being a smaller, more cost-efficient model.
  • LLaMA-3.3 and LLaMA-3.1 SauerkrautLM exhibited competitive Partial Match rates (44–57%) but much lower Perfect Match rates (15–19%), indicating the generation of semantically relevant but less precisely aligned outputs.
  • GPT-4.1-nano performed the weakest, with only 44% Total Match and the highest No Match rate (56%), suggesting significant limitations in understanding or retrieving RED-relevant threats.
These findings demonstrate a consistent performance gradient shaped by both model architecture and instruction tuning. OpenAI’s larger instruction-tuned models (e.g., GPT-4o) substantially outperform open-weight LLaMA variants in generating regulation-aligned cybersecurity threats. While LLaMA models often capture superficial threat semantics, they frequently miss deeper context, such as backend logic, availability-related vectors, and clause-specific RED mappings, highlighting a gap in OT-aware threat reasoning. These results suggest that architectural scale and instruction tuning are key enablers of performance in structured cybersecurity assessment tasks.
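For transparency, the scenario-level labels in Figure 4 can be derived from threat-level scores roughly as in the sketch below. The rule shown (both generated threats aligned = Perfect Match, exactly one = Partial Match, none = No Match) is a simplified reconstruction consistent with Figure 4 and Appendix A; it omits the additional RED Article tag check applied in the full evaluation.

```python
# Simplified reconstruction of scenario-level labeling, assuming the
# two-threats-per-scenario protocol (Appendix A). The full pipeline also
# requires the correct RED Article tag, omitted here for brevity.
from fuzzywuzzy import fuzz

def classify_scenario(generated, ground_truth, threshold=70):
    """Return 'Perfect', 'Partial', or 'No Match' for one scenario."""
    if not generated:
        return "No Match"  # non-generation is counted as No Match (Table A1)
    hits = sum(
        1 for g in generated
        if max(fuzz.token_set_ratio(g, t) for t in ground_truth) >= threshold
    )
    if hits == len(generated):
        return "Perfect"
    return "Partial" if hits > 0 else "No Match"
```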

5.4. RED Article-Level Analysis and Failure Modes

This subsection addresses RQ1 and RQ3 by quantifying article-level coverage of RED (d), (e), and (f) and analyzing systematic omissions and misclassifications. As clarified in Figure 4, Perfect and Partial matches required both semantic overlap and correct RED category alignment. Building on this, the per-article analysis below highlights that RED (e) threats were consistently recovered, while RED (d) and RED (f) exhibited systematic omissions and misclassifications. To assess regulatory alignment, threats generated by each model were categorized by their correspondence to RED Articles (d), (e), and (f). All models performed strongly for RED Article (e) (personal data/privacy), with match rates exceeding 80%. However, performance declined significantly for RED Article (d) (network and device integrity) and RED Article (f) (protection against fraud and economic harm), particularly for threats related to service disruption (e.g., DoS) and backend-layer exploitation (e.g., MES/WMS manipulation).
In addition to undercoverage, multiple models exhibited systematic failure patterns:
  • Redundancy: Threats were occasionally duplicated using paraphrased text, reducing semantic diversity.
  • Misclassification: Some threats were incorrectly mapped to RED Articles, for instance, data exfiltration mislabeled under RED (f) instead of RED (e).
The experimental results suggest that the LLMs assessed in this study tend to rely on surface-level lexical cues, with only limited internal representations of RED clause semantics or architectural context. Enhancing performance may require retrieval-augmented prompting with RED definitions, structured threat-taxonomy guidance, or clause-specific supervision during fine-tuning.
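As an illustration of retrieval-augmented prompting with RED definitions, the sketch below injects short clause summaries into the scenario prompt. The clause paraphrases and the prompt template are illustrative assumptions; they are not the prompts used in this benchmark.

```python
# Illustrative RED-aware prompt construction. The clause summaries
# paraphrase RED Articles 3.3(d)-(f); the template is an assumption,
# not the exact benchmark prompt.
RED_CLAUSES = {
    "d": "Article 3.3(d): equipment must not harm the network or its "
         "functioning, nor misuse network resources.",
    "e": "Article 3.3(e): equipment must incorporate safeguards protecting "
         "personal data and the privacy of users and subscribers.",
    "f": "Article 3.3(f): equipment must support features protecting "
         "against fraud and resulting economic harm.",
}

def build_prompt(scenario_text: str) -> str:
    """Assemble a clause-grounded threat-elicitation prompt."""
    clause_context = "\n".join(RED_CLAUSES.values())
    return (
        "You are a cybersecurity auditor for UAV-OT systems.\n\n"
        f"Relevant RED clauses:\n{clause_context}\n\n"
        f"Scenario:\n{scenario_text}\n\n"
        "List the two most critical threats. Cover confidentiality, "
        "integrity, AND availability, include backend (MES/WMS) risks, "
        "and tag each threat with the matching RED article (d, e, or f)."
    )
```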

5.5. Identified Gaps and Recommendations

This subsection synthesizes the experimental findings across RQ1–RQ3, highlighting failure patterns and remediation strategies. The bullet points below summarize the key gaps revealed by each research question.
  • RQ1: Can LLMs generate cybersecurity threats that align with RED Articles (d), (e), and (f)?
As shown in the RED Article-level evaluation, most models reliably identify access-control and data-confidentiality threats associated with RED (d) and (e). However, performance drops significantly for availability-related threats under RED (d) and, above all, for RED (f), where threats involve backend logic manipulation or financial fraud. This suggests partial regulatory alignment: while LLMs are competent in recognizing common threat types, they struggle with domain-specific semantics, such as MES/WMS disruption or fraud scenarios, which are essential for RED (f) compliance.
  • RQ2: How do different LLM families perform across UAV–OT scenarios?
    OpenAI models (GPT-4o, GPT-3.5-turbo) consistently outperformed open-source LLaMA variants in both perfect and partial threat matching. GPT-4o achieved 87% total match and 46% perfect match, while LLaMA models peaked at 57% partial match with only 15–19% perfect match. These results demonstrate a clear gradient of performance linked to instruction tuning and architectural scale. Open-weight LLMs can produce broadly plausible threats but frequently lack operational specificity and RED Article granularity.
  • RQ3: What types of systematic errors or omissions occur in LLM-based threat generation, and how do they affect RED alignment?
    This evaluation surfaced several recurring failure patterns: (i) omission of availability threats such as DoS (RED (d)); (ii) limited representation of backend-layer risks (RED (f)); (iii) misclassification of threats across RED categories; and (iv) generation of redundant or semantically shallow outputs. These omissions directly affect the completeness and accuracy of RED-compliant threat generation, highlighting the models’ insufficient grasp of OT-specific reasoning and layered threat semantics.
These insights reveal critical gaps in the capabilities of the LLMs evaluated in this study for RED-aligned UAV cybersecurity.
  • Threat type bias: Most models display a consistent bias toward threats impacting confidentiality and integrity, with significant underrepresentation of availability threats (e.g., DoS), despite their relevance under RED Article (d).
  • Backend semantics: Threats involving backend logic, MES/WMS manipulation, or system-level economic consequences (RED (f)) are rarely identified, even by high-performing models. This points to a lack of domain grounding in industrial OT interactions.
  • Regulatory mapping errors: Some threats were misclassified, e.g., privacy violations incorrectly labeled as RED (f), indicating incomplete understanding of RED clause semantics.
To improve LLM-based threat modeling in RED-regulated domains, the following recommendations are proposed:
  • Structured benchmarks: Future evaluations should explicitly assess LLM coverage across the full CIA triad (confidentiality, integrity, availability) and all RED Articles (d), (e), and (f).
  • Domain-grounded prompting: Prompts should include OT-specific cues (e.g., WMS logic, industrial protocols, backend operations) to guide models toward deeper contextual threat reasoning.
  • RED-aware evaluation: Validation frameworks should include clause-specific threat mapping to detect misclassification and assess clause-level recall/precision (a sketch of such a computation follows this list).
  • Retrieval-augmented reasoning: Incorporating RED clause definitions, threat taxonomies, and example mappings through retrieval-based prompting may help mitigate hallucination and regulatory misalignment.
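As a concrete form of the RED-aware evaluation proposed above, the following sketch computes clause-level recall and precision from (threat text, article tag) pairs. The data layout and the alignment input are assumptions for illustration, not the released evaluation code.

```python
# Sketch of clause-level recall/precision. Each threat is a (text, article)
# pair with article in {"d", "e", "f"}; matched_pairs holds index pairs
# (i, j) aligning ground-truth threat i with generated threat j, as produced
# by the fuzzy-matching step. The data layout is an illustrative assumption.
from collections import Counter

def clause_metrics(ground_truth, generated, matched_pairs):
    gt_counts = Counter(article for _, article in ground_truth)
    gen_counts = Counter(article for _, article in generated)
    # A pair counts toward an article only if both sides carry that tag,
    # so RED misclassifications lower recall and precision alike.
    hit_counts = Counter(
        ground_truth[i][1]
        for i, j in matched_pairs
        if ground_truth[i][1] == generated[j][1]
    )
    return {
        article: {
            "recall": hit_counts[article] / gt_counts[article]
            if gt_counts[article] else None,
            "precision": hit_counts[article] / gen_counts[article]
            if gen_counts[article] else None,
        }
        for article in ("d", "e", "f")
    }
```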

5.6. Threats to Validity

The following limitations and validity threats, which qualify the external and construct validity of answers to RQ1–RQ3 (e.g., indoor UAV–OT scope and matching-threshold sensitivity), were identified in the design, annotation, and evaluation phases of the UAVThreatBench benchmark:
  • Expert Annotation Validity: Ground truth threats were initially curated by a cybersecurity consultant and domain expert (affiliated with innotec GmbH–TÜV Austria Group [38] and author of this work), who led the Stage II human-in-the-loop curation process. To improve semantic precision and regulatory alignment, a second expert independently reviewed a subset of scenarios, with joint validation sessions used to resolve ambiguities. The final annotations were consolidated following these discussions and serve as a practical reference for evaluation rather than the only correct set; other plausible threats may exist. While this process improves annotation robustness, future releases will include formal inter-annotator agreement metrics and blind dual annotation, along with priority/severity metadata to capture differential relevance among threats.
  • Semantic Threshold Sensitivity: The fuzzy matching threshold of 70 (token_set_ratio) was empirically set to balance paraphrase tolerance with specificity in short threat statements. While effective in capturing conceptual equivalence, different thresholds may shift performance outcomes. Future work will include systematic sensitivity analyses to quantify the robustness of results across threshold values (a sketch of such a sweep follows this list). The curated threats are treated as canonical handles for benchmarking, enabling consistent comparison across models. In real-world audits, however, risk assessment requires tolerance for variability and incomplete reasoning, as auditors consider whether generated threats are directionally plausible rather than exact replicas. This distinction highlights that UAVThreatBench measures structured alignment under controlled conditions, while practical deployment must account for margins of fault tolerance when interpreting LLM outputs.
  • Scenario Domain Coverage: The dataset focuses on UAV roles in indoor industrial contexts (inventory and delivery). This does not capture outdoor UAV use cases (e.g., GNSS-based navigation), critical infrastructure deployments (e.g., power grids), or hybrid human–machine control setups. These remain valuable directions for extension.
  • Scope of Threat Levels: UAVThreatBench explicitly models threats at the communication, software/firmware, and backend logic levels, in line with the categories mandated by RED Articles 3.3(d)–(f). Other practically relevant threat levels, such as physical drone capture, outdoor GNSS-based interference, or organizational/socio-technical risks, are not included in this release. This scope is intentional: RED compliance emphasizes radio communication integrity, data/privacy protection, and fraud/economic harm, which correspond directly to the modeled threat categories. While this limitation narrows coverage, it ensures that UAVThreatBench remains tightly aligned with RED requirements. Future work may extend the benchmark to incorporate multi-level threat scenarios beyond the current regulatory focus.
  • RED Article Mapping Ambiguity: Each threat was tagged with a single RED Article label for clarity and evaluability. However, in real audits, some threats may span multiple RED concerns (e.g., confidentiality and control). Assigning only the most salient label simplifies legal–technical mappings but may underrepresent cross-category threats.
  • Timing and Infrastructure Dependence: The reported time efficiency metric reflects average latency per scenario, which conflates model-intrinsic decoding speed with provider infrastructure and network effects (e.g., OpenAI via commercial API, LLaMA via academic API). These values are therefore environment-dependent and not directly comparable as pure model performance. Moreover, the measured latencies of 1–2 s per scenario are acceptable for batch-style or analyst-assisted risk assessment workflows, which is the intended use case of UAVThreatBench. They are not suitable for strict real-time UAV control (e.g., flight stabilization or collision avoidance), which lies outside the scope of this benchmark. Local deployment of models could reduce latency, but this was not evaluated in the present study.
  • Decoding non-determinism: Even with fixed parameters (temperature = 0.7, top-p = 1.0), outputs remain subject to stochastic variability. Provider documentation confirms that outputs are only “mostly deterministic” even at temperature = 0.0, and recent studies report 3–5% variability across repeated runs [45,46,47,48]. Our results may therefore be affected by similar levels of variation.
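The sensitivity analysis flagged in the Semantic Threshold Sensitivity item above can be scripted as follows: stored model outputs are re-scored at several cut-off values and the resulting match rates compared. The per-scenario record fields are assumptions for illustration, not the released dataset schema.

```python
# Illustrative threshold-sensitivity sweep over stored model outputs.
# Each record is assumed (for illustration) to look like:
#   {"generated": [str, ...], "ground_truth": [str, ...]}
from fuzzywuzzy import fuzz

def total_match_rate(results, threshold):
    """Share of scenarios with at least one aligned generated threat."""
    matched = sum(
        1 for rec in results
        if any(
            max(fuzz.token_set_ratio(g, t) for t in rec["ground_truth"])
            >= threshold
            for g in rec["generated"]
        )
    )
    return matched / len(results)

# Example sweep around the published cut-off of 70:
# for th in (60, 65, 70, 75, 80):
#     print(th, total_match_rate(results, th))
```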

6. Conclusions

This study introduced UAVThreatBench, the first structured benchmark for evaluating LLMs in cybersecurity threat identification aligned with the European Radio Equipment Directive (RED) in UAV–OT deployment contexts. Based on a novel expert-curated dataset of 924 scenarios and more than 4600 annotated threats mapped to RED Articles 3.3(d), (e) and (f), the benchmark provides a reproducible foundation for assessing LLM performance in safety-critical domains.
Experimental evaluation of seven state-of-the-art LLMs on a representative subset of 100 scenarios from the UAVThreatBench dataset showed that instruction-tuned models such as GPT-4o and GPT-3.5-turbo-16k achieved strong threat alignment, with total match rates of 87% and 79%, respectively. In contrast, open-weight models like LLaMA-3.3 attained a 76% total match but only 19% perfect matches, highlighting persistent gaps in handling availability threats (RED (d)) and backend-layer risks (RED (f)). Semantic analysis confirmed underperformance across all models in capturing denial-of-service vectors and MES/WMS logic threats critical for RED-compliant cybersecurity.
In relation to the research questions, the results show that LLMs are capable of generating RED-aligned threats, but their coverage remains uneven: while privacy-related threats under RED (e) were reliably identified, systematic gaps persisted for availability threats (RED (d)) and backend risks (RED (f)) (RQ1). Closed-weight, instruction-tuned models consistently outperformed open-weight counterparts, achieving higher exact matches and more consistent RED tagging (RQ2). At the same time, error analysis revealed recurring deficiencies, including missed availability threats, under-coverage of backend logic, and occasional non-generation failures (notably in GPT-4.1 Nano), underscoring the fragility of current LLMs in safety-critical domains (RQ3).
The scope of this study is limited to indoor UAV–OT deployment scenarios, and findings may not generalize directly to other domains or operational environments. Nevertheless, UAVThreatBench provides one of the first systematic empirical evaluations of LLMs for regulatory-aligned cybersecurity threat mapping in UAV–OT systems. Beyond benchmarking, the dataset establishes a reusable basis for advancing research on LLM robustness, OT-aware prompting, pretraining strategies, and compliance validation. To enable open and reproducible development of trustworthy AI for industrial cybersecurity, the UAVThreatBench dataset is publicly released under the MIT license via a dedicated GitHub repository.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data and evaluation artifacts supporting the findings of this study are publicly available under the MIT license. This includes the full UAVThreatBench dataset of 924 expert-annotated UAV–OT scenarios, the subset of 100 scenarios used for evaluation, and all model-generated outputs from the seven evaluated LLMs. The dataset is available in a dedicated GitHub repository for open access and reproducibility: see the UAVThreatBench project page at github.com/piyenghar/UAVThreatBench [39].

Acknowledgments

The author is grateful to the anonymous reviewers whose suggestions have substantially improved the clarity, rigor, and overall quality of the manuscript.

Conflicts of Interest

Author Padma Iyenghar was employed by the company innotec GmbH. The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
5G: Fifth Generation Mobile Network
AGV: Automated Guided Vehicle
AI: Artificial Intelligence
API: Application Programming Interface
AS/RS: Automated Storage and Retrieval System
BLE: Bluetooth Low Energy
CoT: Chain of Thought
DB: Database
EU: European Union
GNSS: Global Navigation Satellite System
HMI: Human–Machine Interface
ICS: Industrial Control System
IoT: Internet of Things
LLM: Large Language Model
LTE: Long-Term Evolution
MES: Manufacturing Execution System
NLP: Natural Language Processing
OT: Operational Technology
PLC: Programmable Logic Controller
RAG: Retrieval-Augmented Generation
RED: Radio Equipment Directive
UAV: Unmanned Aerial Vehicle
Wi-Fi: Wireless Fidelity
WMS: Warehouse Management System
WPA2: Wi-Fi Protected Access 2

Appendix A. Breakdown of No Match Cases

To further clarify the interpretation of No Match outcomes, Table A1 provides a detailed breakdown of cases where no expert-aligned threats were identified. No Match outcomes are separated into two categories:
  • Non-Generation: The model failed to generate any threats, despite the prompt explicitly requiring two threats per scenario.
  • Mismatch: The model generated two threats, but neither overlapped with the five expert-defined threats curated in Stage II.
Several observations emerge:
  • GPT-4o, GPT-4.1, GPT-4.1 Mini, and LLaMA-3.3 consistently generated threats; their No Match cases resulted exclusively from semantic mismatch rather than complete failure.
  • GPT-3.5 Turbo showed a single non-generation case, with the majority of errors due to mismatches.
  • LLaMA-3.1 SauerkrautLM exhibited two non-generation failures, but most No Match cases stemmed from mismatches.
  • GPT-4.1 Nano was an outlier: in 45% of scenarios it failed to generate any threats, despite the prompt requirement for two. This highlights fundamental robustness issues in adhering to task instructions.
This analysis demonstrates that while most No Match outcomes reflect mismatches with the ground truth, some models, particularly GPT-4.1 Nano, suffered from systematic non-generation, undermining their suitability for automated threat elicitation in industrial risk assessment settings.
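The breakdown itself is mechanical, as the sketch below illustrates; the record fields are again illustrative assumptions rather than the released schema. Table A1 reports the resulting percentages.

```python
# Sketch of the Table A1 breakdown: No Match scenarios are split into
# non-generation (no threats emitted) and mismatch (threats emitted, none
# aligned). Record fields are illustrative assumptions.
from fuzzywuzzy import fuzz

def no_match_breakdown(results, threshold=70):
    non_generation = mismatch = 0
    for rec in results:
        if not rec["generated"]:
            non_generation += 1
        elif not any(
            max(fuzz.token_set_ratio(g, t) for t in rec["ground_truth"])
            >= threshold
            for g in rec["generated"]
        ):
            mismatch += 1
    n = len(results)
    return {
        "no_match_pct": 100 * (non_generation + mismatch) / n,
        "non_generation_pct": 100 * non_generation / n,
        "mismatch_pct": 100 * mismatch / n,
    }
```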
Table A1. Breakdown of No Match cases into non-generation versus mismatch across all 100 UAVThreatBench scenarios. “Non-Generation” indicates cases where the model failed to output any threats. “Mismatch” indicates cases where threats were produced but none matched the ground truth.
Model | No Match (%) | Non-Generation (%) | Mismatch (%)
GPT-4o | 13 | 0 | 13
GPT-3.5 Turbo 16k | 21 | 1 | 20
GPT-4.1 | 24 | 0 | 24
GPT-4.1 Mini | 21 | 0 | 21
LLaMA-3.3 70B Instruct | 24 | 0 | 24
LLaMA-3.1 SauerkrautLM 70B | 41 | 2 | 39
GPT-4.1 Nano | 56 | 45 | 11

References

  1. Genc, H.; Zu, Y.; Chin, T.-W.; Halpern, M.; Reddi, V.J. Flying IoT: Toward Low-Power Vision in the Sky. IEEE Micro 2017, 37, 40–51. [Google Scholar] [CrossRef]
  2. European Union. Directive 2014/53/EU of the European Parliament and of the Council of 16 April 2014 on the Harmonisation of the Laws of the Member States Relating to the Making Available on the Market of Radio Equipment (Radio Equipment Directive). 2014. Available online: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32014L0053 (accessed on 25 July 2025).
  3. EN 18031-1:2024; Network Protection (RED 3.3(d)). Official Journal of the European Union, Commission Implementing Decision 2025/138, 2025. Harmonised Standard for Article 3.3(d) with Restrictions. CEN/CENELEC: Brussels, Belgium, 2024.
  4. EN 18031-2:2024; Data Protection (RED 3.3(e)). Official Journal of the European Union, Commission Implementing Decision 2025/138, 2025. Harmonised Standard for Article 3.3(e) with Restrictions. CEN/CENELEC: Brussels, Belgium, 2024.
  5. EN 18031-3:2024; Fraud Prevention (RED 3.3(f)). Official Journal of the European Union, Commission Implementing Decision 2025/138, 2025. Harmonised Standard for Article 3.3(f) with Restrictions. CEN/CENELEC: Brussels, Belgium, 2024.
  6. ISO 21384-3:2023; Unmanned Aircraft Systems: Operational Procedures, 2023. Specifies Safety and Security Requirements for Commercial UAS Operations—Including Command and Control Link (C2) Protocols. International Organization for Standardization (ISO): Geneva, Switzerland, 2023.
  7. IEC 62443-4-2:2019/COR1:2022; Security for Industrial Automation and Control Systems–Part 4-2: Technical Security Requirements for IACS Components. IEC Webstore and VDE Verlag. International Electrotechnical Commission: Geneva, Switzerland, 2019.
  8. Regulation (EU) 2024/2847 of the European Parliament and of the Council of 13 March 2024 on Horizontal Cybersecurity Requirements for Products with Digital Elements and Amending Regulation (EU) 2019/1020. Official Journal of the European Union, L 2024/2847, 10 December 2024. Available online: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R2847 (accessed on 30 July 2025).
  9. European Parliament and the Council of the European Union. Directive (EU) 2022/2555 of 14 December 2022 on measures for a high common level of cybersecurity across the Union. Also known as the NIS 2 Directive; repeals Directive (EU) 2016/1148. Off. J. Eur. Union 2022, L 333, 80–152. [Google Scholar]
  10. European Commission. Commission Delegated Regulation (EU) 2022/30 of 29 October 2021 Supplementing Directive 2014/53/EU with Regard to the Application of the Essential Requirements Referred to in Article 3(3)(d), (e), and (f). 2022. Available online: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32022R0030 (accessed on 25 July 2025).
  11. Liu, W.; Li, Q.; Xu, Y.; Jin, X.; Song, Y.; Liu, X.; Lin, Z.; Li, J.; Liu, P.; Liu, B.; et al. CyberLLM: Evaluating Large Language Models for Cyber Threat Intelligence Tasks. arXiv 2024, arXiv:2503.23175. [Google Scholar]
  12. Yue, X.; Ma, L.; Behnke, K.; Xu, H.; Wu, W.; Liang, Y.; Yuan, X.; Han, X.; Zhang, C.; Shah, N.H.; et al. On the Inconsistency of Reasoning in Large Language Models. arXiv 2024, arXiv:2502.07036. [Google Scholar]
  13. Carter, J.; Smith, A.; Lee, B.; Zhang, C.; Kumar, R.; Davis, L.; Johnson, M.; Hernandez, P.; Miller, T.; Chen, Y.; et al. Toward Secure AI: A National Security Perspective on Large Language Models. Technical Report; SEI—Software Engineering Institute, Carnegie Mellon University: Pittsburgh, PA, USA, 2023. [Google Scholar]
  14. Mekdad, Y.; Aris, A.; Babun, L.; El Fergougui, A.; Conti, M.; Lazzeretti, R.; Uluagac, A.S. A survey on security and privacy issues of UAVs. Comput. Netw. 2023, 224, 109626. [Google Scholar] [CrossRef]
  15. Tsao, K.Y.; Girdler, T.; Vassilakis, V.G. A survey of cyber security threats and solutions for UAV communications and flying ad-hoc networks. Ad Hoc Netw. 2022, 133, 102894. [Google Scholar] [CrossRef]
  16. Bai, N.; Hu, X.; Wang, S. A survey on unmanned aerial systems cybersecurity. J. Syst. Archit. 2024, 156, 103282. [Google Scholar] [CrossRef]
  17. Zolfaghari, B.; Abbasmollaei, M.; Hajizadeh, F.; Yanai, N.; Bibak, K. Secure UAV (Drone) and the Great Promise of AI. ACM Comput. Surv. 2024, 56, 1–37. [Google Scholar] [CrossRef]
  18. Sarıkaya, B.S.; Bahtiyar, Ş. A survey on security of UAV and deep reinforcement learning. Ad Hoc Netw. 2024, 164, 103642. [Google Scholar] [CrossRef]
  19. Tlili, F.; Ayed, S.; Fourati, L.C. Advancing UAV security with artificial intelligence: A comprehensive survey of techniques and future directions. Internet Things 2024, 27, 101281. [Google Scholar] [CrossRef]
  20. Adil, M.; Song, H.; Mastorakis, S.; Abulkasim, H.; Farouk, A.; Jin, Z. UAV-assisted IoT applications, cybersecurity threats, AI-enabled solutions, open challenges with future research directions. IEEE Trans. Intell. Veh. 2024, 9, 4583–4605. [Google Scholar] [CrossRef]
  21. Alturki, N.; Aljrees, T.; Umer, M.; Ishaq, A.; Alsubai, S.; Djuraev, S.; Ashraf, I. An intelligent framework for cyber–physical satellite system and IoT-aided aerial vehicle security threat detection. Sensors 2023, 23, 7154. [Google Scholar] [CrossRef]
  22. Miao, S.; Pan, Q. Risk Assessment of UAV Cyber Range Based on Bayesian–Nash Equilibrium. Drones 2024, 8, 556. [Google Scholar] [CrossRef]
  23. Yang, Z.; Zhang, Y.; Zeng, J.; Yang, Y.; Jia, Y.; Song, H.; Lv, T.; Sun, Q.; An, J. AI-Driven Safety and Security for UAVs: From Machine Learning to Large Language Models. Drones 2025, 9, 392. [Google Scholar] [CrossRef]
  24. Iyenghar, P. Clever Hans in the Loop? A Critical Examination of ChatGPT in a Human-In-The-Loop Framework for Machinery Functional Safety Risk Analysis. Eng 2025, 6, 31. [Google Scholar] [CrossRef]
  25. Iyenghar, P.; Zimmer, C.; Gregorio, C. A Feasibility Study on Chain-of-Thought Prompting for LLM-Based OT Cybersecurity Risk Assessment. In Proceedings of the 8th IEEE International Conference on Industrial Cyber-Physical Systems, ICPS 2025, Emden, Germany, 12–15 May 2025; pp. 1–4. [Google Scholar] [CrossRef]
  26. Iyenghar, P. Empirical Evaluation of Reasoning LLMs in Machinery Functional Safety Risk Assessment and the Limits of Anthropomorphized Reasoning. Electronics 2025, 14, 3624. [Google Scholar] [CrossRef]
  27. Iyenghar, P. On the Development and Application of a Structured Dataset for Data-Driven Risk Assessment in Industrial Functional Safety. In Proceedings of the 2025 IEEE 21st International Conference on Factory Communication Systems (WFCS), Rostock, Germany, 10–13 June 2025; pp. 1–8. [Google Scholar] [CrossRef]
  28. Iyenghar, P.; Hu, Y.; Kieviet, M.; Pulvermüller, E.; Wübbelmann, J. AI-Based Assistant for Determining the Required Performance Level for a Safety Function. In Proceedings of the IECON 2022—48th Annual Conference of the IEEE Industrial Electronics Society, Brussels, Belgium, 17–20 October 2022; pp. 1–6. [Google Scholar] [CrossRef]
  29. Iyenghar, P.; Kieviet, M.; Pulvermüller, E.; Wübbelmann, J. A Chatbot Assistant for Reducing Risk in Machinery Design. In Proceedings of the 2023 IEEE 21st International Conference on Industrial Informatics (INDIN), Lemgo, Germany, 18–20 July 2023; pp. 1–8. [Google Scholar] [CrossRef]
  30. Chen, Y.; Cui, M.; Wang, D.; Cao, Y.; Yang, P.; Jiang, B.; Lu, Z.; Liu, B. A survey of large language models for cyber threat detection. Comput. Secur. 2024, 145, 104016. [Google Scholar] [CrossRef]
  31. Bhusal, D.; Alam, M.T.; Nguyen, L.; Mahara, A.; Lightcap, Z.; Frazier, R.; Fieblinger, R.; Torales, G.L.; Blakely, B.A.; Rastogi, N. SECURE: Benchmarking Large Language Models for Cybersecurity. arXiv 2024, arXiv:2405.20441. [Google Scholar] [CrossRef]
  32. Bhatt, M.; Chennabasappa, S.; Li, Y.; Nikolaidis, C.; Song, D.; Wan, S.; Ahmad, F.; Aschermann, C.; Chen, Y.; Kapil, D.; et al. CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models. arXiv 2024, arXiv:2404.13161. [Google Scholar]
  33. European Telecommunications Standards Institute. Cyber Security for Consumer Internet of Things: Baseline Requirements. ETSI EN 303 645 V2.1.1 (2020-06). 2020. Available online: https://www.etsi.org/deliver/etsi_en/303600_303699/303645/02.01.01_60/en_303645v020101p.pdf (accessed on 30 July 2025).
  34. Ye, N.; Wu, Q.; Ouyang, Q.; Hou, C.; Zhang, Y.; Kang, B.; Pan, J. Fly, Sense, Compress, and Transmit: Satellite-Aided Airborne Secure Data Acquisition in Harsh Remote Area for Intelligent Transportations. IEEE Trans. Intell. Transp. Syst. 2025, 1–14, Early Access. [Google Scholar] [CrossRef]
  35. Ye, J.; Zhang, C.; Lei, H.; Pan, G.; Ding, Z. Secure UAV-to-UAV Systems With Spatially Random UAVs. IEEE Wirel. Commun. Lett. 2019, 8, 564–567. [Google Scholar] [CrossRef]
  36. Udvaros, J. Industrial and Technological Security with Drones in Logistics Centers. Zenodo, 2024. Available online: https://as-proceeding.com/index.php/ijanser/article/view/2510 (accessed on 1 September 2025).
  37. Fernandez-Carames, T.M.; Blanco-Novoa, O.; Froiz-Miguez, I.; Fraga-Lamas, P. Towards an autonomous Industry 4.0 warehouse: A UAV and blockchain-based system for inventory and traceability applications. arXiv 2024, arXiv:2402.00709. [Google Scholar] [CrossRef]
  38. Innotec GmbH-TÜV Austria Group: Partner for Safeware Engineering. Available online: https://www.innotecsafety.com/ (accessed on 5 July 2025).
  39. Iyenghar, P. UAVThreatBench: RED-Compliant Benchmark Dataset for UAV–OT Cybersecurity Threat Identification. GitHub Repository, 2025. Available online: https://github.com/piyenghar/UAVThreatBench (accessed on 30 July 2025).
  40. OpenAI. GPT-4 Technical Report. OpenAI Website. 2023. Available online: https://openai.com/research/gpt-4 (accessed on 30 July 2025).
  41. OpenAI. ChatGPT and GPT-3.5 Overview. OpenAI API Documentation. 2023. Available online: https://platform.openai.com/docs/models/gpt-3-5 (accessed on 30 July 2025).
  42. OpenAI. GPT-4o: OpenAI’s Omnimodal Model. OpenAI Website. 2024. Available online: https://platform.openai.com/docs/models/gpt-4o (accessed on 30 July 2025).
  43. Meta AI. LLaMA 3: Open Foundation and Instruction Models. Meta AI Blog. 2024. Available online: https://ai.meta.com/llama/ (accessed on 30 July 2025).
  44. VAGO Solutions. SauerkrautLM: High-Quality German Instruction-Tuned LLaMA-3. HuggingFace Model Card. 2024. Available online: https://huggingface.co/VAGOsolutions/Llama-3-SauerkrautLM-70b-Instruct (accessed on 30 July 2025).
  45. Google Cloud. Vertex AI Generative AI Overview: Deterministic Output. 2025. Available online: https://cloud.google.com/vertex-ai/docs/generative-ai/learn/overview (accessed on 2 September 2025).
  46. Anthropic. Anthropic API Documentation: Sampling and Determinism. 2025. Available online: https://docs.anthropic.com/claude/reference/complete_post (accessed on 2 September 2025).
  47. OpenAI. OpenAI API Documentation: Reproducibility, Seed, and System Fingerprint. 2025. Available online: https://cookbook.openai.com/examples/reproducible_outputs_with_the_seed_parameter (accessed on 2 September 2025).
  48. Zou, J.; Xu, M.; Liang, P. Non-Determinism of “Deterministic” Decoding Settings in Large Language Models. arXiv 2025, arXiv:2501.01234. [Google Scholar]
  49. OpenAI. OpenAI API—Chat Completions Format, 2024. Available online: https://platform.openai.com/docs/guides/text (accessed on 14 September 2025).
  50. SeatGeek. fuzzywuzzy: Fuzzy String Matching in Python. Available online: https://github.com/seatgeek/fuzzywuzzy (accessed on 5 May 2025).
Figure 1. Risk assessment workflow for RED compliance based on EN 18031-1 [3].
Figure 2. Representative UAV–OT system architecture used in UAVThreatBench. The diagram illustrates the three main architectural layers, namely, UAV subsystem, communication/OT access, and backend enterprise systems, annotated with RED (d/e/f) attack surfaces. While not all dataset dimensions (e.g., specific OT subsystems, data flow functions, or attack vector classes) are shown, the figure provides the abstraction on which scenario generation and threat elicitation are based.
Figure 3. LLM inference time per scenario on UAVThreatBench (100 scenarios).
Figure 4. Threat classification outcomes across 100 benchmark scenarios. Model outputs are labeled as Perfect, Partial, or No Match based on semantic similarity (≥70) and correct RED Article tag.
Table 1. Plausibility mapping of UAV roles, OT components, and communication protocols for each data flow function. This table operationalizes the abstract UAV–OT system model in Figure 2, showing how structured scenario dimensions are instantiated to generate the benchmark cases in UAVThreatBench.
Data Flow Function | Valid UAV Roles | Valid OT Components | Valid Communication Protocols
Real-time inventory data upload | Inventory Management Drone | MES Server, Industrial Wi-Fi Access Point | WPA2-Enterprise Wi-Fi, Private LTE/5G
Flight path updates and control signals | Inventory Management Drone, Automated Parts Delivery Drone | AGV Fleet Control System, Industrial Wi-Fi Access Point | WPA2-Enterprise Wi-Fi, Private LTE/5G, Proprietary RF Link, Bluetooth LE
Automated firmware updates | Inventory Management Drone, Automated Parts Delivery Drone | ICS PLC, MES Server | Wired Ethernet, Private LTE/5G, WPA2-Enterprise Wi-Fi
Diagnostic logs transfer | Inventory Management Drone, Automated Parts Delivery Drone | AGV Fleet Control System, ICS PLC, MES Server | WPA2-Enterprise Wi-Fi, Private LTE/5G, Wired Ethernet, Bluetooth LE
Sensor readings (temperature, pressure) | Inventory Management Drone | MES Server, AGV Fleet Control System | WPA2-Enterprise Wi-Fi, Private LTE/5G
Assembly instructions download | Automated Parts Delivery Drone | AS/RS, MES Server | WPA2-Enterprise Wi-Fi, Private LTE/5G, Wired Ethernet
Table 2. Scenario CS-OT-UAV-0027: Per-model alignment with five expert-defined threats (T1–T5). The ground truth threats 1–5 are listed above; due to space constraints they are referred to here by number only. ✓ = semantic match (score ≥ 70); ✗ = no match; “Explicit” = correct RED Article stated.
Model | T1 | T2 | T3 | T4 | T5 | RED Article
GPT-4o | ✓ | ✓ | ✗ | ✗ | ✗ | Explicit
GPT-4.1 | ✓ | ✓ | ✓ | ✗ | ✗ | Explicit
GPT-4.1-mini | ✓ | ✓ | ✓ | ✗ | ✗ | Explicit
GPT-4.1-nano | ✗ | ✗ | ✗ | ✗ | ✗ | –
GPT-3.5-turbo-16k | ✓ | ✓ | ✓ | ✗ | ✗ | Explicit
LLaMA-3.1 | ✓ | ✓ | ✓ | ✗ | ✗ | Explicit
LLaMA-3.3 | ✓ | ✓ | ✓ | ✗ | ✗ | Explicit
Table 3. Scenario CS-OT-UAV-0020: Per-model alignment with five expert-defined threats (T1–T5). The ground truth threats 1–5 are listed above; due to space constraints they are referred to here by number only. ✓ = semantic match (score ≥ 70); ✗ = no match; “Explicit” = correct RED Article stated.
Model | T1 | T2 | T3 | T4 | T5 | RED Article
GPT-4o | ✓ | ✓ | ✗ | ✗ | ✗ | Explicit
GPT-4.1 | ✓ | ✓ | ✗ | ✗ | ✗ | Explicit
GPT-4.1-mini | ✓ | ✓ | ✗ | ✗ | ✗ | Explicit
GPT-4.1-nano | ✓ | ✓ | ✗ | ✗ | ✗ | Explicit
GPT-3.5-turbo-16k | ✓ | ✓ | ✗ | ✗ | ✗ | Explicit
LLaMA-3.1 | ✓ | ✓ | ✗ | ✗ | ✗ | Explicit
LLaMA-3.3 | ✓ | ✓ | ✗ | ✗ | ✗ | Explicit
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
