UAVThreatBench: A UAV Cybersecurity Risk Assessment Dataset and Empirical Benchmarking of LLMs for Threat Identification
Abstract
1. Introduction
- Article (d)–protection of network and device integrity;
- Article (e)–protection of personal data and privacy;
- Article (f)–protection against fraud and economic harm.
1.1. Research Motivation
- RQ1: Can LLMs generate cybersecurity threats that align with RED Articles (d), (e), and (f)?
- RQ2: How do different families of LLMs (e.g., GPT-4, LLaMA) perform in RED-compliant threat identification across UAV–OT interaction scenarios?
- RQ3: What types of systematic errors or omissions occur in LLM-based threat generation for UAV–OT scenarios, and how do they affect RED article alignment?
1.2. Contributions
1. Curated Benchmark Dataset: A novel dataset comprising 924 plausible UAV–OT cyber-physical scenarios was constructed via constrained combinatorial synthesis. Each scenario is defined by structured parameters (e.g., UAV role, communication protocol, cybersecurity origin) and subsequently annotated with expert-curated threats aligned to RED Articles (d), (e), and (f), based on ETSI EN 18031-1 [3].
2. Empirical Evaluation of LLMs: A systematic evaluation was conducted on seven leading LLMs, including OpenAI’s GPT-4 variants and open-source LLaMA-3 models, assessing their ability to perform structured threat generation. The analysis includes quantitative metrics (e.g., RED Article classification accuracy, match rate) and qualitative insights (e.g., semantic plausibility).
2. Related Work
2.1. Surveys on Cybersecurity of UAVs
2.2. AI-Driven Techniques and Frameworks for UAV Threat Mitigation
2.3. Emergence of LLMs in UAV Cybersecurity and Threat Detection
3. UAVThreatBench Dataset
3.1. Stage I: Scenario Generation
- UAV Roles:
  - Inventory Management Drone—used for real-time stocktaking, inventory scanning, and warehouse mapping.
  - Automated Parts Delivery Drone—facilitates point-to-point transport of components across production zones.
- Interacting OT Components:
  - Manufacturing Execution System (MES) Server—orchestrates production operations and synchronizes UAV data.
  - AGV Fleet Control System—enables coordination between ground-based mobile robots and UAVs.
  - Industrial Wi-Fi Access Point—primary wireless gateway for telemetry and real-time data exchange.
  - Automated Storage and Retrieval System (AS/RS)—interacts with UAVs for automated materials handling.
  - Industrial Control System (ICS) PLC—governs low-level interfaces such as docking stations or firmware uploads.
- Communication Protocols:
  - Wi-Fi Protected Access 2 (WPA2)-Enterprise Wi-Fi—standard secure wireless network for UAV telemetry and control.
  - Private LTE/5G Network—offers higher bandwidth and low latency, typically for mission-critical data flows.
  - Proprietary RF Link—used in custom deployments where standard protocols are unsuitable.
  - Wired Ethernet (via docking station)—employed for high-integrity operations such as firmware upgrades.
  - Bluetooth Low Energy (LE)—used for proximity-based diagnostics or field pairing.
- Data Flow Functions:
  - Real-time inventory data upload;
  - Flight path updates and control signals;
  - Automated firmware updates;
  - Diagnostic logs transfer;
  - Sensor readings (temperature, pressure);
  - Assembly instructions download.
- Cybersecurity Origins and Attack Vectors: Each scenario is linked to one of five high-level categories that represent common attack surfaces in UAV–OT systems:
  - Network Interface (e.g., Unauthorized Wi-Fi Access, Insecure Ethernet Port);
  - Communication Link (e.g., Unencrypted LTE/5G Link, GNSS Spoofing);
  - Software/Firmware (e.g., Insecure Firmware Update Mechanism, Weak Credentials);
  - Physical Access Point (e.g., Unsecured USB Port, Exposed Diagnostic Interfaces);
  - Backend System Interaction (e.g., Insecure APIs to MES/WMS, Cloud Service Vulnerabilities).
- Potential Consequences: Each attack vector is associated with 2–3 plausible outcomes, grounded in the cybersecurity literature and standards (e.g., ETSI EN 303 645, IEC 62443). Examples include the following:
  - Data Exfiltration, Remote Code Execution, Command Injection;
  - Mission Disruption, Unauthorized System Control, Inventory Manipulation.
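The structured parameters above are combined under validity constraints (see the data-flow validity table reproduced near the end of this article) to yield the 924 curated scenarios. The following Python sketch illustrates one plausible way such constrained combinatorial synthesis could be implemented; the constraint entries shown are taken from the validity table, but the exact filtering and sampling rules of the released dataset are assumptions, and this is not the authors' release code.

```python
# Illustrative sketch of constrained combinatorial scenario synthesis (Stage I).
# Only two data-flow entries are shown; the remaining constraints follow the
# validity table.
from itertools import product

# data flow function -> (valid UAV roles, valid OT components, valid protocols)
CONSTRAINTS = {
    "Real-time inventory data upload": (
        ["Inventory Management Drone"],
        ["MES Server", "Industrial Wi-Fi Access Point"],
        ["WPA2-Enterprise Wi-Fi", "Private LTE/5G"],
    ),
    "Automated firmware updates": (
        ["Inventory Management Drone", "Automated Parts Delivery Drone"],
        ["ICS PLC", "MES Server"],
        ["Wired Ethernet", "Private LTE/5G", "WPA2-Enterprise Wi-Fi"],
    ),
    # ... remaining data flow functions omitted for brevity
}

# cybersecurity origin -> example attack vectors (Section 3.1)
ATTACK_VECTORS = {
    "Network Interface": ["Unauthorized Wi-Fi Access", "Insecure Ethernet Port"],
    "Communication Link": ["Unencrypted LTE/5G Link", "GNSS Spoofing"],
    "Software/Firmware": ["Insecure Firmware Update Mechanism", "Weak Credentials"],
    "Physical Access Point": ["Unsecured USB Port", "Exposed Diagnostic Interfaces"],
    "Backend System Interaction": ["Insecure APIs to MES/WMS", "Cloud Service Vulnerabilities"],
}

def synthesize_scenarios() -> list[dict]:
    """Enumerate only those parameter combinations permitted by the constraints."""
    scenarios = []
    for flow, (roles, components, protocols) in CONSTRAINTS.items():
        for role, component, protocol in product(roles, components, protocols):
            for origin, vectors in ATTACK_VECTORS.items():
                for vector in vectors:
                    scenarios.append({
                        "uav_role": role,
                        "ot_component": component,
                        "protocol": protocol,
                        "data_flow": flow,
                        "origin": origin,
                        "attack_vector": vector,
                    })
    return scenarios

if __name__ == "__main__":
    print(f"{len(synthesize_scenarios())} candidate scenarios")
```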
Listing 1. Example of a curated UAV–OT cybersecurity scenario with corresponding expected threats and mapping to RED Articles.
3.2. Stage II: Human-in-the-Loop Threat Curation
- For each scenario, five distinct threats are identified, tailored to the attack vector and operational context of that scenario.
- Each threat entry consists of a concise textual description of the expected attack consequence (e.g., “replay attacks using captured LTE/5G data”) and a mapping to the corresponding RED Article, namely 3.3(d) (network protection), 3.3(e) (data/privacy protection), or 3.3(f) (fraud prevention). The manual annotation process ensures alignment with harmonized cybersecurity requirements for radio-connected industrial devices [2,33].
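A hypothetical record illustrating the resulting annotation schema is sketched below. The field names and the scenario parameter values (reconstructed from the threats discussed for scenario CS-OT-UAV-0027 in Section 5.2.1) are illustrative assumptions, not the verbatim layout of the released dataset.

```python
# Hypothetical UAVThreatBench entry after Stage II curation: scenario parameters
# plus five expert-defined threats, each mapped to one RED Article.
example_entry = {
    "id": "CS-OT-UAV-0027",
    "uav_role": "Inventory Management Drone",
    "ot_component": "MES Server",
    "protocol": "Private LTE/5G",
    "data_flow": "Real-time inventory data upload",
    "origin": "Backend System Interaction",
    "attack_vector": "Insecure APIs to MES/WMS",
    "expected_threats": [
        {"threat": "Unauthorized access via insecure API", "red_article": "3.3(d)"},
        {"threat": "Interception of real-time inventory data", "red_article": "3.3(e)"},
        {"threat": "Backend MES/WMS manipulation", "red_article": "3.3(f)"},
        {"threat": "Denial-of-service (DoS) on the MES server", "red_article": "3.3(d)"},
        {"threat": "Theft of sensitive data from insecure backend interactions", "red_article": "3.3(e)"},
    ],
}
```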
4. Experimental Design
4.1. Model Selection and Configuration
- OpenAI Models: GPT-3.5 Turbo 16k, GPT-4.1, GPT-4.1 Mini, GPT-4.1 Nano, and GPT-4o [40,41,42]. These represent the latest production-grade models from OpenAI with varying inference costs and capabilities. While GPT-3.5 Turbo 16k serves as a strong low-cost baseline, GPT-4.1 and GPT-4o are state-of-the-art models with high reasoning performance. The GPT-4.1 Mini and GPT-4.1 Nano variants offer reduced latency and lower compute profiles, making them relevant for resource-constrained industrial deployments. All OpenAI models were accessed via the official OpenAI API under a commercial license with controlled temperature and token settings.
- LLaMA-3 Variants: LLaMA-3.1-SauerkrautLM-70B-Instruct and LLaMA-3.3-70B-Instruct [43,44]. These open-source instruction-tuned variants of Meta’s LLaMA-3 70B model family are adapted for multilingual and domain-specific alignment, optimized for complex task-following capabilities. The SauerkrautLM variant incorporates fine-tuning for longer-form, structured outputs and security-relevant instructions. These models were accessed through the Academic Cloud OpenChat API, supporting reproducible benchmarking outside closed commercial ecosystems.
1. Architectural Coverage: By including models ranging from ∼20B parameters (OpenAI mini/nano estimates) up to 70B parameters (LLaMA-3), we ensure the evaluation spans both lightweight deployment targets and large-scale reasoning systems.
2. Deployment Diversity: OpenAI models represent high-availability, production-ready cloud APIs, while LLaMA-3 variants represent community-accessible open-source options, enabling deployment flexibility in air-gapped or regulated OT environments.
3. Instruction-Tuning Relevance: All selected models are instruction-tuned and capable of structured output generation, which is critical for consistent JSON-formatted threat responses as required by the benchmark prompt format.
4.2. Prompting Strategy
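The prompting strategy required each model to return structured, JSON-formatted threats for every scenario (Section 4.1), with two threats requested per scenario (Appendix A). A minimal sketch of such a setup, assuming the OpenAI chat-completions interface cited in the references and the decoding settings reported in Section 5.6 (temperature 0.7, top-p 1.0), is shown below; the prompt wording and helper names are illustrative assumptions rather than the exact benchmark prompt.

```python
# Minimal prompting sketch (illustrative; not the verbatim benchmark prompt).
import json
from openai import OpenAI  # OpenAI Python SDK v1-style client (assumed interface)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "You are an OT/UAV cybersecurity analyst. For the industrial UAV-OT scenario "
    "below, identify two distinct cybersecurity threats. Respond ONLY with a JSON "
    "list of objects with the fields 'threat' and 'red_article' "
    "('3.3(d)', '3.3(e)', or '3.3(f)').\n\nScenario: {scenario}"
)

def generate_threats(scenario_text: str, model: str = "gpt-4o") -> list[dict]:
    """Query one model for structured threats for a single scenario."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(scenario=scenario_text)}],
        temperature=0.7,  # decoding settings as reported in Section 5.6
        top_p=1.0,
    )
    # Models occasionally return non-JSON text; real runs would need error handling.
    return json.loads(response.choices[0].message.content)
```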
4.3. Evaluation Metrics
- Perfect Match: Defined as case- and punctuation-insensitive equivalence between an LLM-generated threat and a ground truth threat.
- Partial Match: Based on high lexical and semantic similarity, where an LLM-generated threat achieved a token_set_ratio score of at least 70 with a ground truth threat. This category captures instances where the core meaning was conveyed despite variations in phrasing (see the matching sketch at the end of this subsection).
- No Match: Indicates that an LLM-generated threat did not achieve the minimum similarity threshold with any of the expert-defined ground truth threats for a given scenario. A detailed breakdown of No Match cases into non-generation versus mismatch outcomes is provided in Appendix A.
- Time Efficiency: Mean generation time per scenario, per model, which is computed from execution logs to assess latency and scalability.
- Threat Match Rate: Proportion of model-generated threats that matched ground truth threats across all scenarios (either fully or partially).
- RED Article Compliance: Fraction of correctly matched threats that also aligned with the corresponding RED Article label in the ground truth.
- Scenario Classification Distribution: Percentage of scenarios falling into each of the three evaluation classes: Perfect, Partial, or No Match.
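The per-threat matching described above can be sketched as follows, using the fuzzywuzzy token_set_ratio function cited in the references; the text normalization used for the Perfect Match check is an assumption.

```python
# Per-threat matching sketch: Perfect, Partial, or No Match (illustrative).
import re
from fuzzywuzzy import fuzz

THRESHOLD = 70  # token_set_ratio threshold used throughout the evaluation

def normalize(text: str) -> str:
    """Case- and punctuation-insensitive form used for the Perfect Match check."""
    return re.sub(r"[^\w\s]", "", text).lower().strip()

def match_threat(generated: str, ground_truths: list[str]) -> str:
    """Classify one LLM-generated threat against the expert-defined threats."""
    if any(normalize(generated) == normalize(gt) for gt in ground_truths):
        return "perfect"
    if any(fuzz.token_set_ratio(generated, gt) >= THRESHOLD for gt in ground_truths):
        return "partial"
    return "no_match"
```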
5. Results and Discussion
5.1. Time Efficiency
5.2. Model Threat Evaluation Summary
- Perfect Match: At least two model-generated threats match expert-defined threats for a scenario with semantic similarity score ≥70 and the correct RED Article tag (d, e, or f).
- Partial Match: Only one threat meets the semantic similarity threshold (≥70) or matches semantically but has a missing or incorrect RED tag.
- No Match: None of the model-generated threats have a semantic similarity score ≥ 70 with any ground truth threat, or all plausible threats lack the correct RED Article.
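Assuming each generated threat has already been scored against its best ground-truth candidate (e.g., with the matching sketch in Section 4.3), the scenario-level rules above can be expressed roughly as follows; where the Partial and No Match definitions overlap (semantic hits with incorrect RED tags), the sketch follows the Partial Match reading.

```python
# Scenario-level labelling sketch following the rules listed above (illustrative).
def classify_scenario(matches: list[dict]) -> str:
    """matches: one dict per generated threat, e.g. {"similarity": 83, "red_correct": True}."""
    semantic = [m for m in matches if m["similarity"] >= 70]
    full = [m for m in semantic if m["red_correct"]]
    if len(full) >= 2:
        return "Perfect Match"   # >= 2 semantic matches with the correct RED tag
    if semantic:
        return "Partial Match"   # one semantic hit, or hits with missing/wrong RED tag
    return "No Match"
```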
5.2.1. Semantic Threat Evaluation: Scenario CS-OT-UAV-0027
- Threat 1: Unauthorized access via insecure API (RED (d));
- Threat 2: Interception of real-time inventory data (RED (e));
- Threat 3: Backend MES/WMS manipulation (RED (f));
- Threat 4: Denial-of-service (DoS) on the MES server (RED (d));
- Threat 5: Theft of sensitive data from insecure backend interactions (RED (e)).
- High recall for access and confidentiality threats: Six of seven models correctly identified unauthorized API exploitation (Threat 1, RED (d)) and data interception (Threat 2, RED (e)), reflecting robust recognition of surface-level API and communication risks.
- Systematic omission of availability risks: None of the models captured the DoS attack on the MES server (Threat 4), underscoring a common blind spot in reasoning about availability and resilience within UAV–OT contexts.
- Partial backend logic awareness: Several models (GPT-4.1, GPT-4.1-mini, GPT-3.5, LLaMA-3.1, LLaMA-3.3) included API exploitation narratives that partially overlap with backend manipulation (Threat 3), but these lacked explicit reference to backend logic alteration or process state integrity.
- Neglect of backend data theft: No model produced explicit references to backend data theft (Threat 5), suggesting underrepresentation of confidentiality risks beyond communication-level interception.
- Explicit RED compliance: Except for GPT-4.1-nano, all models explicitly mapped identified threats to the correct RED Articles (d, e, f), demonstrating their ability to align threat reasoning with regulatory categories when semantic matches were found.
5.2.2. Semantic Threat Evaluation: Scenario CS-OT-UAV-0020
- Threat 1: RF Jamming of the Private LTE/5G Network disrupting drone–MES communication (RED (d));
- Threat 2: Unauthorized interception of real-time inventory data (RED (e));
- Threat 3: Communication spoofing leading to incorrect MES updates (RED (f));
- Threat 4: Malware injection via compromised network (RED (d));
- Threat 5: Unauthorized command injection affecting inventory integrity (RED (f)).
- Strong recognition of connectivity threats: All models successfully identified the RF jamming attack (T1) and interception of inventory data (T2), confirming robust alignment with surface-level communication risks.
- Failure to capture deeper exploitation pathways: None of the models identified spoofing, malware injection, or command injection (T3–T5), indicating systematic underperformance in reasoning about backend or multi-stage attacks.
- Explicit RED compliance: All valid matches were explicitly mapped to the correct RED Articles, demonstrating that RED-aware alignment works reliably when lexical and semantic overlap exists.
5.3. Model Performance: Quantitative Summary
- GPT-4o demonstrated the strongest overall performance, achieving the highest Perfect Match rate (46%) and Total Match rate (87%), reflecting both syntactic precision and semantic completeness in RED-compliant threat generation.
- GPT-3.5-turbo-16k followed with a Total Match rate of 79% and a respectable Perfect Match rate (35%), showing consistent performance despite being a smaller, more cost-efficient model.
- LLaMA-3.3 and LLaMA-3.1 SauerkrautLM exhibited competitive Partial Match rates (44–57%) but much lower Perfect Match rates (15–19%), indicating the generation of semantically relevant but less precisely aligned outputs.
- GPT-4.1-nano performed the weakest, with only 44% Total Match and the highest No Match rate (56%), suggesting significant limitations in understanding or retrieving RED-relevant threats.
5.4. RED Article-Level Analysis and Failure Modes
- Redundancy: Threats were occasionally duplicated using paraphrased text, reducing semantic diversity.
- Misclassification: Some threats were incorrectly mapped to RED Articles, for instance, data exfiltration mislabeled under RED (f) instead of RED (e).
5.5. Identified Gaps and Recommendations
- RQ1: Can LLMs generate cybersecurity threats that align with RED Articles (d), (e), and (f)? As shown in the RED Article-level evaluation, most models reliably identify threats aligned with RED (d) and (e), primarily access control and data confidentiality concerns. However, performance drops significantly for RED (f), where threats involve backend logic manipulation or financial fraud. This suggests partial regulatory alignment: while LLMs are competent in recognizing common threat types, they struggle with domain-specific semantics, such as MES/WMS disruption or fraud scenarios, which are essential for RED (f) compliance.
- RQ2: How do different LLM families perform across UAV–OT scenarios? OpenAI models (GPT-4o, GPT-3.5-turbo) consistently outperformed open-source LLaMA variants in both perfect and partial threat matching. GPT-4o achieved 87% total match and 46% perfect match, while LLaMA models peaked at 57% partial match with only 15–19% perfect match. These results demonstrate a clear gradient of performance linked to instruction tuning and architectural scale. Open-weight LLMs can produce broadly plausible threats but frequently lack operational specificity and RED Article granularity.
- RQ3: What types of systematic errors or omissions occur in LLM-based threat generation, and how do they affect RED alignment? This evaluation surfaced several recurring failure patterns: (i) omission of availability threats such as DoS (RED (d)); (ii) limited representation of backend-layer risks (RED (f)); (iii) misclassification of threats across RED categories; and (iv) generation of redundant or semantically shallow outputs. These omissions directly affect the completeness and accuracy of RED-compliant threat generation, highlighting the models’ insufficient grasp of OT-specific reasoning and layered threat semantics.
- Threat type bias: Most models display a consistent bias toward threats impacting confidentiality and integrity, with significant underrepresentation of availability threats (e.g., DoS), despite their relevance under RED Article (d).
- Backend semantics: Threats involving backend logic, MES/WMS manipulation, or system-level economic consequences (RED (f)) are rarely identified, even by high-performing models. This points to a lack of domain grounding in industrial OT interactions.
- Regulatory mapping errors: Some threats were misclassified, e.g., privacy violations incorrectly labeled as RED (f), indicating incomplete understanding of RED clause semantics.
- Structured benchmarks: Future evaluations should explicitly assess LLM coverage across the full CIA triad (confidentiality, integrity, availability) and all RED Articles (d), (e), and (f).
- Domain-grounded prompting: Prompts should include OT-specific cues (e.g., WMS logic, industrial protocols, backend operations) to guide models toward deeper contextual threat reasoning.
- RED-aware evaluation: Validation frameworks should include clause-specific threat mapping to detect misclassification and assess clause-level recall/precision.
- Retrieval-augmented reasoning: Incorporating RED clause definitions, threat taxonomies, and example mappings through retrieval-based prompting may help mitigate hallucination and regulatory misalignment.
5.6. Threats to Validity
- Expert Annotation Validity: Ground truth threats were initially curated by a cybersecurity consultant and domain expert (affiliated with innotec GmbH-TÜV Austria Group [38], and author of this work), who led the Stage II human-in-the-loop curation process. To improve semantic precision and regulatory alignment, a second expert independently reviewed a subset of scenarios, with joint validation sessions used to resolve ambiguities. The final annotations were consolidated following these discussions and serve as a practical reference for evaluation rather than the only correct set; other plausible threats may exist. While this process improves annotation robustness, future releases will include formal inter-annotator agreement metrics and blind dual annotation, along with priority/severity metadata to capture differential relevance among threats.
- Semantic Threshold Sensitivity: The fuzzy matching threshold of 70% (token_set_ratio) was empirically set to balance paraphrase tolerance with specificity in short threat statements. While effective in capturing conceptual equivalence, different thresholds may shift performance outcomes. Future work will include systematic sensitivity analyses to quantify robustness of results across threshold values. The curated threats are treated as canonical handles for benchmarking, enabling consistent comparison across models. In real-world audits, however, risk assessment requires tolerance for variability and incomplete reasoning, as auditors consider whether generated threats are directionally plausible rather than exact replicas. This distinction highlights that UAVThreatBench measures structured alignment under controlled conditions, while practical deployment must account for margins of fault tolerance when interpreting LLM outputs.
- Scenario Domain Coverage: The dataset focuses on UAV roles in indoor industrial contexts (inventory and delivery). This does not capture outdoor UAV use cases (e.g., GNSS-based navigation), critical infrastructure deployments (e.g., power grids), or hybrid human–machine control setups. These remain valuable directions for extension.
- Scope of Threat Levels: UAVThreatBench explicitly models threats at the communication, software/firmware, and backend logic levels, in line with the categories mandated by RED Articles 3.3(d)–(f). Other practically relevant threat levels, such as physical drone capture, outdoor GNSS-based interference, or organizational/socio-technical risks, are not included in this release. This scope is intentional: RED compliance emphasizes radio communication integrity, data/privacy protection, and fraud/economic harm, which correspond directly to the modeled threat categories. While this limitation narrows coverage, it ensures that UAVThreatBench remains tightly aligned with RED requirements. Future work may extend the benchmark to incorporate multi-level threat scenarios beyond the current regulatory focus.
- RED Article Mapping Ambiguity: Each threat was tagged with a single RED Article label for clarity and evaluability. However, in real audits, some threats may span multiple RED concerns (e.g., confidentiality and control). Assigning only the most salient label simplifies legal–technical mappings but may underrepresent cross-category threats.
- Timing and Infrastructure Dependence: The reported time efficiency metric reflects average latency per scenario, which conflates model-intrinsic decoding speed with provider infrastructure and network effects (e.g., OpenAI via commercial API, LLaMA via academic API). These values are therefore environment-dependent and not directly comparable as pure model performance. Moreover, the measured latencies of 1–2 s per scenario are acceptable for batch-style or analyst-assisted risk assessment workflows, which is the intended use case of UAVThreatBench. They are not suitable for strict real-time UAV control (e.g., flight stabilization or collision avoidance), which lies outside the scope of this benchmark. Local deployment of models could reduce latency, but this was not evaluated in the present study.
- Decoding Non-Determinism: Even with fixed parameters (temperature = 0.7, top-p = 1.0), outputs remain subject to stochastic variability. Provider documentation confirms that outputs are only “mostly deterministic” even at temperature = 0.0, and recent studies report 3–5% variability across repeated runs [45,46,47,48]. Our results may therefore be affected by similar levels of variation.
6. Conclusions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
Abbreviation | Definition |
---|---|
5G | Fifth Generation Mobile Network |
AGV | Automated Guided Vehicle |
AI | Artificial Intelligence |
API | Application Programming Interface |
AS/RS | Automated Storage and Retrieval System |
BLE | Bluetooth Low Energy |
CoT | Chain of Thought |
DB | Database |
EU | European Union |
GNSS | Global Navigation Satellite System |
HMI | Human–Machine Interface |
ICS | Industrial Control System |
IoT | Internet of Things |
LLM | Large Language Model |
LTE | Long-Term Evolution |
MES | Manufacturing Execution System |
NLP | Natural Language Processing |
OT | Operational Technology |
PLC | Programmable Logic Controller |
RAG | Retrieval-Augmented Generation |
RED | Radio Equipment Directive |
UAV | Unmanned Aerial Vehicle |
Wi-Fi | Wireless Fidelity |
WMS | Warehouse Management System |
WPA2 | Wi-Fi Protected Access 2 |
Appendix A. Breakdown of No Match Cases
- Non-Generation: The model failed to generate any threats, despite the prompt explicitly requiring two threats per scenario.
- Mismatch: The model generated two threats, but neither overlapped with the five expert-defined threats curated in Stage II.
- GPT-4o, GPT-4.1, GPT-4.1 Mini, and LLaMA-3.3 consistently generated threats; their No Match cases resulted exclusively from semantic mismatch rather than complete failure.
- GPT-3.5 Turbo showed a single non-generation case, with the majority of errors due to mismatches.
- LLaMA-3.1 SauerkrautLM exhibited two non-generation failures, but most No Match cases stemmed from mismatches.
- GPT-4.1 Nano was an outlier: in 45% of scenarios it failed to generate any threats, despite the prompt requirement for two. This highlights fundamental robustness issues in adhering to task instructions.
Model | No Match (%) | Non-Generation (%) | Mismatch (%) |
---|---|---|---|
GPT-4o | 13 | 0 | 13 |
GPT-3.5 Turbo 16k | 21 | 1 | 20 |
GPT-4.1 | 24 | 0 | 24 |
GPT-4.1 Mini | 21 | 0 | 21 |
LLaMA-3.3 70B Instruct | 24 | 0 | 24 |
LLaMA-3.1 SauerkrautLM 70B | 41 | 2 | 39 |
GPT-4.1 Nano | 56 | 45 | 11 |
References
- Genc, H.; Zu, Y.; Chin, T.-W.; Halpern, M.; Reddi, V.J. Flying IoT: Toward Low-Power Vision in the Sky. IEEE Micro 2017, 37, 40–51. [Google Scholar] [CrossRef]
- European Union. Directive 2014/53/EU of the European Parliament and of the Council of 16 April 2014 on the Harmonisation of the Laws of the Member States Relating to the Making Available on the Market of Radio Equipment (Radio Equipment Directive). 2014. Available online: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32014L0053 (accessed on 25 July 2025).
- EN 18031-1:2024; Network Protection (RED 3.3(d)). Official Journal of the European Union, Commission Implementing Decision 2025/138, 2025. Harmonised Standard for Article 3.3(d) with Restrictions. CEN/CENELEC: Brussels, Belgium, 2024.
- EN 18031-2:2024; Data Protection (RED 3.3(e)). Official Journal of the European Union, Commission Implementing Decision 2025/138, 2025. Harmonised Standard for Article 3.3(e) with Restrictions. CEN/CENELEC: Brussels, Belgium, 2024.
- EN 18031-3:2024; Fraud Prevention (RED 3.3(f)). Official Journal of the European Union, Commission Implementing Decision 2025/138, 2025. Harmonised Standard for Article 3.3(f) with Restrictions. CEN/CENELEC: Brussels, Belgium, 2024.
- ISO 21384-3:2023; Unmanned Aircraft Systems: Operational Procedures, 2023. Specifies Safety and Security Requirements for Commercial UAS Operations—Including Command and Control Link (C2) Protocols. International Organization for Standardization (ISO): Geneva, Switzerland, 2023.
- IEC 62443-4-2:2019/COR1:2022; Security for Industrial Automation and Control Systems–Part 4-2: Technical Security Requirements for IACS Components. IEC Webstore and VDE Verlag. International Electrotechnical Commission: Geneva, Switzerland, 2019.
- Regulation (EU) 2024/2847 of the European Parliament and of the Council of 13 March 2024 on Horizontal Cybersecurity Requirements for Products with Digital Elements and Amending Regulation (EU) 2019/1020. Official Journal of the European Union, L 2024/2847, 10 December 2024. Available online: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R2847 (accessed on 30 July 2025).
- European Parliament and the Council of the European Union. Directive (EU) 2022/2555 of 14 December 2022 on measures for a high common level of cybersecurity across the Union. Also known as the NIS 2 Directive; repeals Directive (EU) 2016/1148. Off. J. Eur. Union 2022, L 333, 80–152. [Google Scholar]
- European Commission. Commission Delegated Regulation (EU) 2022/30 of 29 October 2021 Supplementing Directive 2014/53/EU with Regard to the Application of the Essential Requirements Referred to in Article 3(3)(d), (e), and (f). 2022. Available online: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32022R0030 (accessed on 25 July 2025).
- Liu, W.; Li, Q.; Xu, Y.; Jin, X.; Song, Y.; Liu, X.; Lin, Z.; Li, J.; Liu, P.; Liu, B.; et al. CyberLLM: Evaluating Large Language Models for Cyber Threat Intelligence Tasks. arXiv 2024, arXiv:2503.23175. [Google Scholar]
- Yue, X.; Ma, L.; Behnke, K.; Xu, H.; Wu, W.; Liang, Y.; Yuan, X.; Han, X.; Zhang, C.; Shah, N.H.; et al. On the Inconsistency of Reasoning in Large Language Models. arXiv 2024, arXiv:2502.07036. [Google Scholar]
- Carter, J.; Smith, A.; Lee, B.; Zhang, C.; Kumar, R.; Davis, L.; Johnson, M.; Hernandez, P.; Miller, T.; Chen, Y.; et al. Toward Secure AI: A National Security Perspective on Large Language Models. Technical Report; SEI—Software Engineering Institute, Carnegie Mellon University: Pittsburgh, PA, USA, 2023. [Google Scholar]
- Mekdad, Y.; Aris, A.; Babun, L.; El Fergougui, A.; Conti, M.; Lazzeretti, R.; Uluagac, A.S. A survey on security and privacy issues of UAVs. Comput. Netw. 2023, 224, 109626. [Google Scholar] [CrossRef]
- Tsao, K.Y.; Girdler, T.; Vassilakis, V.G. A survey of cyber security threats and solutions for UAV communications and flying ad-hoc networks. Ad Hoc Netw. 2022, 133, 102894. [Google Scholar] [CrossRef]
- Bai, N.; Hu, X.; Wang, S. A survey on unmanned aerial systems cybersecurity. J. Syst. Archit. 2024, 156, 103282. [Google Scholar] [CrossRef]
- Zolfaghari, B.; Abbasmollaei, M.; Hajizadeh, F.; Yanai, N.; Bibak, K. Secure UAV (Drone) and the Great Promise of AI. ACM Comput. Surv. 2024, 56, 1–37. [Google Scholar] [CrossRef]
- Sarıkaya, B.S.; Bahtiyar, Ş. A survey on security of UAV and deep reinforcement learning. Ad Hoc Netw. 2024, 164, 103642. [Google Scholar] [CrossRef]
- Tlili, F.; Ayed, S.; Fourati, L.C. Advancing UAV security with artificial intelligence: A comprehensive survey of techniques and future directions. Internet Things 2024, 27, 101281. [Google Scholar] [CrossRef]
- Adil, M.; Song, H.; Mastorakis, S.; Abulkasim, H.; Farouk, A.; Jin, Z. UAV-assisted IoT applications, cybersecurity threats, AI-enabled solutions, open challenges with future research directions. IEEE Trans. Intell. Veh. 2024, 9, 4583–4605. [Google Scholar] [CrossRef]
- Alturki, N.; Aljrees, T.; Umer, M.; Ishaq, A.; Alsubai, S.; Djuraev, S.; Ashraf, I. An intelligent framework for cyber–physical satellite system and IoT-aided aerial vehicle security threat detection. Sensors 2023, 23, 7154. [Google Scholar] [CrossRef]
- Miao, S.; Pan, Q. Risk Assessment of UAV Cyber Range Based on Bayesian–Nash Equilibrium. Drones 2024, 8, 556. [Google Scholar] [CrossRef]
- Yang, Z.; Zhang, Y.; Zeng, J.; Yang, Y.; Jia, Y.; Song, H.; Lv, T.; Sun, Q.; An, J. AI-Driven Safety and Security for UAVs: From Machine Learning to Large Language Models. Drones 2025, 9, 392. [Google Scholar] [CrossRef]
- Iyenghar, P. Clever Hans in the Loop? A Critical Examination of ChatGPT in a Human-In-The-Loop Framework for Machinery Functional Safety Risk Analysis. Eng 2025, 6, 31. [Google Scholar] [CrossRef]
- Iyenghar, P.; Zimmer, C.; Gregorio, C. A Feasibility Study on Chain-of-Thought Prompting for LLM-Based OT Cybersecurity Risk Assessment. In Proceedings of the 8th IEEE International Conference on Industrial Cyber-Physical Systems, ICPS 2025, Emden, Germany, 12–15 May 2025; pp. 1–4. [Google Scholar] [CrossRef]
- Iyenghar, P. Empirical Evaluation of Reasoning LLMs in Machinery Functional Safety Risk Assessment and the Limits of Anthropomorphized Reasoning. Electronics 2025, 14, 3624. [Google Scholar] [CrossRef]
- Iyenghar, P. On the Development and Application of a Structured Dataset for Data-Driven Risk Assessment in Industrial Functional Safety. In Proceedings of the 2025 IEEE 21st International Conference on Factory Communication Systems (WFCS), Rostock, Germany, 10–13 June 2025; pp. 1–8. [Google Scholar] [CrossRef]
- Iyenghar, P.; Hu, Y.; Kieviet, M.; Pulvermüller, E.; Wübbelmann, J. AI-Based Assistant for Determining the Required Performance Level for a Safety Function. In Proceedings of the IECON 2022—48th Annual Conference of the IEEE Industrial Electronics Society, Brussels, Belgium, 17–20 October 2022; pp. 1–6. [Google Scholar] [CrossRef]
- Iyenghar, P.; Kieviet, M.; Pulvermüller, E.; Wübbelmann, J. A Chatbot Assistant for Reducing Risk in Machinery Design. In Proceedings of the 2023 IEEE 21st International Conference on Industrial Informatics (INDIN), Lemgo, Germany, 18–20 July 2023; pp. 1–8. [Google Scholar] [CrossRef]
- Chen, Y.; Cui, M.; Wang, D.; Cao, Y.; Yang, P.; Jiang, B.; Lu, Z.; Liu, B. A survey of large language models for cyber threat detection. Comput. Secur. 2024, 145, 104016. [Google Scholar] [CrossRef]
- Bhusal, D.; Alam, M.T.; Nguyen, L.; Mahara, A.; Lightcap, Z.; Frazier, R.; Fieblinger, R.; Torales, G.L.; Blakely, B.A.; Rastogi, N. SECURE: Benchmarking Large Language Models for Cybersecurity. arXiv 2024, arXiv:2405.20441. [Google Scholar] [CrossRef]
- Bhatt, M.; Chennabasappa, S.; Li, Y.; Nikolaidis, C.; Song, D.; Wan, S.; Ahmad, F.; Aschermann, C.; Chen, Y.; Kapil, D.; et al. CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models. arXiv 2024, arXiv:2404.13161. [Google Scholar]
- European Telecommunications Standards Institute. Cyber Security for Consumer Internet of Things: Baseline Requirements. ETSI EN 303 645 V2.1.1 (2020-06). 2020. Available online: https://www.etsi.org/deliver/etsi_en/303600_303699/303645/02.01.01_60/en_303645v020101p.pdf (accessed on 30 July 2025).
- Ye, N.; Wu, Q.; Ouyang, Q.; Hou, C.; Zhang, Y.; Kang, B.; Pan, J. Fly, Sense, Compress, and Transmit: Satellite-Aided Airborne Secure Data Acquisition in Harsh Remote Area for Intelligent Transportations. IEEE Trans. Intell. Transp. Syst. 2025, 1–14, Early Access. [Google Scholar] [CrossRef]
- Ye, J.; Zhang, C.; Lei, H.; Pan, G.; Ding, Z. Secure UAV-to-UAV Systems With Spatially Random UAVs. IEEE Wirel. Commun. Lett. 2019, 8, 564–567. [Google Scholar] [CrossRef]
- Udvaros, J. Industrial and Technological Security with Drones in Logistics Centers. Zenodo, 2024. Available online: https://as-proceeding.com/index.php/ijanser/article/view/2510 (accessed on 1 September 2025).
- Fernandez-Carames, T.M.; Blanco-Novoa, O.; Froiz-Miguez, I.; Fraga-Lamas, P. Towards an autonomous Industry 4.0 warehouse: A UAV and blockchain-based system for inventory and traceability applications. arXiv 2024, arXiv:2402.00709. [Google Scholar] [CrossRef]
- Innotec GmbH-TÜV Austria Group: Partner for Safeware Engineering. Available online: https://www.innotecsafety.com/ (accessed on 5 July 2025).
- Iyenghar, P. UAVThreatBench: RED-Compliant Benchmark Dataset for UAV–OT Cybersecurity Threat Identification. GitHub Repository, 2025. Available online: https://github.com/piyenghar/UAVThreatBench (accessed on 30 July 2025).
- OpenAI. GPT-4 Technical Report. OpenAI Website. 2023. Available online: https://openai.com/research/gpt-4 (accessed on 30 July 2025).
- OpenAI. ChatGPT and GPT-3.5 Overview. OpenAI API Documentation. 2023. Available online: https://platform.openai.com/docs/models/gpt-3-5 (accessed on 30 July 2025).
- OpenAI. GPT-4o: OpenAI’s Omnimodal Model. OpenAI Website. 2024. Available online: https://platform.openai.com/docs/models/gpt-4o (accessed on 30 July 2025).
- Meta AI. LLaMA 3: Open Foundation and Instruction Models. Meta AI Blog. 2024. Available online: https://ai.meta.com/llama/ (accessed on 30 July 2025).
- VAGO Solutions. SauerkrautLM: High-Quality German Instruction-Tuned LLaMA-3. HuggingFace Model Card. 2024. Available online: https://huggingface.co/VAGOsolutions/Llama-3-SauerkrautLM-70b-Instruct (accessed on 30 July 2025).
- Google Cloud. Vertex AI Generative AI Overview: Deterministic Output. 2025. Available online: https://cloud.google.com/vertex-ai/docs/generative-ai/learn/overview (accessed on 2 September 2025).
- Anthropic. Anthropic API Documentation: Sampling and Determinism. 2025. Available online: https://docs.anthropic.com/claude/reference/complete_post (accessed on 2 September 2025).
- OpenAI. OpenAI API Documentation: Reproducibility, Seed, and System Fingerprint. 2025. Available online: https://cookbook.openai.com/examples/reproducible_outputs_with_the_seed_parameter (accessed on 2 September 2025).
- Zou, J.; Xu, M.; Liang, P. Non-Determinism of “Deterministic” Decoding Settings in Large Language Models. arXiv 2025, arXiv:2501.01234. [Google Scholar]
- OpenAI. OpenAI API—Chat Completions Format, 2024. Available online: https://platform.openai.com/docs/guides/text (accessed on 14 September 2025).
- SeatGeek. fuzzywuzzy: Fuzzy String Matching in Python. Available online: https://github.com/seatgeek/fuzzywuzzy (accessed on 5 May 2025).
Data Flow Function | Valid UAV Roles | Valid OT Components | Valid Communication Protocols |
---|---|---|---|
Real-time inventory data upload | Inventory Management Drone | MES Server, Industrial Wi-Fi Access Point | WPA2-Enterprise Wi-Fi, Private LTE/5G |
Flight path updates and control signals | Inventory Management Drone, Automated Parts Delivery Drone | AGV Fleet Control System, Industrial Wi-Fi Access Point | WPA2-Enterprise Wi-Fi, Private LTE/5G, Proprietary RF Link, Bluetooth LE |
Automated firmware updates | Inventory Management Drone, Automated Parts Delivery Drone | ICS PLC, MES Server | Wired Ethernet, Private LTE/5G, WPA2-Enterprise Wi-Fi |
Diagnostic logs transfer | Inventory Management Drone, Automated Parts Delivery Drone | AGV Fleet Control System, ICS PLC, MES Server | WPA2-Enterprise Wi-Fi, Private LTE/5G, Wired Ethernet, Bluetooth LE |
Sensor readings (temperature, pressure) | Inventory Management Drone | MES Server, AGV Fleet Control System | WPA2-Enterprise Wi-Fi, Private LTE/5G |
Assembly instructions download | Automated Parts Delivery Drone | AS/RS, MES Server | WPA2-Enterprise Wi-Fi, Private LTE/5G, Wired Ethernet |
Model | T1 | T2 | T3 | T4 | T5 | RED Article |
---|---|---|---|---|---|---|
GPT-4o | ✓ | ✓ | ✗ | ✗ | ✗ | Explicit |
GPT-4.1 | ✓ | ✓ | ✓ | ✗ | ✗ | Explicit |
GPT-4.1-mini | ✓ | ✓ | ✓ | ✗ | ✗ | Explicit |
GPT-4.1-nano | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
GPT-3.5-turbo-16k | ✓ | ✓ | ✓ | ✗ | ✗ | Explicit |
LLaMA-3.1 | ✓ | ✓ | ✓ | ✗ | ✗ | Explicit |
LLaMA-3.3 | ✓ | ✓ | ✓ | ✗ | ✗ | Explicit |
Model | T1 | T2 | T3 | T4 | T5 | RED Article |
---|---|---|---|---|---|---|
GPT-4o | ✓ | ✓ | ✗ | ✗ | ✗ | Explicit |
GPT-4.1 | ✓ | ✓ | ✗ | ✗ | ✗ | Explicit |
GPT-4.1-mini | ✓ | ✓ | ✗ | ✗ | ✗ | Explicit |
GPT-4.1-nano | ✓ | ✓ | ✗ | ✗ | ✗ | Explicit |
GPT-3.5-turbo-16k | ✓ | ✓ | ✗ | ✗ | ✗ | Explicit |
LLaMA-3.1 | ✓ | ✓ | ✗ | ✗ | ✗ | Explicit |
LLaMA-3.3 | ✓ | ✓ | ✗ | ✗ | ✗ | Explicit |