Highlights
What are the main findings?
- A structured prompting framework, guided by HFACS with evidence binding and cross-level checks, enables LLMs to extract operator, precondition, supervisory, and organizational factors from UAV accident narratives, with improved inference on UAV-specific factors.
- A human-factor-annotated UAV accident dataset was developed from ASRS reports and used to benchmark the method, showing performance comparable to expert assessment, particularly on UAV-specific categories.
What is the implication of the main finding?
- LLM-assisted investigation provides a rapid, systematic triage layer that surfaces human-related and organization-related cues early from sparse, unstructured occurrence reports.
- The approach complements safety management and regulatory oversight in low-altitude operations by preserving causal reasoning and supporting human-in-the-loop verification across the reporting-to-classification workflow.
Abstract
UAV accident investigation is essential for safeguarding the fast-growing low-altitude airspace. While near-daily incidents are reported, they are rarely analyzed in depth because current inquiries remain expert-dependent and time-consuming. Because most jurisdictions mandate formal reporting only for serious injury or substantial property damage, a large proportion of minor occurrences receive no systematic investigation, resulting in persistent data gaps and hindering proactive risk management. This study explores the potential of using large language models (LLMs) to expedite UAV accident investigations by extracting human-factor insights from unstructured narrative incident reports. Despite their promise, off-the-shelf LLMs still struggle with domain-specific reasoning in the UAV context. To address this, we developed a human factors analysis and classification system (HFACS)-guided analytical framework that blends structured prompting with lightweight post-processing. This framework systematically guides the model through a staged procedure to infer operators’ unsafe acts, their latent preconditions, and the associated organizational influences and regulatory risk factors. An HFACS-labelled UAV accident corpus comprising 200 abnormal event reports with 3600 coded instances was compiled to support evaluation. Across seven LLMs and 18 HFACS categories, macro-F1 ranged from 0.58 to 0.76; our best configuration achieved a macro-F1 of 0.76 (precision 0.71, recall 0.82), with representative category accuracies above 93%. Comparative assessments indicate that the prompted LLM can match, and on certain tasks surpass, human experts. The findings highlight the promise of automated human-factor analysis for conducting rapid and systematic UAV accident investigations.
1. Introduction
Unmanned aircraft systems (UAS) play a critical role in disaster response, infrastructure inspection, environmental monitoring, precision agriculture, and emerging low-altitude logistics in the developing low-altitude economy [,,]. Despite accelerating operational demand, near-daily incidents are reported, yet they are rarely subjected to systematic causal analysis because formal reporting in most jurisdictions is triggered only by serious injury or substantial third-party property damage [,,]. Investigating these occurrences is crucial to preventing similar accidents in the future []. Under the current regulatory framework in the U.S., for example, a remote pilot must notify the Federal Aviation Administration (FAA) only when a small UAS occurrence involves serious injury or more than $500 in third-party property damage []. Consequently, the vast majority of low-consequence events never enter any authoritative database. Even when a report is filed, follow-up proceedings proceed under the conventional manned-aviation model: the U.S. National Transportation Safety Board notes that a typical accident inquiry requires 12–24 months for the expert-driven process to reach a probable-cause determination []. While this study focuses on the U.S. regulatory context, other jurisdictions adopt similarly limited mandatory reporting thresholds; for example, in Australia, most small remotely piloted aircraft (RPA) operations require reporting only in cases of death or serious injury [], and in the European Union, reporting under Regulation (EU) 376/2014 [] applies to UAS operators pursuant to Article 19(2) of Regulation (EU) 2019/947 [], with mandatory reporting for certified or declared UAS operations and for occurrences involving manned aircraft or resulting in serious injury.
Deriving pre-investigation human-factor cues from initial occurrence reports is a pragmatic strategy in UAS accident work, because human error remains the predominant contributor to mishaps in unmanned-aircraft operations as well as in the broader civil-aviation accident record [,]. Such early extraction narrows the search space before committing scarce resources. For drones, these reports—typically the remote pilot’s mandatory notification or voluntary safety submission—often constitute the only consistently available evidence []. Their structured descriptions enable the timely identification of operator-centric errors stemming from stress, diminished situational awareness, or flawed judgment. This directs limited analytical resources toward the most probable causal pathways, providing invaluable insight to accident investigators.
Unmanned aircraft initial accident reports are typically submitted in an unstructured textual format, offering a distinctive opportunity to apply large language models (LLMs) to aviation safety analytics. Prior studies have shown that coupling LLMs with legacy, labor-intensive investigation pipelines can yield markedly smarter analysis workflows []. Yet these reports are highly fragmented, frequently incomplete, and dense with specialized aeronautical terminology; reporter-introduced omissions and errors further erode data quality and must be repaired with domain expertise. Bringing such semi-structured data into fixed analytical frameworks stretches basic natural language processing (NLP) to its limits, and deploying LLMs in a trustworthy, system-level manner for domain-specific tasks remains challenging []. The difficulty extends to agentic architectures, where LLMs must acquire explicit reasoning skills, tool-use proficiency, and embedded aviation knowledge to execute complex analytic workflows []. Designing modular pipelines that coordinate specialized components while preserving interpretability and reliability is therefore a central hurdle []. From a deployment perspective, UAV-assisted mobile-edge computing has explored joint optimization of computation offloading and flight paths under tight latency; a recent KNN-DDPG approach reports lower end-to-end delay, motivating near-real-time, edge-side HFACS-LLM deployment []. Recent studies show that injecting domain knowledge into LLM-driven agents significantly improves reasoning accuracy and task performance []. Complementary to these LLM-centered advances, digital-twin-based drone forensics reconstructs incidents in ROS 2/Gazebo to recover physical trajectories and system states; by contrast, our study targets narrative-driven HFACS–LLM attribution from initial reports [].
Nevertheless, a true end-to-end multi-agent framework for curating and analyzing UAV Aviation Safety Reporting System (ASRS) data [] is still lacking. To close this gap, we introduce a human factors analysis and classification system (HFACS)-guided, human-in-the-loop multi-agent architecture that interactively assists investigators in completing incomplete reports, secures data integrity before analysis, and then classifies human errors across the four HFACS tiers: unsafe acts, preconditions, unsafe supervision, and organizational influence [,] (see Figure 1). This integration leverages AI for systematic safety assessment while retaining essential expert oversight throughout the reporting-to-classification lifecycle.

Figure 1.
Main structure of HFACS 8.0.
Notably, the unmodified HFACS 8.0—though well established for crewed aviation—does not fully express the techno-human particularities of UAV operations, including remote piloting, heavy automation dependence, data-link fragility, and distinctive ground-station interfaces [,,]. We therefore introduce HFACS-UAV, which preserves the four canonical tiers but refines selected definitions and nanocodes for UAV contexts and augments narrative analysis with structured inputs from the ASRS. Concretely, HFACS-UAV treats free-text narratives as the evidential core. The search space is conditioned through a field-to-factor mapping. For example, phase-of-flight descriptors cue likely error modes, while atmospheric and lighting conditions indicate physical-environment factors and perceptual risks. Automation usage highlights technological-environmental influences and out-of-the-loop states. Information on crew roles and handoffs signals personnel factors and supervisory issues. Finally, callback notes often surface latent organizational contributors [,]. The resulting procedure proceeds in three stages. Stage 1 classifies the operator’s unsafe acts by determining, in sequence, whether they represent a skill-based error, a judgment or decision-making error, a perceptual error, or a deliberate violation, with each classification explicitly grounded in cited narrative evidence. Stage 2 infers the enabling preconditions—most notably UAV-centric technological and physical environment factors and the operator’s state and readiness—anchored to the report text and conditioned, though not dictated, by structured fields. Stage 3 extends the analysis to higher tiers by identifying contributory supervisory and organizational influences where the narrative permits, thereby completing the multi-level HFACS reasoning process. To strengthen reasoning, we employ level-wise prompting with explicit evidence binding and lightweight, programmatic consistency checks across levels. 
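The three-stage, level-wise procedure with evidence binding and cross-level consistency checks can be sketched as follows. This is a minimal illustration, not the paper's actual prompts: the category names are abbreviated from HFACS, and `classify_level` is a placeholder for the underlying LLM call.

```python
# Hedged sketch of level-wise HFACS classification with evidence binding and a
# lightweight cross-level consistency check. Category names are abbreviated and
# the classify_level stub stands in for an actual LLM call.

HFACS_LEVELS = {
    1: ["skill_based_error", "decision_error", "perceptual_error", "violation"],
    2: ["technological_environment", "physical_environment", "operator_state"],
    3: ["inadequate_supervision", "planned_inappropriate_operations"],
    4: ["organizational_process", "resource_management"],
}

def classify_level(narrative, level, categories):
    """Placeholder for a prompted LLM call: returns {category: evidence or None}.

    Here it trivially 'detects' a decision error so the pipeline is runnable."""
    return {c: (narrative if level == 1 and c == "decision_error" else None)
            for c in categories}

def analyze(narrative):
    findings = {}
    # Stage-by-stage traversal: unsafe acts first, then preconditions, then
    # supervisory and organizational tiers.
    for level, cats in sorted(HFACS_LEVELS.items()):
        for cat, evidence in classify_level(narrative, level, cats).items():
            if evidence:  # evidence binding: keep only findings with a citation
                findings[cat] = {"level": level, "evidence": evidence}
    # Consistency check: higher-tier findings without any Level 1 unsafe act
    # are flagged for human review rather than silently accepted.
    has_unsafe_act = any(v["level"] == 1 for v in findings.values())
    for v in findings.values():
        v["needs_review"] = v["level"] > 1 and not has_unsafe_act
    return findings
```

In a full implementation, `classify_level` would issue the stage-specific prompt for that tier and parse the model's cited narrative spans; the programmatic check at the end is what keeps the multi-level output internally coherent.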
Experiments on self-built UAV-ASRS-HFACS reports with a modern LLM indicate that this configuration surpasses generic HFACS prompts and rule-based baselines, with the most significant gains on UAV-salient categories such as technological environment and organizational process; ablations confirm that both the refined UAV definitions and the field-to-factor conditioning contribute materially to the improvement.
In summary, the key contributions of this work include the following:
- We present an end-to-end LLM-assisted pipeline that integrates intake, human-in-the-loop supplementation, data validation, analytic triage, and HFACS reasoning, with a publicly available prototype (access at https://uav-accident-forensics.streamlit.app/, accessed on 7 October 2025).
- We propose HFACS-UAV, an adaptation of the four-tier HFACS framework with UAV-specific nanocodes and ASRS field mapping, implemented through a three-stage prompt design with evidence binding and consistency checks.
- We provide a curated UAV-ASRS-HFACS-2010–2025(05) corpus, covering Part 107, Section 44809, and Public UAS operations. Each record links narratives with structured fields; the dataset will be released under ASRS terms.
- We demonstrate through comparative evaluation that GPT-5 models match, and in some UAV-specific categories surpass, human experts in HFACS inference, highlighting the potential of LLMs in UAV accident investigation.
The rest of this paper is organized as follows. Section 2 reviews related work and identifies research gaps. Section 3 introduces the HFACS-UAV framework and our incident analysis pipeline. Section 4 describes our annotated UAV incident corpus. Section 5 presents experimental results and ablation studies. Section 6 discusses implications and concludes the research.
2. Related Works
2.1. UAV Accident Investigation: Current Practices and Challenges
The rapid proliferation of unmanned aerial vehicles (UAVs) across commercial, recreational, and governmental sectors—together with the emerging “low-altitude economy”—has fundamentally transformed the practice of accident investigation. Traditional methods developed for manned aviation through decades of formal mishap analysis and logical-model research [], and codified in ICAO Annex 13 [], provide an essential foundation, but they only partially address UAV-specific failure modes such as command-and-control (C2) link loss, on-board autonomy malfunctions, and the distinctive workload and situational-awareness challenges faced by remote pilots.
Empirical evidence confirms the gap: Wild et al. [] analyzed 152 civil-drone accident and incident reports and showed that technical system malfunctions—not pilot factors—were the single largest causal category. Conversely, when human error is involved, the consequences can be severe: a HFACS (see Figure 1) review of 221 U.S. military UAV mishaps found that 60% contained at least one unsafe act by a remote pilot or supervisor []. These findings underscore the need for investigation frameworks that integrate both engineered-system and human-system perspectives. Regulation adds another layer of complexity. In the United States, the Small UAS Rule (14 CFR Part 107) requires formal reporting only when an operation results in serious injury or property damage greater than $500, creating a substantial under-reporting bias []. The European Union’s Implementing Regulation (EU) 2019/947 establishes comparable thresholds, leading to a similar “iceberg” of unreported occurrences []. Without a systematic collection of lower-severity events, valuable lead indicators of emerging hazards remain hidden.
Finally, data availability remains a critical bottleneck. A recent FAA ASSURE study showed that fewer than 5% of small-UAS platforms in the United States carry any form of crash-survivable flight-data recorder, and that where logging exists, parameter sets and sampling rates vary widely across manufacturers []. The absence of cockpit-voice or datalink-voice recordings forces investigators to rely on partial telemetry and post-event interviews, complicating causal reconstruction. These technological, regulatory, and data-management challenges highlight the need for investigation tools and evidence-collection strategies that are specifically tailored to the socio-technical realities of UAV operations.
2.2. HFACS: Origins, Structure, and Aviation Applications
The human factors analysis and classification system (HFACS) is based on Reason’s theory of organizational accidents, which explains how unsafe acts are enabled by latent conditions within supervision and organizations rather than being solely the result of frontline operator error [,,]. Building on this conceptual basis, Wiegmann and Shappell formalized HFACS as a four-level framework—(L1) Unsafe Acts, (L2) Preconditions for Unsafe Acts, (L3) Unsafe Supervision, and (L4) Organizational Influences—with defined causal categories supporting systematic analysis of human and organizational contributory factors in aviation mishaps (see Figure 1) [,,]. Over the past two decades, HFACS has been widely adopted in accident investigation, safety audits, and safety management systems (SMS), providing a common taxonomy that improves comparability, traceability, and learning across events []. The U.S. Department of Defense maintains an updated HFACS 8.0 implementation that clarifies definitions and nanocodes for consistent application across domains [].
In UAV operations, recent studies both reinforce and refine HFACS. Analyses of medium/large-UAV mishaps coded with HFACS report that Level 1 Decision Errors commonly predominate, which are statistically associated with Level 2 Technological Environment and Adverse Mental State; contributory factors also appear at supervisory and organizational tiers, informing regulator- and SMS-oriented interventions []. Focusing on loss of control in flight (LOC-I), El Safany and Bromfield examined a corpus of UAV accidents and identified design/manufacturing issues as significant factors; they further observed that recovery attempts are often missing in reported cases. Comparing HFACS, AcciMap, and an Accident Route Matrix, they proposed an adapted ARM-UAV to better maintain event sequences and illustrate recovery for LOC-I cases []. Complementing taxonomy-based analysis, Alharasees and Kale integrated HFACS with OODA and SHELL using an analytic hierarchy process and combined this with real-time operator physiology (HR/HRV/RR) across automation levels; novices exhibited higher stress responses, emphasizing operator-centered interventions within training [].
2.3. Large Language Models in Aviation Safety Analysis
Aviation safety text mining has advanced rapidly from rule-based keyword heuristics and shallow classifiers to transformer encoders purpose-built for specialist corpora. Early studies such as Tanguy et al. [] framed Directorate-General of Civil Aviation narratives as bag-of-words vectors and obtained F1 scores around 0.80 with support-vector machines, yet they faltered when confronted with the syntactic complexity and dense jargon of accident prose. The arrival of BERT-style encoders narrowed this gap: fine-tuning generic DistilBERT lifted exact-match accuracy for question answering on the ASRS corpus to ∼70%, but the decisive improvement came from domain pre-training. AviationBERT, trained on roughly 2.3 million ASRS and NTSB narratives, outperformed BERT-Base by 4–10 percentage points on topic classification and semantic similarity tasks, while NASA’s SafeAeroBERT achieved equivalent gains on causal-factor labeling [,]. These results establish a clear pattern: contextual encoders benefit disproportionately from sector-specific corpora that capture operational phraseology, regulatory references, and domain abbreviations. The generative turn has further expanded analytical possibilities. Large models such as GPT-3 and GPT-4 can now draft coherent accident summaries and infer causal chains with only few-shot prompts []. Crucially, Liu et al. [] embedded the HFACS into a stepwise chain-of-thought scaffold, raising adjusted average F1 by 27.5% over zero-shot GPT-4 and surpassing human raters on several unsafe-act sub-tasks. This success has catalyzed a new class of bespoke generative models: AviationGPT extends LLaMA-2 with continual training on flight-operations data, delivering over 40% gains on in-domain QA and summarization benchmarks while foreshadowing multimodal LLMs that will fuse cockpit-voice, flight-data-recorder, and textual evidence within a unified narrative engine [].
Yet, these advances remain overwhelmingly concentrated in manned aviation domains. Recent systematic reviews reveal that fewer than 2% of aviation safety NLP papers address UAV incidents [,]. This disparity is reflected in U.S. Air Force safety data: between 2006 and 2011, the MQ-9 Reaper accumulated a lifetime Class-A mishap rate of 4.6 per 100,000 flight-hours, whereas mature manned fleets recorded rates below 1 per 100,000 flight-hours during 2010–2018 [,]. The few UAV-specific efforts demonstrate both promise and fundamental limitations. Zibaei and Borth [] developed ArduCorpus, containing 935,000 sentences from UAV technical resources, achieving 32% higher precision in cause–effect extraction compared to pure machine learning approaches. Similarly, Editya et al. [] fine-tuned multimodal models for drone accident forensics, yet these isolated efforts lack the comprehensive framework necessary for systematic UAV safety analysis.
The technical challenges of adapting LLMs to UAV contexts extend beyond data scarcity. UAV operations introduce unique failure modes absent in manned aviation: data-link interruptions, ground control station interface failures, and autonomous system malfunctions that existing aviation taxonomies fail to capture adequately []. Moreover, the heterogeneity of UAV platforms—from consumer quadcopters to military-grade fixed-wing systems—defies the relatively standardized frameworks developed for commercial aviation. Token length constraints further compound these challenges, as comprehensive UAV accident analysis requires simultaneous processing of telemetry logs, sensor streams, environmental data, and operator narratives that frequently exceed current model capacities.

Regulatory and validation challenges present additional barriers. The absence of AI-specific standards in aviation has prevented the adoption of LLM-based safety analysis in operational contexts []. Traditional deterministic validation approaches prove inadequate for probabilistic language models, while the “black box” nature of transformer architectures conflicts with aviation’s requirements for transparent, auditable decision-making. Recent audits indicate that explainability dashboards serve technical developers rather than frontline investigators, leaving a critical transparency gap that undermines trust and adoption [].

Finally, the overwhelming focus on post-incident analysis neglects real-time safety monitoring capabilities. While manned aviation benefits from comprehensive Flight Data Monitoring (FDM) programs integrated with safety management systems, UAV operations lack equivalent infrastructure for continuous safety assessment. The latency requirements for real-time UAV decision support—often measured in milliseconds for collision avoidance—exceed current LLM inference capabilities by orders of magnitude, constraining applications to retrospective analysis rather than proactive risk mitigation.
Overall, although LLMs have already delivered substantive gains in manned-aviation safety analysis—through specialized models such as SafeAeroBERT and AviationGPT—their application to UAV safety remains nascent and fragmented. The unique operational characteristics of unmanned systems—remote piloting paradigms, heterogeneous platforms, elevated accident rates, and distinct failure modes—demand purpose-built approaches rather than adapted manned aviation solutions. This gap motivates our development of HFACS-UAV and associated prompting methodologies, specifically designed to address the technical, regulatory, and operational challenges inherent in UAV accident investigation while leveraging the demonstrated capabilities of modern language models.
2.4. Prompt Engineering and Multi-Agent Systems for Domain-Specific Reasoning
The application of LLMs to complex reasoning tasks has revealed a fundamental tension between generative capabilities and propensity for hallucination in specialized domains []. Chain-of-Thought (CoT) prompting has emerged as a transformative approach, enabling LLMs to decompose complex problems into tractable reasoning steps []. This paradigm has evolved into sophisticated variants: Tree of Thoughts (ToT) explores multiple reasoning paths simultaneously [], while self-consistency sampling aggregates diverse reasoning chains to improve reliability []. The distinction between manual and automatic CoT generation reveals critical insights for domain adaptation—while zero-shot CoT using “let’s think step by step” shows promise [], domain-specific tasks require carefully crafted examples that encode expert knowledge []. Recent work demonstrates that integrating domain expertise directly into prompt structures yields substantial improvements: Singhal et al. [] achieved physician-level performance by embedding clinical guidelines into medical reasoning prompts, while Liu et al. [] integrated HFACS into aviation safety analysis, surpassing human expert accuracy on several classification tasks. The emergence of multi-agent systems further expands these capabilities []. By orchestrating specialized agents that maintain distinct knowledge bases and reasoning strategies, complex analytical tasks can be decomposed across multiple experts []. Hong et al. [] demonstrated that structured multi-agent collaboration could achieve 85.9% success rates on complex software engineering tasks, suggesting similar potential for multi-faceted accident investigation.
Complementary to CoT and multi-agent prompting, recent retrieval-augmented generation (RAG) methods reduce hallucinations by explicitly coupling reasoning with targeted retrieval. ReAct interleaves chain-of-thought with tool calls so the model plans, retrieves, and verifies iteratively []; Self-RAG learns when to retrieve and how to critique drafts via reflection tokens, improving factuality and citation faithfulness []; Active RAG (FLARE) decides when/what to retrieve during generation rather than only once up front []; GraphRAG builds corpus-level knowledge graphs to support multi-hop evidence aggregation for complex queries [].
Yet, significant gaps remain in applying these techniques to UAV accident analysis. Existing prompt engineering assumes stable domain knowledge, while UAV operations span rapidly evolving technologies with shifting failure modes—from early loss-of-link scenarios to contemporary autonomy conflicts []. Current multi-agent frameworks lack the rigor required for safety-critical analysis, providing neither evidence traceability nor deterministic guarantees demanded by aviation standards []. Most critically, the heterogeneous nature of UAV accident data—spanning narrative reports, telemetry logs, maintenance records, and regulatory notices—exceeds current integration capabilities. While retrieval-augmented generation shows promise [], existing approaches focus on isolated tasks rather than the end-to-end investigative workflow from initial report intake through final safety recommendations. The fragmentation of UAV incident data, with most minor occurrences never entering formal databases, demands systems capable of active data collection, intelligent report completion, and progressive refinement through human-in-the-loop interaction. These limitations motivate our development of an integrated multi-agent framework that combines HFACS-guided prompting with comprehensive investigation capabilities, addressing the complete lifecycle of UAV accident analysis while maintaining the rigor demanded by safety-critical applications.
3. Methodology
3.1. UAV-Specific Enhancement-Pattern Development
This study adopted a four-phase, evidence-driven procedure (Figure 2) to extend HFACS 8.0 to unmanned-aviation contexts. The design mirrors best practice in taxonomy engineering for aviation mishap research [,,], ensuring both methodological rigor and theoretical fidelity.

Figure 2.
Systematic four-phase methodology for developing UAV-specific HFACS enhancement patterns.
3.1.1. Phase 1—Data Acquisition and Screening
We harvested 847 small-UAS occurrence reports from the NASA Aviation Safety Reporting System (ASRS) [] and the National Transportation Safety Board (NTSB) accident database [], covering the period January 2010–May 2025. This window brackets the introduction of FAA Part 107 [] and captures sUAS integration and oversight efforts [,]. Reports were retained when they satisfied the inclusion rule:
$$\mathrm{Include}_i \;=\; H_i \,\wedge\, \left(D_i \geq D_{\min}\right) \,\wedge\, C_i,$$
where $H_i$ denotes human-operator involvement, $D_i$ a five-point narrative-detail score with retention threshold $D_{\min}$, and $C_i$ the presence of causal statements. Dual-assessor screening yielded 726 eligible reports (85.7% retention; Cohen’s $\kappa$), providing the empirical substrate for pattern discovery.
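Treating the inclusion rule as a conjunction of the three stated criteria, it can be sketched as a simple predicate; the field names and the threshold value `d_min` are illustrative assumptions, since the paper does not state the exact cutoff.

```python
def include(report, d_min=3):
    """Inclusion-rule sketch: retain a report only when it shows human-operator
    involvement AND a narrative-detail score of at least d_min (1-5 scale) AND
    explicit causal statements. d_min and field names are illustrative."""
    return (report["human_involved"]
            and report["detail_score"] >= d_min
            and report["has_causal_statement"])
```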
3.1.2. Phase 2—Knowledge Synthesis and Pattern Extraction
In the subsequent knowledge-synthesis phase, thematic analysis was conducted following the established methodology of Braun and Clarke []. Recent large-scale UAV human-factor audits [,,] informed the inductive coding stage and ensured construct coverage. The analysis employed both inductive and deductive coding approaches, with initial codes derived from existing HFACS categories and emergent codes identified through a systematic review of incident narratives. Patterns were extracted and ranked using a frequency-weighted relevance-scoring method:
$$R_j \;=\; \frac{1}{N}\sum_{i=1}^{N} x_{ij}\, w_i,$$
where $R_j$ is the normalized relevance score of pattern $j$; $N$ is the total number of reports (726); $x_{ij}$ is the binary occurrence of pattern $j$ in report $i$ (0 or 1); and $w_i$ denotes the severity weight based on outcome severity (1 = minor incident, 3 = substantial-damage accident). Severity weights were assigned in accordance with the accident-outcome scale in the FAA Aeronautical Information Manual (AIM § 11-8-4, 2025) []. Initially, 35 candidate patterns exceeded the empirically determined relevance threshold; these were subsequently refined through iterative team review, yielding a consolidated list of 26 patterns that demonstrated both theoretical coherence and practical significance.
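The frequency-weighted relevance score can be computed in a few lines; the code below assumes normalization by the number of reports $N$, which is one plausible reading of "normalized" in the text.

```python
def relevance_scores(x, w):
    """Frequency-weighted relevance R_j = (1/N) * sum_i x[i][j] * w[i].

    x: reports-by-patterns binary occurrence matrix (0/1 entries)
    w: per-report severity weights (1 = minor incident, 3 = substantial damage)
    Normalization by N is an assumption about the paper's formula."""
    N = len(w)
    n_patterns = len(x[0])
    return [sum(x[i][j] * w[i] for i in range(N)) / N for j in range(n_patterns)]
```

For example, a pattern occurring in one minor incident and one substantial-damage accident out of three reports scores (1 + 3)/3 ≈ 1.33.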
As depicted in Figure 2, the methodology progressed through four distinct phases, each supported by specific quality-control measures. Phase 1 established the empirical foundation through systematic data collection and screening, achieving high inter-reviewer reliability (Cohen’s $\kappa$) for inclusion decisions. Phase 2 employed established thematic-analysis techniques to extract and consolidate patterns, with a frequency-weighted scoring method ensuring that patterns captured both prevalence and severity.
3.1.3. Phase 3—Expert Validation
Phase 3 incorporated expert domain knowledge through a structured Delphi process [,], achieving substantial consensus across all patterns and delivering statistical validation of pattern structure and content validity, thereby confirming the theoretical coherence of the enhanced framework. Expert validation used a two-round Delphi technique engaging five domain specialists with 7–15 years of UAV and human-factors experience. The panel comprised two certified remote pilots, one aviation safety analyst, one automation researcher, and one regulatory-compliance engineer.
In Round 1, each expert rated every pattern on three criteria—theoretical relevance, practical significance, and definitional clarity—using a five-point Likert scale. The composite score for pattern p was as follows:
$$S_p \;=\; \frac{1}{E}\sum_{e=1}^{E} r_{e,p},$$
where $E$ is the number of experts and $r_{e,p}$ is expert $e$’s mean rating for pattern $p$. Patterns whose composite scores met the retention threshold advanced to Round 2, reflecting the 75th-percentile retention rule recommended by []. Inter-rater reliability across all patterns was quantified with Fleiss’ kappa,
$$\kappa \;=\; \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e},$$
yielding a kappa value (95% CI: 0.65–0.92) interpreted as substantial agreement under the Landis–Koch scale [].
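Fleiss' kappa can be computed directly from a count matrix in which each row is a rated item and each column a rating category; a minimal sketch:

```python
def fleiss_kappa(M):
    """Fleiss' kappa for M[item][category] rating counts.

    Each row must sum to the (constant) number of raters. Returns
    kappa = (P_bar - P_e) / (1 - P_e)."""
    n_items = len(M)
    n_raters = sum(M[0])
    n_cats = len(M[0])
    # Marginal proportion of assignments to each category.
    p_j = [sum(row[j] for row in M) / (n_items * n_raters) for j in range(n_cats)]
    # Per-item observed agreement.
    P_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in M]
    P_bar = sum(P_i) / n_items
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)
```

Perfect agreement (every rater picking the same category per item) yields kappa = 1, while agreement at chance level yields 0.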
Round 2 targeted items with initial disagreement on the individual pattern. After reviewing anonymized feedback, experts re-rated those items. Consensus was measured with the inter-quartile range (IQR) index,
$$C_p \;=\; 1 - \frac{\mathrm{IQR}_p}{W},$$
where $C_p$ is the consensus index for pattern $p$, $\mathrm{IQR}_p$ the inter-quartile range of re-ratings, and $W$ the effective width of the 1–5 scale ($W = 4$). A higher $C_p$ indicates stronger consensus. Following [], consensus was declared when $C_p$ met the prescribed benchmark (equivalently, when $\mathrm{IQR}_p$ fell below the corresponding width). All 26 patterns exceeded this benchmark (SD = 0.08).
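The IQR-based consensus index is straightforward to compute; the quartile convention below (Python's "exclusive" method) is an assumption, as the paper does not state which one was used.

```python
import statistics

def consensus_index(ratings, scale_width=4.0):
    """Consensus index C_p = 1 - IQR / W for 1-5 Likert re-ratings (W = 4).

    Uses statistics.quantiles with the default 'exclusive' method; other
    quartile conventions would give slightly different IQR values."""
    q1, _, q3 = statistics.quantiles(ratings, n=4)
    return 1.0 - (q3 - q1) / scale_width
```

For instance, re-ratings of [4, 4, 4, 5, 5] give IQR = 1 and hence C_p = 0.75, while unanimous ratings give C_p = 1.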
Table A1 presents the complete set of 26 UAV-specific enhancement patterns with their operational definitions and representative case examples derived from the ASRS incident database. Each pattern is mapped to specific HFACS categories while addressing unique operational characteristics of UAV systems. The organizational-level patterns emphasize regulatory compliance challenges specific to UAV operations, including Part 107 requirements and LAANC authorization processes that have no direct equivalent in traditional manned aviation. The preconditions level demonstrates the highest pattern concentration, reflecting the critical role of technological dependencies such as command-and-control link reliability and telemetry accuracy in UAV safety outcomes.
3.1.4. Phase 4—Statistical Validation
Content validity was assessed through expert evaluation using Lawshe’s Content Validity Ratio (CVR) []:
$$\mathrm{CVR}_p \;=\; \frac{n_e - E/2}{E/2},$$
where $n_e$ is the number of experts rating pattern $p$ as essential on the 5-point scale, and $E$ is the total number of experts. With five experts, Lawshe’s CVR can only assume values in {−1, −0.6, −0.2, 0.2, 0.6, 1.0}. Individual CVR values therefore ranged from 0.60 to 1.00, and 21 of 26 patterns (80.8%) met the critical cutoff of 0.60 []. The mean CVR was 0.72.
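Lawshe's CVR and its discrete value set for a five-expert panel can be verified with a few lines:

```python
def cvr(n_essential, n_experts):
    """Lawshe's Content Validity Ratio: CVR = (n_e - E/2) / (E/2),
    where n_e experts out of E rate the item as essential."""
    half = n_experts / 2.0
    return (n_essential - half) / half

# With E = 5, the only attainable values are {-1, -0.6, -0.2, 0.2, 0.6, 1.0}:
# cvr(0,5) = -1.0, cvr(2,5) = -0.2, cvr(4,5) = 0.6, cvr(5,5) = 1.0.
```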
Given the limited size of our expert panel (5 experts × 26 patterns), traditional factor analysis was not feasible. We therefore employed theoretical mapping validation as an alternative. This approach, whereby each pattern was systematically mapped to its corresponding HFACS level, achieved 96% expert consensus and provides appropriate structural validity evidence given our sample constraints.
Model fit was further evaluated using confirmatory indices from the expert rating data:
- Pattern-level agreement: mean per-pattern agreement in the range 0.71–0.94.
- Content validity index: 85% (23/26 patterns with CVR at or above the cutoff).
- Theoretical alignment: 96% expert consensus on HFACS mapping.
- Practical relevance: Mean rating 4.2/5.0 (SD = 0.6).
These metrics collectively indicate that the 26 UAV-specific enhancement patterns demonstrate strong psychometric properties and appropriately extend the HFACS framework while maintaining theoretical coherence and practical applicability.
This enhanced framework fills critical methodological gaps in current UAV safety analysis. Its systematic, statistically validated development provides an empirically robust extension of HFACS tailored explicitly to UAV operations, ensuring both practical relevance and theoretical alignment with existing aviation safety frameworks, and it lays a foundation for more effective incident investigation and prevention strategies in the rapidly evolving unmanned aviation sector. Detailed definitions of all extracted UAV enhancement patterns are provided in Table A1.
3.2. HFACS-UAV Framework Development
Development of the HFACS-UAV framework proceeded through three sequential stages: architectural design, pattern encoding, and LLM integration.
3.2.1. Stage 1—Architectural Design
The framework extends the four-level HFACS hierarchy [] via a dual-taxonomy architecture that embeds UAV-specific patterns while preserving the original causal structure (Figure 3). The HFACS 8.0 taxonomy is preserved in full (four levels, 18 categories) with no deletions or renamings; UAV-specific factors are introduced solely as additive extensions mapped to their respective HFACS parent categories. Each pattern is linked to its parent HFACS category, enabling the following:

Figure 3.
HFACS–UAV architecture linking 26 UAV enhancement patterns to HFACS 8.0. UPPER-CASE codes (e.g., AE100, PE200, SI000, OC000) indicate original HFACS 8.0 categories, the titled pattern nodes (e.g., C2 Link Reliability, Battery Constraints) are UAV-specific extensions mapped to their HFACS parents.
- (a) Selective activation—patterns are invoked only when incident context demands;
- (b) Backward compatibility with classic HFACS analyses;
- (c) Parallel causal reasoning across mixed manned- and unmanned-aviation chains [].
Together, these properties ensure both flexibility and interoperability of the HFACS-UAV framework. For clarity and analytical consistency, we organize the taxonomy into four hierarchical levels (L1–L4).
3.2.2. Stage 2—Pattern Encoding
Each enhancement pattern is defined as a five-tuple

$P = (D, I, M, R, C)$,

where D = definition, I = empirical indicators, M = HFACS mappings, R = inter-pattern relations, and C = activation constraints []. Directed-graph relations enable causal-chain reconstruction using do-calculus reasoning [].
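The five-component pattern encoding can be mirrored as a lightweight record; the class and field names below are illustrative, and the example instance paraphrases a Table A1-style pattern rather than quoting it.

```python
from dataclasses import dataclass, field

@dataclass
class EnhancementPattern:
    """One UAV enhancement pattern (D, I, M, R, C) as described in the text."""
    definition: str                      # D: operational definition
    indicators: list[str]                # I: empirical indicators in narratives
    hfacs_mappings: list[str]            # M: parent HFACS 8.0 categories
    relations: dict[str, list[str]]      # R: directed edges to related patterns
    constraints: list[str] = field(default_factory=list)  # C: activation constraints

# Illustrative instance only (field values are paraphrased, not from Table A1):
c2_link = EnhancementPattern(
    definition="Degraded or lost command-and-control (C2) link",
    indicators=["lost link", "signal dropout", "return-to-home triggered"],
    hfacs_mappings=["PE200"],
    relations={"precedes": ["Flyaway Event"]},
    constraints=["UAV operations only"],
)
print(c2_link.hfacs_mappings[0])  # PE200
```

Keeping the relations as directed edges is what allows a causal chain to be walked from an activated pattern back to its HFACS parents.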
3.2.3. Stage 3—LLM Integration
The framework’s integration with large language models employs structured prompt engineering to leverage encoded pattern representations for automated incident analysis (Figure 4). Incident narratives T are processed through a confidence-weighted pattern matching algorithm:

$\mathrm{Conf}(T, p) = \alpha\, f_{\mathrm{sem}}(T, p) + \beta\, f_{\mathrm{ind}}(T, p) + \gamma\, f_{\mathrm{ctx}}(T, p)$,

where $\alpha$, $\beta$, and $\gamma$ represent weighting parameters for semantic similarity, indicator presence, and contextual relevance, respectively. The functions $f_{\mathrm{sem}}$, $f_{\mathrm{ind}}$, and $f_{\mathrm{ctx}}$ compute semantic similarity, indicator matching, and contextual relevance between incident narratives and pattern components. $\mathrm{Conf}(T, p)$ ranges from 0 to 1, with higher values indicating stronger pattern activation.
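A minimal sketch of the confidence-weighted matching step. The symbol names (f_sem, f_ind, f_ctx) and the weights are assumptions for illustration; in the real system the three components would be embedding similarity, indicator lookup, and context scoring rather than stubs.

```python
def activation_score(text, pattern, f_sem, f_ind, f_ctx,
                     alpha=0.4, beta=0.4, gamma=0.2):
    """Confidence-weighted pattern-matching score in [0, 1].

    alpha/beta/gamma weight semantic similarity, indicator presence, and
    contextual relevance; the values here are illustrative, not the paper's.
    """
    # A convex combination of [0, 1] components keeps the score in [0, 1].
    assert abs(alpha + beta + gamma - 1.0) < 1e-9
    return (alpha * f_sem(text, pattern)
            + beta * f_ind(text, pattern)
            + gamma * f_ctx(text, pattern))

# Stub component functions standing in for the real scorers:
score = activation_score(
    "UAV lost C2 link during climb", "C2 Link Reliability",
    f_sem=lambda t, p: 0.8, f_ind=lambda t, p: 1.0, f_ctx=lambda t, p: 0.5,
)
print(round(score, 2))  # 0.82
```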

Figure 4.
End-to-end LLM workflow from incident narratives and pattern schemas to confidence-scored, structured outputs.
The prompt engineering methodology incorporates the complete pattern knowledge base as contextual information, enabling LLM access to pattern definitions, indicators, and inter-pattern relationships during analysis. Few-shot learning examples derived from expert-analyzed incidents provide concrete examples of pattern application, ensuring consistent interpretation across incident types [].
The integration employs structured JSON output format containing pattern identifications, confidence scores, supporting evidence excerpts, and reasoning chains that trace from identified patterns to textual evidence. For cases where pattern matching confidence falls below predetermined thresholds, the system implements human-in-the-loop mechanisms that flag incidents for expert review []. This hybrid approach ensures analytical quality while enabling scalable processing of large incident datasets, maintaining compatibility with the dual-taxonomy architecture.
3.3. End-to-End Incident Processing Pipeline
The theoretical framework culminates in our end-to-end incident processing pipeline that integrates automated report collection, intelligent completion, and evidence-grounded analytical workflows. This comprehensive system transforms the HFACS-UAV methodology into a practical tool capable of processing real-world incident data while maintaining analytical rigor and transparency. Figure 5 presents the complete system interface, demonstrating the integration of multiple analytical modules within a unified platform.

Figure 5.
End-to-end platform integrating system configuration, data management, LLM-based expert analysis, and an HFACS coding interface.
3.3.1. Intelligent Data Preprocessing and Completion System
The system incorporates a structured pipeline that addresses the inherent incompleteness and variability of real-world incident reports. The preprocessing begins with automated ASRS data parsing and UAV-specific incident identification, followed by systematic completeness assessment and intelligent information completion.
Data Completeness Assessment: The system evaluates data sufficiency through a structured assessment of critical information fields. Based on the analysis of ASRS reporting requirements and HFACS analytical needs, we identified 23 critical fields across four categories: temporal–spatial context (date, time, location, airspace), aircraft specifications (make, model, weight, propulsion type), operational parameters (flight phase, control method, mission type), and incident characteristics (primary problem, contributing factors, environmental conditions). The completeness assessment is performed by the LLM through structured evaluation prompts that assess both field presence and information quality.
Intelligent Information Completion (human-in-the-loop): After the witness submits the oral narrative via the smart reporting interface, the system performs LLM-based extraction, assigns confidence to each field, and tracks the completeness of the ASRS form. For records with insufficient completeness, the system employs LLM-driven contextual reasoning to infer missing information and generate targeted completion questions. The completion mechanism leverages the comprehensive UAV knowledge base to provide domain-informed recommendations and maintains explicit confidence tracking for all inferred information. These questions are returned to the witness for clarification, while a human investigator reviews the preliminary extraction and the evolving ASRS form. All inferred items retain explicit confidence tags and remain distinguishable from explicitly reported information. The LLM investigator and the human investigator operate as a joint team: the witness answers or edits, the system integrates new inputs and re-scores, and the human investigator performs a final confirmation before the record proceeds to HFACS 8.0 analysis (see Figure 6).

Figure 6.
Human-in-the-loop information-completion workflow linking witness Q&A, LLM-based extraction, LLM-guided question generation, and investigator final confirmation.
3.3.2. LLM Integration and Intelligent Analysis Engine
The framework operationalizes the HFACS-UAV architecture through a sophisticated LLM integration strategy that transforms abstract reasoning frameworks into executable analytical workflows. The implementation employs structured prompt engineering, multi-dimensional confidence assessment, and comprehensive quality validation mechanisms.
Structured Prompt Engineering: The system implements dynamic prompt generation that systematically incorporates the four components of the HFACS-UAV architecture: domain knowledge injection, task decomposition, evidence binding, and confidence quantification. Each analysis session begins with the injection of relevant UAV operational knowledge selected from the comprehensive knowledge base containing 600+ technical terms, 28 UAV-specific risk factors, and regulatory frameworks. The task decomposition function breaks complex HFACS classification into sequential reasoning steps, guiding the LLM through systematic evaluation of each potential category with explicit reasoning chains.
Multi-Dimensional Quality Assessment: The system implements a comprehensive quality evaluation framework that assesses classification reliability across multiple dimensions. For each HFACS classification, the quality score is computed as a weighted combination of four component scores:

$Q = w_1 q_{\mathrm{conf}} + w_2 q_{\mathrm{evid}} + w_3 q_{\mathrm{reason}} + w_4 q_{\mathrm{consist}}$,

where $q_{\mathrm{conf}}$ represents confidence appropriateness (evaluating whether confidence levels align with evidence strength), $q_{\mathrm{evid}}$ measures evidence quality (assessing specificity and relevance of textual citations), $q_{\mathrm{reason}}$ evaluates reasoning quality (examining logical coherence and narrative references), and $q_{\mathrm{consist}}$ assesses category-layer consistency (verifying alignment with the HFACS hierarchical structure).
Confidence Calibration and Evidence Binding: The evidence binding function implements a structured assessment protocol with explicit confidence scoring: 0.9–1.0 for explicit textual statements, 0.7–0.8 for strong indirect evidence, 0.5–0.6 for moderate implications, 0.3–0.4 for weak indicators, and 0.1–0.2 for very weak or speculative evidence. This calibrated scoring system ensures analytical transparency and enables systematic validation of classification decisions through traceable evidence chains.
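The evidence-strength bands above translate directly into a lookup; the function name is illustrative.

```python
def evidence_band(confidence):
    """Map a calibrated confidence score to its evidence-strength band
    (the five bands described in the text)."""
    bands = [
        (0.9, "explicit textual statement"),
        (0.7, "strong indirect evidence"),
        (0.5, "moderate implication"),
        (0.3, "weak indicator"),
        (0.1, "very weak or speculative"),
    ]
    # Bands are checked from strongest to weakest lower bound.
    for lower, label in bands:
        if confidence >= lower:
            return label
    return "below scoring range"

print(evidence_band(0.75))  # strong indirect evidence
```

Because each classification carries its band alongside the cited excerpt, a reviewer can audit whether a 0.9-level claim really rests on an explicit statement in the narrative.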
3.3.3. Interactive Analysis Platform and Quality Assurance
The complete system is implemented as a modular web-based application using the Streamlit framework, providing an intuitive interface for both individual incident analysis and batch processing of large datasets. The platform incorporates real-time progress monitoring, interactive visualization of HFACS classification results, and comprehensive quality assurance mechanisms including human-in-the-loop validation workflows.
Analysis outputs follow the structured JSON format described in Section 3.2.3, with pattern identifications, confidence scores, supporting evidence excerpts, and reasoning chains. As demonstrated in Figure 7, the platform provides comprehensive visualization of HFACS classification results through interactive hierarchical trees that display the four-layer structure with color-coded confidence levels and detailed classification metrics.

Figure 7.
Interactive HFACS hierarchy highlighting four selected categories across the four tiers.
For cases where classification confidence falls below predetermined thresholds (typically 0.5 for individual classifications or 0.6 for overall analysis confidence), the system implements human-in-the-loop mechanisms that flag incidents for expert review. This hybrid approach ensures analytical quality while enabling scalable processing of large incident datasets, maintaining compatibility with the dual-taxonomy architecture and supporting continuous refinement of analytical models and knowledge bases based on expert feedback and emerging UAV operational patterns. The complete system has been deployed as a publicly accessible prototype at https://uav-accident-forensics.streamlit.app/ (accessed on 7 October 2025) for evaluation and testing.
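The threshold-based flagging logic can be sketched as follows; the function and parameter names are assumptions, while the 0.5 per-classification and 0.6 overall thresholds are the values quoted above.

```python
def needs_expert_review(per_category_conf, overall_conf,
                        cat_threshold=0.5, overall_threshold=0.6):
    """Flag an incident for human-in-the-loop review when any individual
    HFACS classification falls below 0.5 or the overall analysis
    confidence falls below 0.6."""
    return (overall_conf < overall_threshold
            or any(c < cat_threshold for c in per_category_conf))

# An analysis whose weakest category sits at 0.45 is routed to an expert:
print(needs_expert_review([0.9, 0.45, 0.8], overall_conf=0.72))  # True
```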
4. Dataset Construction and Annotation
4.1. Data Source and Selection
To address the shortage of HFACS-coded UAV accident data, we developed a dedicated UAV-HFACS dataset using the Aviation Safety Reporting System (ASRS) [], the primary U.S. voluntary safety reporting program operated by NASA. From the complete ASRS repository, we extracted all Unmanned Aircraft System (UAS) reports between 2010 and May 2025, covering three operational categories: Public Aircraft Operations, Recreational Operations/Section 44809, and Part 107 commercial operations. Reports were retained only if they explicitly involved UAV operations, contained sufficiently detailed narratives for human factors analysis, and had complete structured metadata. The selection also ensured diversity across flight phases, control modes, and environmental conditions. Applying these criteria yielded an initial collection of 586 high-quality UAV accident reports.
Each retained record preserved the following ASRS fields essential for HFACS analysis:
- ACN: Unique record identifier for traceability.
- Narrative: Primary analysis text containing detailed event descriptions.
- Synopsis: Auxiliary summary providing context.
- Anomaly: Preliminary classification indicators.
- Flight Phase: Operational phase during the event.
- Control Mode (UAS): Manual, autopilot, or assisted operation.
- Human Factors: Pre-tagged human factor indicators.
- Weather/Environment: Conditions affecting operations.
- Authorization Status: Regulatory compliance and waiver information.
- Visual Observer Configuration: Crew resource management factors.
4.2. Annotation and HFACS Coding
Annotation was conducted using an in-house developed HFACS Expert System (Figure 8), which provided structured workflows, automated consistency checks, and consensus resolution mechanisms. The annotation team consisted of five domain experts with complementary backgrounds in aviation safety engineering, human factors, UAV operations, accident investigation, and aviation psychology. Each record was independently coded according to the HFACS 8.0 framework, systematically covering four hierarchical levels and 18 categories. Level 1 (Unsafe Acts) addressed performance/skill-based errors (AE100), judgment and decision-making errors (AE200), and known deviations (AD000). Level 2 (Preconditions) covered mental awareness conditions (PC100), state of mind conditions (PC200), adverse physiological conditions (PC300), the physical environment (PE100), technological environment (PE200), team coordination/communication factors (PP100), and training conditions (PT100). Level 3 (Unsafe Supervision) included supervisory climate/unit safety culture (SC000), supervisory known deviations (SD000), ineffective supervision (SI000), and ineffective planning and coordination (SP000). Level 4 (Organizational Influences) encompassed organizational climate/culture (OC000), policy, procedures, or process issues (OP000), resource support problems (OR000), and training program problems (OT000). Corresponding codes can be found in Figure 3. Following annotation, the UAV-HFACS dataset comprises 586 de-identified ASRS UAS reports, each containing the original ACN identifier, normalized narrative text, structured metadata, and HFACS 8.0 labels. The complete dataset contains 10,548 coded data points (586 reports × 18 HFACS categories), with 583 reports (99.5%) containing at least one identified human factor.
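The 18-category taxonomy enumerated above can be captured as a simple mapping, which also double-checks the dataset arithmetic (586 reports × 18 categories = 10,548 coded points).

```python
# HFACS 8.0 category codes grouped by level, as enumerated in the text.
HFACS_8_0 = {
    "L1 Unsafe Acts": ["AE100", "AE200", "AD000"],
    "L2 Preconditions": ["PC100", "PC200", "PC300", "PE100",
                         "PE200", "PP100", "PT100"],
    "L3 Unsafe Supervision": ["SC000", "SD000", "SI000", "SP000"],
    "L4 Organizational Influences": ["OC000", "OP000", "OR000", "OT000"],
}

n_categories = sum(len(codes) for codes in HFACS_8_0.values())
print(n_categories)            # 18 categories across four levels
print(586 * n_categories)      # 10548 coded data points for 586 reports
```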
From this comprehensive collection, a stratified subset of 200 reports (34.1% of total) was systematically selected for experimental validation, ensuring representative coverage across diverse operational contexts and maintaining a robust statistical foundation for model performance evaluation.

Figure 8.
HFACS Expert System interface for UAV accident annotation.
To facilitate reproducibility, ACNs are retained to allow matching with supplementary ASRS fields or external datasets. The dataset, along with the HFACS coding guidelines and preprocessing scripts, will be released under a research-only license, enabling its use in safety analysis, human factors research, and AI-assisted accident investigation.
4.3. Quality Assurance and Dataset Characteristics
Quality assurance was ensured through independent coding by all experts, with inter-rater reliability exceeding 85%, and resolution of discrepancies through structured consensus meetings. Each classification was assigned a confidence score on a 0–1 scale, ensuring completeness and consistency across all HFACS categories. From the initial pool, a final experimental subset of 200 records was selected for analysis, prioritizing completeness, temporal continuity, and narrative clarity. The resulting dataset contains 3600 coded data points, including 1522 positive instances (an average of 7.6 factors per record) with full coverage of all HFACS categories. The category distribution reflects realistic UAV accident complexity, with high-frequency factors (≥50%) dominated by decision-making error, cognitive state, and policies/procedures, medium-frequency factors (20–50%) spanning multiple levels (nine categories), and low-frequency factors (<20%) representing rare but critical risks (two categories). Detailed statistics are provided in Table 1. Overall, the UAV-HFACS dataset provides a rigorously coded, quality-assured resource that reflects the operational realities of UAV safety events and establishes a robust foundation for subsequent analytical experiments.

Table 1.
UAV-HFACS dataset statistics and category distribution.
5. Experiments and Results
5.1. Selected UAV-HFACS Subset
The proposed analysis was conducted on a curated subset of the UAV-HFACS dataset. Considering the sample sizes adopted in comparable human factors reasoning tasks and balancing practical constraints of expert review, a total of 200 accident records were selected. This subset was chosen to prioritize completeness of HFACS category coverage, temporal continuity, and narrative clarity. To ensure representativeness, cases were drawn to preserve the natural distribution of UAV operational contexts and environmental conditions.
The final experimental set contains 3600 coded data points, including 1522 positive instances, corresponding to an average of 7.6 HFACS-coded factors per record. All four HFACS levels and 18 categories are represented, with high-frequency factors (≥50%) dominated by judgment & decision-making errors, ineffective supervision, and policy/procedures issues; medium-frequency factors (20–50%) distributed across nine categories spanning multiple HFACS levels; low-frequency factors (<20%) capturing rare but critical risks including supervisory known deviations and adverse physiological conditions. Inter-rater reliability for the subset exceeded 85%, with all discrepancies resolved via structured consensus meetings. Detailed category-level statistics are provided in Table 1.
5.2. Model and Performance Metrics
5.2.1. Model Configuration
Seven state-of-the-art LLMs, spanning proprietary and open-source architectures, were evaluated for HFACS-UAV multi-label classification. Proprietary models comprised GPT-4o, GPT-4o-mini, GPT-4.1-nano, GPT-5, and GPT-5-mini, selected for their strong reasoning and NLU performance. Open-source baselines included Llama3:8B and Gemma3:4B (run via Ollama) to examine the potential for cost-efficient deployment. All models were run under identical inference parameters (temperature, max_tokens, and top-p held fixed across runs) to control for configuration-induced variance and enable direct capability comparison.
5.2.2. Domain Knowledge and Few-Shot Enhancement
To address the domain–task transfer gap of general-purpose LLMs, a structured YAML knowledge base was developed containing: (i) definitions of all 18 HFACS categories, (ii) UAV-specific exemplars from real accident cases, (iii) hierarchical L1–L4 mapping, and (iv) operational context guidelines for different mission types. This machine-readable yet human-auditable resource enhanced reproducibility and transparency, allowing models to ground predictions in codified safety taxonomies rather than generic priors (see Figure 3 and Table A1).
A few-shot prompting strategy further supplied curated examples spanning: diverse UAV contexts, varying complexity (single- vs. multi-factor), explicit contrasts across categories, and balanced coverage across HFACS layers. This pairing of structured domain knowledge with representative demonstrations was designed to sharpen decision boundaries and reduce category ambiguity.
5.2.3. Evaluation Metrics
Performance was assessed using standard multi-label classification metrics—Precision, Recall, F1-score, and Accuracy—calculated per HFACS category with macro-averaging. Given the operational imperative to avoid missed contributing factors in early-stage investigations, Recall and F1 were prioritized as primary indicators over Precision. Prediction confidence scores (0–1) were also analyzed to gauge deployment reliability. Ground-truth labels were derived through expert consensus validation of GPT-4.1-mini outputs, providing a balance of reasoning fidelity and computational cost. Statistical robustness was ensured through effect-size analysis with 95% confidence intervals.
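Macro-averaging as used here weights every HFACS category equally regardless of support; a minimal sketch with illustrative counts (not drawn from the paper's results):

```python
def macro_prf1(per_category_counts):
    """Macro-averaged precision/recall/F1 over HFACS categories.

    per_category_counts: list of (tp, fp, fn) tuples, one per category.
    """
    ps, rs, f1s = [], [], []
    for tp, fp, fn in per_category_counts:
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        ps.append(p); rs.append(r); f1s.append(f1)
    n = len(per_category_counts)
    # Each category contributes equally to the average.
    return sum(ps) / n, sum(rs) / n, sum(f1s) / n

# Toy two-category example with illustrative counts:
p, r, f1 = macro_prf1([(80, 20, 20), (30, 30, 30)])
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.65 0.65 0.65
```

Because rare categories (e.g., PC300, SD000) count as much as frequent ones, macro-F1 penalizes models that only do well on high-support categories, which matches the recall-first priority stated above.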
5.3. Evaluation Results
5.3.1. Overall Evaluation
All models were evaluated on the same dataset of 200 expert-annotated UAV accident reports. Due to technical limitations and API constraints, some models were unable to process the complete dataset: GPT-5-mini processed 198 records (99.0%), Gemma3:4B processed 193 records (96.5%), GPT-5 processed 197 records (98.5%), and Llama3:8B processed 195 records (97.5%). Performance metrics were calculated on the successfully processed records for each model, ensuring fair comparison within each model’s operational constraints. The performance comparison visualized in Figure 9 demonstrates clear distinctions across the four key metrics. Across all evaluated models, macro-F1 ranged from 0.576 to 0.763 (mean = 0.671). GPT-4o-mini demonstrated the best overall balance (F1 = 0.763; Precision = 0.712; Recall = 0.821), accompanied by an average confidence of 0.815. Proprietary models generally outperformed open-source counterparts in balancing precision and recall.

Figure 9.
HFACS-UAV classification performance across evaluated LLMs.
The performance comparison demonstrates clear distinctions across the four key metrics, with GPT-4o-mini showing optimal F1-score performance through superior recall capabilities while maintaining competitive precision. GPT-4o achieved the highest precision (0.817) but suffered from lower recall (0.635), resulting in an F1-score of 0.715 and highlighting the precision–recall trade-off inherent in HFACS classification tasks. GPT-5-mini demonstrated exceptional recall (0.872) with moderate precision (0.599), achieving an F1-score of 0.710 and indicating a tendency toward over-identification of HFACS factors. Confidence scores showed only moderate correlation with performance: GPT-4o, for instance, paired the highest confidence (0.853) with a mid-range F1 (0.715). Although open-source models lagged substantially behind their proprietary counterparts, Gemma3:4B performed broadly on par with GPT-4.1-nano, indicating that small open-source models can still deliver consistent and potentially scalable HFACS classification.
5.3.2. Sub-Tasks Evaluation
This subsection presents a fine-grained evaluation of seven large language models across the 18 HFACS 8.0 categories, assessed with four standard metrics (F1, recall, precision, and accuracy) on 200 UAV accident reports. Hierarchical-level results are summarized in Figure 10, while category-level envelopes are illustrated in Figure 11. For visual clarity, the figures show only four representative models; complete results for all seven models are provided in Table A2 and Table A3. To ensure comparability, all results were generated under identical protocols and are reported to two decimal places.

Figure 10.
Four-level HFACS radar (L1–L4) F1 Comparison for four models.

Figure 11.
Eighteen-category radar comparison across F1, recall, precision, and accuracy for seven models under HFACS 8.0.
At the hierarchical level, performance declines monotonically from L1 (Unsafe Acts) to L4 (Organizational Influences), consistent with the increasing abstraction of causal constructs and the reduced prevalence of explicit textual cues (Figure 10). L1 remains the most stable across models. At the category level, Figure 11 indicates that detectability is driven by a combination of linguistic salience and empirical frequency. Within L1, AE200 (Decision-Making Error) achieves a peak F1 of 93.87% with GPT-4o, while AE100 (Skill-Based Error) peaks at 81.61% with GPT-4o-mini (Table A2). L2 (Preconditions) shows intermediate performance but higher variance: PE200 benefits from concrete equipment/interface descriptions, reaching F1 = 91.28% with GPT-4o along with high precision and accuracy (93.15% and 93.50%, respectively). By contrast, PC300 is constrained by sparse physiological cues and peaks at only 55.56% (GPT-4o). L3 (Unsafe Supervision) requires broader contextual inference: SI000 and SC000 attain F1 = 80.75% and 77.25%, respectively, on GPT-4o-mini, while SD000 remains particularly challenging, with a best F1 of 46.67% (Table A3). The most difficult categories cluster in PC300, SD000, and PC200, which consistently remain in the F1 < 60.00% band across most models, constrained by signal sparsity (physiological states), rarity (supervisory deviations), or abstraction (management-level decisions), as reflected in both tables. Proprietary models are more consistent on supervisory and organizational factors, whereas open-source baselines are comparatively competitive on action-level L1 categories but lag on L3/L4.
A cross-metric view of Table A2 and Table A3 reveals distinct model profiles. GPT-4o is precision/accuracy oriented, attaining the highest precision in PE200 and AE100 and leading accuracy in several L2 categories, thereby favoring low false-positive rates. GPT-4o-mini provides the most balanced F1 across layers, reflecting a robust precision–recall trade-off. GPT-5-mini prioritizes recall, achieving perfect recall in PE200 (100%) and elevated recall in supervisory categories (e.g., SP000 = 95.35%). This recall-oriented profile maximizes factor coverage but requires downstream verification to control false positives. Category-level disparities can be substantial, highlighting the sensitivity of HFACS-specific capability to model architecture, pre-training distribution, and optimization objectives.
Frequency effects are evident: high-support categories (e.g., AE200, OP000) consistently achieve stable, higher scores, whereas rarer or more abstract categories (PC300, SD000, OT000) remain constrained across all models. This pattern suggests that targeted data augmentation, HFACS-aware prompting, and expert-in-the-loop workflows could preferentially enhance recall and calibration for rare but safety-critical supervisory and organizational factors.
These observations inform deployment strategies that should be aligned with specific operational priorities. For a robust F1 baseline with broad stability, GPT-4o-mini provides the most balanced trade-off. When false positives are costly and overall accuracy is paramount, GPT-4o is better suited for compliance-oriented review. For early-stage safety screening where sensitivity is prioritized, GPT-5-mini maximizes factor capture and can be paired with human verification or cascaded filtering. Open-source models are currently most effective for action-level analyses at L1, while their performance at L3/L4 improves when combined with domain knowledge bases or weakly supervised adaptation strategies. The qualitative envelopes shown in Figure 10 and Figure 11 align closely with the quantitative rankings presented in Table A2 and Table A3.
5.3.3. Error Pattern Analysis
To provide transparency and guide future improvements, we analyzed the error patterns in GPT-4o’s predictions. We identified 304 false positives (hallucinations, 8.57%) and 654 false negatives (omissions, 18.44%), for an overall accuracy of 72.98%. We categorized the errors into distinct patterns (Table 2). The predominance of omissions over hallucinations (a 2.15:1 ratio) indicates a conservative bias. Notably, organizational-level factors (Level 4) were most frequently missed (37.5% of all omissions), while precondition factors (Level 2) showed the highest hallucination rate, particularly for psychological factors such as PC100 (Adverse Mental States, 48 cases).
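The conservative-bias ratio quoted above follows directly from the two error counts:

```python
# Error-profile arithmetic for the GPT-4o analysis reported above.
false_positives = 304   # hallucinated factors
false_negatives = 654   # omitted factors

ratio = false_negatives / false_positives
print(round(ratio, 2))  # 2.15 -- omissions outnumber hallucinations ~2.15 : 1
```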

Table 2.
Error Pattern Analysis for GPT-4o HFACS Classification.
Representative Examples:
- Category Confusion (ACN 1294113): A UAV pilot filed a NOTAM but forgot to contact the Tower via phone before each flight. Ground truth: AE200 (Judgment and Decision-Making Errors—memory lapse). GPT-4o prediction: AE100 (Skill-Based Errors). The model confused a cognitive failure with a skill deficiency.
- Level Confusion (ACN 1652001): UAV operator reported miscommunication with ATC during flight. Ground truth: PC100 + PP100 (Level 2—preconditions). GPT-4o prediction: AE100 (Level 1—unsafe act). The model attributed the error to operator skill rather than underlying workload/stress conditions.
- Organizational Blindness (ACN 879418): Pilot allowed non-qualified person in pilot seat during mission. Ground truth included OP000 (Organizational Process) and OR000 (Resource Management). GPT-4o correctly identified supervisory failures but missed organizational root causes.
These error patterns suggest targeted improvements including contrastive learning for category distinction, hierarchical reasoning for level attribution, and enhanced attention to organizational factors in complex incidents.
5.4. Ablation Studies
5.4.1. Design and Reporting Protocol
This ablation study systematically evaluates the independent and combined effects of two components—a YAML-based HFACS/UAV knowledge resource and in-context few-shot exemplars—using a controlled 2 × 2 factorial design. All experiments use GPT-4o-mini with identical decoding settings. Prompts differ only in the presence/absence of the two components. The evaluation set comprises 200 expert-annotated UAV accident reports in a multi-label regime over 18 HFACS 8.0 categories. Macro-F1 is computed as the unweighted average of per-category F1 scores across all 18 categories, giving each category equal weight regardless of frequency.
We compare four configurations: Baseline (No-YAML, zero-shot), +YAML (YAML, zero-shot), +Few-shot (No-YAML, few-shot), and Full (YAML, few-shot). The YAML resource supplies HFACS 8.0 definitions, hierarchical relations, and UAV terminology in a machine-readable format, ensuring transparency and reproducibility, but excludes any hand-crafted narrative-to-label rules (see Figure 3 and Table A1). Few-shot exemplars are drawn from a pool disjoint from the evaluation set to prevent information leakage, and all other hyperparameters are kept fixed across conditions. We report Macro-F1 and absolute/relative changes versus the Baseline. No additional inferential statistics are introduced.
5.4.2. Results and Practical Implications
As shown in Figure 12, the Baseline configuration achieves a Macro-F1 of 0.594, indicating non-trivial inherent HFACS reasoning ability in GPT-4o-mini. +YAML improves to 0.644 (absolute +0.050, relative +8.4%), showing that structured domain knowledge improves performance even without exemplars. +Few-shot reaches 0.653 (absolute +0.059, relative +10.0%), suggesting slightly stronger gains from contextual examples than from knowledge alone. The Full configuration reaches a Macro-F1 of 0.705, an absolute gain of +0.111 (+18.7%) over Baseline, demonstrating the complementary benefits of structured knowledge and contextual exemplars.

Figure 12.
Ablation with a 2 × 2 factorial design. (A) Macro-F1 across four configurations, improving from Baseline to Full. (B) Component contributions showing YAML formatting and few-shot learning gains with minimal interaction effects. (C) Level-wise performance (L1–L4) demonstrating hierarchical degradation from concrete unsafe acts to abstract organizational factors. (D) Absolute improvements by level, with larger gains for L3 supervision and L2 preconditions factors.
An additive attribution analysis indicates balanced contributions: approximately 53% of the total gain stems from few-shot exemplars, 45% from the YAML knowledge base, and only 2% from interaction effects (i.e., the difference between the Full improvement and the sum of the individual improvements). This near-additive behavior suggests that the two components operate through largely independent mechanisms: the YAML resource contributes systematic domain structure, while exemplars reinforce contextual pattern recognition. Figure 12 summarizes these findings: panel (A) shows the monotonic progression from Baseline to Full, panel (B) visualizes the additive attribution, and panels (C,D) present the level-wise effects.
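The additive attribution reduces to simple arithmetic over the four reported Macro-F1 scores, as the following sketch shows:

```python
# Macro-F1 scores reported for the four configurations (Section 5.4.2).
scores = {"Baseline": 0.594, "+YAML": 0.644, "+Few-shot": 0.653, "Full": 0.705}

total_gain = scores["Full"] - scores["Baseline"]          # 0.111
yaml_gain = scores["+YAML"] - scores["Baseline"]          # 0.050
few_shot_gain = scores["+Few-shot"] - scores["Baseline"]  # 0.059

# Interaction = Full improvement minus the sum of the individual improvements.
interaction = total_gain - (yaml_gain + few_shot_gain)    # 0.002

shares = {
    "few-shot": few_shot_gain / total_gain,    # ~0.53
    "yaml": yaml_gain / total_gain,            # ~0.45
    "interaction": interaction / total_gain,   # ~0.02
}
```

A near-zero interaction term is what licenses the claim that the two components act through largely independent mechanisms.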
Category-level responses (Figure 13) align with the ablation trends. High-frequency, linguistically salient categories show strong Baseline performance and modest incremental gains under Full—e.g., L1 Decision-Making Errors: 0.85 → 0.94 (+0.09, +10.6%) and L4 Policy/Procedures: 0.68 → 0.76 (+0.08, +11.8%), consistent with ceiling effects. In contrast, categories requiring more complex inference benefit disproportionately when both components are combined—L2 Mental Awareness: 0.58 → 0.81 (+0.23, +39.7%); L3 Ineffective Supervision: 0.37 → 0.81 (+0.44, +118.9%). Technical factors exhibit a nuanced interaction: L2 Technological Environment peaks under +Few-shot and slightly declines under Full, indicating mild interference between definitions and exemplar-driven cues. All remaining categories improve monotonically from Baseline to Full, including L4 Resource Management: 0.42 → 0.70 (+0.28, +66.7%).

Figure 13.
Category-specific responses to component ablation for six representative HFACS categories.
For practical deployment, we advocate a modular integration approach. Under resource constraints, prioritizing few-shot exemplars is advisable, as they typically yield larger marginal gains and are more cost-effective to implement initially. For comprehensive, cross-level coverage—particularly at higher HFACS tiers—the Full configuration is preferable, providing the most balanced hierarchical improvements. Because interactions between the two components are minimal, a phased rollout is natural: begin with exemplars to secure rapid gains, then incorporate the YAML knowledge base to broaden coverage and lock in predictable, incremental improvements. In technical subdomains, lightly calibrating the specificity of YAML definitions helps avoid over-constraining exemplar-driven inference.
5.5. AI Models vs. Human Expert Evaluation
To further evaluate the performance of large language models in UAV accident investigation, we compared the performance of three advanced AI models (GPT-5, GPT-5-mini, and GPT-4.1-nano) with that of human experts, focusing on L1 (Unsafe Acts) and L2 (Preconditions for Unsafe Acts) HFACS categories using standardized coding. This comparison provides crucial insights into the potential and limitations of AI-assisted HFACS analysis for the most directly observable and inferable factors from accident narratives. The evaluation encompasses ten categories: three L1 categories (AE100 Performance/Skill-Based Error, AE200 Judgment & Decision-Making Error, AD000 Known Deviation) and seven L2 categories (PC100 Mental Awareness Conditions, PC200 State of Mind Conditions, PC300 Adverse Physiological Conditions, PE100 Physical Environment, PE200 Technological Environment, PP100 Team Coordination/Communication Factors, PT100 Training Conditions). Human-expert performance was measured on the same 200 reports using our HFACS Expert System (Figure 8), with scores computed against adjudicated consensus labels from the five-expert annotation team, thereby controlling for the inherent challenges of inferring cognitive and physiological states from narrative descriptions.
The comparison reveals distinct performance patterns across L1 and L2 HFACS categories. Human experts achieve the highest average F1-score (0.728), followed by GPT-5-mini (0.669), GPT-5 (0.652), and GPT-4.1-nano (0.608), with an approximately 8.1% relative gap between human experts and the best-performing AI model. As shown in Figure 14, category-level patterns are clear: in L1, GPT-5-mini surpasses humans in AE200 (Decision-Making Error, 0.909 vs. 0.860) and AD000 (Known Deviation, 0.755 vs. 0.730), whereas humans lead AE100 (Skill-Based Error, 0.800 vs. 0.772 for the best AI). In L2, GPT-5-mini leads PC100 (Mental Awareness, 0.792 vs. 0.690), GPT-4.1-nano leads PP100 (Team Coordination, 0.719 vs. 0.710), and humans remain ahead in PC200 (State of Mind, 0.650 vs. 0.466 for the best AI), PC300 (Physical Conditions, 0.600 vs. 0.480), PE100 (Physical Environment, 0.770 vs. 0.757), PE200 (Technological Environment, 0.830 vs. 0.804), and PT100 (Training Conditions, 0.640 vs. 0.553). Overall, AI models lead in four of the ten categories (GPT-5-mini: AE200, AD000, PC100; GPT-4.1-nano: PP100), while humans lead in the remaining six; the largest human advantage occurs in PC200 (State of Mind), underscoring the difficulty of inferring complex psychological states from narrative text. These findings argue for a complementary division of labor: deploy AI for systematic decision analysis and explicit violation/cognitive cues, and allocate human expertise to psychological-state assessment, skill-based evaluations, and nuanced environmental interpretation—favoring category-specific collaboration over uniform automation.

Figure 14.
LLMs vs. human experts on HFACS L1–L2: per-category F1 across ten standardized categories.
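The per-category scoring behind this comparison—each coder (AI model or human expert) evaluated against adjudicated consensus labels—can be sketched as follows. The data here is synthetic and the coder names are illustrative; only the scoring logic reflects the described protocol:

```python
import numpy as np

def per_category_f1(consensus: np.ndarray, coder: np.ndarray) -> np.ndarray:
    """F1 for each label column of a coder's output vs. adjudicated consensus.

    Uses F1 = 2*TP / (2*TP + FP + FN); columns with no positives on
    either side score 0.0.
    """
    tp = ((consensus == 1) & (coder == 1)).sum(axis=0)
    fp = ((consensus == 0) & (coder == 1)).sum(axis=0)
    fn = ((consensus == 1) & (coder == 0)).sum(axis=0)
    denom = 2 * tp + fp + fn
    return np.where(denom > 0, 2 * tp / np.where(denom > 0, denom, 1), 0.0)

# Hypothetical: 200 reports, 10 L1/L2 categories, several coders.
rng = np.random.default_rng(1)
consensus = rng.integers(0, 2, size=(200, 10))
coders = {
    "model_a": rng.integers(0, 2, size=(200, 10)),  # random stand-in
    "expert": consensus.copy(),                     # perfect agreement
}
results = {name: per_category_f1(consensus, y) for name, y in coders.items()}
```

Averaging each coder's vector in `results` gives the per-coder mean F1 used for the overall ranking, while the individual entries support the category-by-category comparison in Figure 14.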
6. Conclusions
This study developed a UAV accident forensics system that integrates the HFACS 8.0 human factors framework with advanced LLM reasoning, supporting the identification of the underlying human factors in unstructured narrative incident reports. By coupling HFACS 8.0’s structured taxonomy with the inferential power of an LLM, the system can systematically extract contributory factors across all causal tiers—from frontline unsafe acts and preconditions to supervisory and organizational influences—in UAV operations. In our evaluation, the HFACS-LLM approach demonstrated the ability to detect subtle narrative cues and latent error patterns that human analysts may overlook, thereby augmenting traditional investigation methods with greater consistency and depth. Beyond analytical accuracy, this approach can reduce the time, cost, and human effort required for UAV accident reporting and analysis, enabling more timely collection and assessment of incidents that might otherwise go unreported. Such efficiency can help narrow the gap between real-world occurrences and research data availability, creating a more complete foundation for evidence-based safety improvements.
However, several limitations must be acknowledged. First, the system’s performance is constrained by the quality and detail of incident narratives; incomplete or biased reports could diminish the accuracy of its inferences. Second, the models were developed and validated on specific UAV operational data, so their generalizability to other UAV domains or mission profiles may be limited without further adaptation. Third, in high-stakes investigative contexts, human expert validation remains essential to interpret AI-generated conclusions and maintain accountability. Addressing these challenges in future work will involve training on more diverse incident datasets and incorporating additional contextual data (e.g., flight telemetry or environmental conditions) to improve the models’ robustness. Moreover, integrating a more structured human-in-the-loop validation step or an ensemble-based analytical framework could further enhance the reliability and acceptance of AI-driven forensics. By addressing these limitations, the HFACS-LLM reasoning approach may evolve into a trusted component of next-generation UAV safety management, ultimately contributing to stronger regulatory assessments and safer autonomous flight operations.
Author Contributions
Conceptualization, Y.Y.; Methodology, Y.Y. and B.L.; Software, Y.Y.; Data curation, B.L.; Writing—original draft, Y.Y. and B.L.; Writing—review & editing, B.L. and G.L.; Supervision, B.L. and G.L. All authors have read and agreed to the published version of the manuscript.
Funding
This research was supported by the China Scholarship Council (CSC) scholarship (Grant No. 202410320015).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
This research utilized publicly available, anonymized ASRS incident reports released by NASA for safety research purposes. No personally identifiable information was processed or retained. The study was conducted in accordance with institutional research ethics guidelines, and all expert validation participants provided informed consent.
Data Availability Statement
All experimental data, code, and Appendix A are available in the project repository https://github.com/YukeyYan/UAV-Accident-Forensics-HFACS-LLM (accessed on 7 October 2025).
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. HFACS-UAV Patterns and Performance Metrics
This appendix presents the 26 UAV-specific enhancement patterns and complete performance metrics for all evaluated models.

Table A1.
UAV-Specific Enhancement Patterns for HFACS.
Pattern Name | Definition | Representative Case Example |
---|---|---|
Level 4: Organizational Influences | ||
Part 107 Compliance | Organizational challenges in maintaining compliance with FAA Part 107 small UAS regulations | Commercial operator failed to maintain current remote pilot certificates for 40% of flight crew, resulting in unauthorized operations |
LAANC Authorization | Organizational processes for Low Altitude Authorization and Notification Capability | Delivery company’s automated LAANC requests contained systematic altitude errors, causing 15 unauthorized airspace penetrations |
VLOS Requirements | Organizational management of Visual Line of Sight operational requirements | Survey company’s policy allowed operations up to 2 miles without visual observers, violating VLOS requirements |
BVLOS Operations | Organizational challenges in Beyond Visual Line of Sight operations | Research institution conducted BVLOS flights without proper waiver, lacking required detect-and-avoid systems |
Battery Constraints | Organizational resource management for power system limitations | Fleet operator’s inadequate battery replacement schedule led to 30% capacity degradation and multiple forced landings |
Energy Management | Organizational strategies for UAV energy and endurance management | Emergency response team lacked standardized battery management protocols, causing mission failures during critical operations |
Level 3: Unsafe Supervision | ||
GCS Interface Complexity | Supervisory challenges in managing Ground Control Station interface complexity | Supervisor failed to ensure pilot training on new GCS software (QGroundControl v4.2.8), leading to mode confusion during critical mission |
Delayed Feedback Systems | Supervisory management of delayed feedback in remote operations | Operations manager did not account for 200 ms video latency in flight planning, causing multiple near-miss incidents |
LAANC Integration | Supervisory oversight of LAANC system integration and usage | Flight supervisor approved operations without verifying LAANC authorization status, resulting in controlled airspace violation |
Airspace Authorization | Supervisory management of airspace authorization processes | Chief pilot failed to establish procedures for real-time airspace status monitoring, causing TFR violations |
Emergency Procedures | Supervisory oversight of UAV-specific emergency procedures | Training supervisor did not require C2 link loss simulation, leaving pilots unprepared for actual communication failure |
Level 2: Preconditions for Unsafe Acts | ||
C2 Link Reliability | Command and Control link reliability as precondition for safe operations | Pilot lost C2 link for 45 s due to interference, unable to regain control before UAV entered restricted airspace |
Telemetry Accuracy | Accuracy and timeliness of telemetry data transmission | GPS altitude readings were 50 ft inaccurate due to multipath interference, causing terrain collision during low-altitude survey |
Range Limitations | Communication and control range limitations affecting operations | Pilot attempted operation at 3 mile range despite 2-mile equipment limitation, resulting in complete signal loss |
Signal Degradation | Progressive degradation of communication signals during operations | Gradual signal weakening over 10 min went unnoticed, leading to delayed response during emergency maneuver |
Mode Confusion | Confusion between different flight modes and automation levels | Pilot believed UAV was in manual mode but was actually in GPS hold, causing unexpected behavior during landing approach |
Automation Dependency | Over-reliance on automated systems affecting manual skills | When autopilot failed, pilot’s degraded manual flying skills resulted in hard landing and structural damage |
Weather Sensitivity | UAV sensitivity to weather conditions affecting operations | 15-knot crosswind exceeded small UAV’s capability, but pilot continued operation resulting in loss of control |
Spatial Disorientation | Loss of spatial orientation due to remote operation characteristics | Pilot lost orientation during FPV flight in featureless terrain, unable to determine UAV attitude or position |
Visual Limitations | Limited visual references and environmental cues in remote operations | Camera’s limited field of view prevented detection of approaching aircraft until collision was unavoidable |
Battery Limits | Battery capacity and performance limitations affecting operations | Cold weather reduced battery performance by 40%, causing unexpected power loss during return flight |
Automation Degradation | Degradation of automated system performance over time | Uncalibrated IMU caused gradual drift in autonomous flight path, leading to controlled airspace violation |
Level 1: Unsafe Acts | ||
Autopilot Interactions | Errors in interaction with autopilot systems and mode management | Pilot incorrectly programmed waypoint altitude as 400 ft AGL instead of MSL, causing terrain collision |
Flight Mode Switching | Errors during transitions between different flight modes | Rapid switching between manual and GPS modes during emergency caused control oscillations and crash |
Limited Visual References | Errors due to limited visual references in remote operations | Pilot misjudged distance to obstacle due to camera perspective, resulting in collision with power line |
Delayed Feedback | Errors caused by delayed feedback from remote systems | 300 ms video delay caused pilot to overcorrect during landing, resulting in hard impact and damage |

Table A2.
Performance comparison of four metrics across 18 HFACS categories (Part A: AE100–PP100, values in %). Bold values denote per-category best; underlined values denote the second-best.
Metric | Model | AE100 | AE200 | AD000 | PC100 | PC200 | PC300 | PE100 | PE200 | PP100 |
---|---|---|---|---|---|---|---|---|---|---|
F1-score | GPT-4o-mini | 81.61 | 87.82 | 65.80 | 80.87 | 44.12 | 55.32 | 74.42 | 84.88 | 78.87 |
GPT-4o | 60.11 | 93.87 | 81.77 | 79.85 | 46.88 | 55.56 | 79.55 | 91.28 | 78.69 | |
GPT-5-mini | 74.46 | 90.91 | 75.51 | 79.22 | 40.82 | 41.67 | 75.73 | 71.09 | 68.67 | |
Gemma3:4B | 76.92 | 87.46 | 56.51 | 77.67 | 22.22 | 29.27 | 71.05 | 63.30 | 72.34 | |
GPT-4.1-nano | 77.15 | 88.76 | 61.18 | 58.29 | 46.60 | 28.57 | 64.44 | 75.86 | 71.93 | |
GPT-5 | 65.59 | 90.00 | 57.38 | 75.91 | 42.70 | 48.00 | 66.67 | 80.45 | 70.18 | |
Llama3:8B | 75.64 | 86.63 | 56.72 | 39.51 | 0.00 | 0.00 | 32.73 | 63.08 | 8.96 | |
Recall | GPT-4o-mini | 99.19 | 100.00 | 96.20 | 89.60 | 35.71 | 61.90 | 74.42 | 96.05 | 86.15 |
GPT-4o | 44.72 | 98.71 | 93.67 | 85.60 | 35.71 | 47.62 | 81.40 | 89.47 | 73.85 | |
GPT-5-mini | 70.49 | 97.40 | 93.67 | 97.60 | 71.43 | 71.43 | 90.70 | 100.00 | 87.69 | |
Gemma3:4B | 100.00 | 100.00 | 100.00 | 100.00 | 14.63 | 30.00 | 64.29 | 93.24 | 85.00 | |
GPT-4.1-nano | 83.74 | 99.35 | 98.73 | 46.40 | 57.14 | 57.14 | 67.44 | 72.37 | 63.08 | |
GPT-5 | 50.83 | 88.82 | 45.45 | 85.25 | 45.24 | 60.00 | 92.86 | 96.00 | 62.50 | |
Llama3:8B | 99.16 | 98.68 | 97.44 | 26.45 | 0.00 | 0.00 | 21.43 | 55.41 | 4.76 | |
Precision | GPT-4o-mini | 69.32 | 78.28 | 50.00 | 73.68 | 57.69 | 50.00 | 74.42 | 76.04 | 72.73 |
GPT-4o | 91.67 | 89.47 | 72.55 | 74.83 | 68.18 | 66.67 | 77.78 | 93.15 | 84.21 | |
GPT-5-mini | 78.90 | 85.23 | 63.25 | 66.67 | 28.57 | 29.41 | 65.00 | 55.15 | 56.44 | |
Gemma3:4B | 62.50 | 77.72 | 39.38 | 63.49 | 46.15 | 28.57 | 79.41 | 47.92 | 62.96 | |
GPT-4.1-nano | 71.53 | 80.21 | 44.32 | 78.38 | 39.34 | 19.05 | 61.70 | 79.71 | 83.67 | |
GPT-5 | 92.42 | 91.22 | 77.78 | 68.42 | 40.43 | 40.00 | 52.00 | 69.23 | 80.00 | |
Llama3:8B | 61.14 | 77.20 | 40.00 | 78.05 | 0.00 | 0.00 | 69.23 | 73.21 | 75.00 | |
Accuracy | GPT-4o-mini | 72.50 | 78.50 | 60.50 | 73.50 | 81.00 | 89.50 | 89.00 | 87.00 | 85.00 |
GPT-4o | 63.50 | 90.00 | 83.50 | 73.00 | 83.00 | 92.00 | 91.00 | 93.50 | 87.00 | |
GPT-5-mini | 70.20 | 84.85 | 75.76 | 67.68 | 56.06 | 78.79 | 87.37 | 69.19 | 73.74 | |
Gemma3:4B | 62.69 | 77.72 | 39.38 | 64.25 | 78.24 | 84.97 | 88.60 | 58.55 | 79.79 | |
GPT-4.1-nano | 69.50 | 80.50 | 50.50 | 58.50 | 72.50 | 70.00 | 84.00 | 82.50 | 84.00 | |
GPT-5 | 67.51 | 84.77 | 73.60 | 66.50 | 74.11 | 86.80 | 80.20 | 82.23 | 82.74 | |
Llama3:8B | 61.03 | 76.41 | 40.51 | 49.74 | 79.49 | 88.72 | 81.03 | 75.38 | 68.72 |

Table A3.
Performance comparison of four metrics across 18 HFACS categories (Part B: PT100–OT000, values in %). Bold values denote per-category best; underlined values denote the second-best.
Metric | Model | PT100 | SC000 | SD000 | SI000 | SP000 | OC000 | OP000 | OR000 | OT000 |
---|---|---|---|---|---|---|---|---|---|---|
F1-score | GPT-4o-mini | 59.41 | 77.25 | 46.67 | 80.75 | 68.45 | 66.29 | 88.43 | 70.45 | 57.73 |
GPT-4o | 39.47 | 51.32 | 37.50 | 75.86 | 44.63 | 41.18 | 89.36 | 46.51 | 53.66 | |
GPT-5-mini | 51.23 | 74.18 | 40.00 | 80.16 | 64.57 | 72.46 | 85.80 | 70.04 | 45.63 | |
Gemma3:4B | 60.87 | 58.89 | 22.22 | 51.72 | 51.43 | 53.42 | 85.98 | 8.60 | 21.95 | |
GPT-4.1-nano | 35.00 | 69.17 | 30.30 | 64.89 | 32.14 | 68.62 | 69.96 | 62.88 | 38.96 | |
GPT-5 | 55.32 | 48.18 | 36.36 | 60.96 | 60.24 | 48.18 | 60.17 | 44.93 | 43.75 | |
Llama3:8B | 0.00 | 73.83 | 26.47 | 76.45 | 12.77 | 25.21 | 83.65 | 2.22 | 0.00 | |
Recall | GPT-4o-mini | 52.63 | 85.71 | 58.33 | 84.92 | 73.56 | 55.77 | 96.13 | 66.67 | 51.85 |
GPT-4o | 26.32 | 37.14 | 50.00 | 69.84 | 31.03 | 26.92 | 94.84 | 32.26 | 40.74 | |
GPT-5-mini | 94.55 | 76.70 | 66.67 | 79.84 | 95.35 | 73.53 | 90.85 | 91.21 | 90.38 | |
Gemma3:4B | 63.64 | 53.00 | 16.67 | 36.89 | 43.37 | 43.43 | 92.62 | 4.49 | 17.31 | |
GPT-4.1-nano | 24.56 | 79.05 | 83.33 | 57.94 | 20.69 | 78.85 | 59.35 | 77.42 | 27.78 | |
GPT-5 | 45.61 | 32.04 | 33.33 | 45.97 | 58.14 | 32.35 | 46.71 | 33.33 | 38.89 | |
Llama3:8B | 0.00 | 77.45 | 75.00 | 81.15 | 7.23 | 14.85 | 88.67 | 1.12 | 0.00 | |
Precision | GPT-4o-mini | 68.18 | 70.31 | 38.89 | 76.98 | 64.00 | 81.69 | 81.87 | 74.70 | 65.12 |
GPT-4o | 78.95 | 82.98 | 30.00 | 83.02 | 79.41 | 87.50 | 84.48 | 83.33 | 78.57 | |
GPT-5-mini | 35.14 | 71.82 | 28.57 | 80.49 | 48.81 | 71.43 | 81.29 | 56.85 | 30.52 | |
Gemma3:4B | 58.33 | 66.25 | 33.33 | 86.54 | 63.16 | 69.35 | 80.23 | 100.00 | 30.00 | |
GPT-4.1-nano | 60.87 | 61.48 | 18.52 | 73.74 | 72.00 | 60.74 | 85.19 | 52.94 | 65.22 | |
GPT-5 | 70.27 | 97.06 | 40.00 | 90.48 | 62.50 | 94.29 | 84.52 | 68.89 | 50.00 | |
Llama3:8B | 0.00 | 70.54 | 16.07 | 72.26 | 54.55 | 83.33 | 79.17 | 100.00 | 0.00 | |
Accuracy | GPT-4o-mini | 79.50 | 73.50 | 92.00 | 74.50 | 70.50 | 70.50 | 80.50 | 74.00 | 79.50 |
GPT-4o | 77.00 | 63.00 | 90.00 | 72.00 | 66.50 | 60.00 | 82.50 | 65.50 | 81.00 | |
GPT-5-mini | 50.00 | 72.22 | 87.88 | 75.25 | 54.55 | 71.21 | 76.77 | 64.14 | 43.43 | |
Gemma3:4B | 76.68 | 61.66 | 92.75 | 56.48 | 64.77 | 61.14 | 76.68 | 55.96 | 66.84 | |
GPT-4.1-nano | 74.00 | 63.00 | 77.00 | 60.50 | 62.00 | 62.50 | 60.50 | 57.50 | 76.50 | |
GPT-5 | 78.68 | 63.96 | 92.89 | 62.94 | 66.50 | 63.96 | 52.28 | 61.42 | 72.59 | |
Llama3:8B | 69.74 | 71.28 | 74.36 | 68.72 | 57.95 | 54.36 | 73.33 | 54.87 | 73.33 |
References
- Shakhatreh, H.; Sawalmeh, A.H.; Al-Fuqaha, A.; Dou, Z.; Almaita, E.; Khalil, I.; Othman, N.S.; Khreishah, A.; Guizani, M. Unmanned Aerial Vehicles (UAVs): A Survey on Civil Applications and Key Research Challenges. IEEE Access 2019, 7, 48572–48634. [Google Scholar] [CrossRef]
- Floreano, D.; Wood, R.J. Science, technology and the future of small autonomous drones. Nature 2015, 521, 460–466. [Google Scholar] [CrossRef]
- Sun, J.; Li, B.; Jiang, Y.; Wen, C.y. A Camera-Based Target Detection and Positioning UAV System for Search and Rescue (SAR) Purposes. Sensors 2016, 16, 1778. [Google Scholar] [CrossRef]
- Federal Aviation Administration. Accidents and Incidents: UAS Operator Responsibilities. AIM §11-8-4. 2025. Available online: https://www.faa.gov/air_traffic/publications/atpubs/aim_html/chap11_section_8.html (accessed on 27 July 2025).
- Kasprzyk, P.J.; Konert, A. Reporting and Investigation of Unmanned Aircraft Systems (UAS) Accidents and Serious Incidents—Regulatory Perspective. J. Intell. Robot. Syst. 2021, 103, 3. [Google Scholar] [CrossRef]
- Ghasri, M.; Maghrebi, M. Factors affecting unmanned aerial vehicles’ safety: A post-occurrence exploratory data analysis of drones’ accidents and incidents in Australia. Saf. Sci. 2021, 139, 105273. [Google Scholar] [CrossRef]
- Johnson, C.; Holloway, C. A survey of logic formalisms to support mishap analysis. Reliab. Eng. Syst. Saf. 2003, 80, 271–291. [Google Scholar] [CrossRef]
- Federal Aviation Administration. 14 CFR § 107.9—Safety Event Reporting. 2025. Available online: https://www.ecfr.gov/current/title-14/part-107/section-107.9 (accessed on 27 July 2025).
- National Transportation Safety Board. The Investigative Process. 2025. Available online: https://www.ntsb.gov/investigations/process/Pages/default.aspx (accessed on 27 July 2025).
- Australian Transport Safety Bureau. Reporting Requirements for Remotely Piloted Aircraft (RPA). 2021. Available online: https://www.atsb.gov.au/reporting-requirements-rpa (accessed on 15 August 2025).
- European Parliament and Council of the European Union. Regulation (EU) No 376/2014 on the Reporting, Analysis and Follow-Up of Occurrences in Civil Aviation. 2014. Available online: https://eur-lex.europa.eu/eli/reg/2014/376/oj (accessed on 7 October 2025).
- European Commission. Commission Implementing Regulation (EU) 2019/947 of 24 May 2019 on the Rules and Procedures for the Operation of Unmanned Aircraft. 2019. Available online: https://eur-lex.europa.eu/eli/reg_impl/2019/947/oj (accessed on 7 October 2025).
- Grindley, B.; Phillips, K.; Parnell, K.J.; Cherrett, T.; Scanlan, J.; Plant, K.L. Over a decade of UAV incidents: A human factors analysis of causal factors. Appl. Ergon. 2024, 121, 104355. [Google Scholar] [CrossRef]
- Boyd, D.D. Causes and risk factors for fatal accidents in non-commercial twin engine piston general aviation aircraft. Accid. Anal. Prev. 2015, 77, 113–119. [Google Scholar] [CrossRef] [PubMed]
- Liu, Q.; Li, F.; Ng, K.K.H.; Han, J.; Feng, S. Accident Investigation via LLMs Reasoning: HFACS-guided Chain-of-Thoughts Enhance General Aviation Safety. Expert Syst. Appl. 2025, 269, 126422. [Google Scholar] [CrossRef]
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems: 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Virtual, 6–12 December 2020; Volume 33. Available online: https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf (accessed on 7 October 2025).
- Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.L.; Cao, Y.; Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. In Proceedings of the Advances in Neural Information Processing Systems: 37th Conference on Neural Information Processing Systems (NeurIPS 2023), New Orleans, LA, USA, 10–16 December 2023; Curran Associates, Inc.: Nice, France, 2023; Volume 36. [Google Scholar]
- Bran, A.M.; Cox, S.; Schilter, O.; Baldassari, C.; White, A.D.; Schwaller, P. Augmenting large language models with chemistry tools. Nat. Mach. Intell. 2024, 6, 525–535. [Google Scholar] [CrossRef]
- Lu, Y.; Xu, C.; Wang, Y. Joint Computation Offloading and Trajectory Optimization for Edge Computing UAV: A KNN-DDPG Algorithm. Drones 2024, 8, 564. [Google Scholar] [CrossRef]
- Chen, M.; Zhang, H.; Wan, C.; Zhao, W.; Xu, Y.; Wang, J.; Gu, X. On the effectiveness of large language models in domain-specific code generation. arXiv 2023, arXiv:2312.01639. [Google Scholar]
- Almusayli, A.; Zia, T.; Qazi, E.u.H. Drone Forensics: An Innovative Approach to the Forensic Investigation of Drone Accidents Based on Digital Twin Technology. Technologies 2024, 12, 11. [Google Scholar] [CrossRef]
- NASA Aviation Safety Reporting System. ASRS Database Fields. Overview of ASRS Coding form Fields and Their Display/Search Availability. 2024. Available online: https://akama.arc.nasa.gov/ASRSDBOnline/pdf/ASRS_Database_Fields.pdf (accessed on 7 October 2025).
- Wiegmann, D.A.; Shappell, S.A. A Human Error Analysis of Commercial Aviation Accidents Using the Human Factors Analysis and Classification System (HFACS); Technical Report DOT/FAA/AM-01/3; U.S. Department of Transportation, Federal Aviation Administration, Office of Aviation Medicine: Washington, DC, USA, 2001. [Google Scholar]
- Department of the Air Force. Human Factors Analysis and Classification System (DoD HFACS) Version 8.0 Handbook; Technical Report; Air Force Safety Center, Department of the Air Force: Kirtland AFB, NM, USA, 2023; Available online: https://navalsafetycommand.navy.mil/Portals/100/Documents/DoD%20Human%20Factors%20Analysis%20and%20Classification%20System%20%28HFACS%29%208.0.pdf (accessed on 7 October 2025).
- NASA Aviation Safety Reporting System. ASRS Coding Taxonomy (ASRS Coding Form). Official Field List and Categorical Taxonomy for ASRS Database Online. 2024. Available online: https://asrs.arc.nasa.gov/docs/dbol/ASRS_CodingTaxonomy.pdf (accessed on 7 October 2025).
- International Civil Aviation Organization. Annex 13 to the Convention on International Civil Aviation—Aircraft Accident and Incident Investigation, 13th ed.; International Civil Aviation Organization: Montreal, QC, Canada, 2024. [Google Scholar]
- Wild, G.; Murray, J.; Baxter, G. Exploring civil drone accidents and incidents to help prevent potential air disasters. Aerospace 2016, 3, 22. [Google Scholar] [CrossRef]
- Tvaryanas, A.P.; Thompson, W.T.; Constable, S.H. US Military Unmanned Aerial Vehicle Mishaps: Assessment of the Role of Human Factors Using HFACS; U.S. Military Unmanned Aerial Vehicle Mishaps: Assessment of the Role of Human Factors Using Human Factors Analysis and Classification System (HFACS); Technical Report HSW-PE-BR-TR-2005-0001; United States Air Force, 311th Human Systems Wing, Performance Enhancement Directorate, Performance Enhancement Research Division: Brooks City-Base, TX, USA, 2005; Available online: https://archive.org/details/DTIC_ADA435063 (accessed on 7 October 2025).
- Federal Aviation Administration. Small Unmanned Aircraft Systems (UAS) Regulations (Part 107); Federal Aviation Administration: Washington, DC, USA, 2025. Available online: https://www.faa.gov/newsroom/small-unmanned-aircraft-systems-uas-regulations-part-107 (accessed on 15 August 2025).
- FAA Center of Excellence for Unmanned Aircraft Systems (ASSURE). Identify Flight Recorder Requirements for Unmanned Aircraft Systems Integration into the National Airspace System (Project A55)—Final Report; Technical Report; Mississippi State University: Starkville, MS, USA, 2024; Available online: https://assureuas.org/wp-content/uploads/2021/06/A55_Final-Report.pdf (accessed on 7 October 2025).
- Reason, J. Managing the Risks of Organizational Accidents; Ashgate: Aldershot, UK, 1997. [Google Scholar]
- Reason, J. Human error: Models and management. BMJ 2000, 320, 768–770. [Google Scholar] [CrossRef] [PubMed]
- Reason, J. The contribution of latent human failures to the breakdown of complex systems. Philos. Trans. R. Soc. Lond. Ser. B Biol. Sci. 1990, 327, 475–484. [Google Scholar] [CrossRef]
- Shappell, S.A.; Wiegmann, D.A. The Human Factors Analysis and Classification System (HFACS); Technical Report DOT/FAA/AM-00/7; Federal Aviation Administration, Office of Aviation Medicine: Washington, DC, USA, 2000. [Google Scholar]
- Wiegmann, D.A.; Shappell, S.A. A Human Error Approach to Aviation Accident Analysis: The Human Factors Analysis and Classification System; Ashgate: Aldershot, UK, 2003. [Google Scholar]
- El Safany, R.; Bromfield, M.A. A human factors accident analysis framework for UAV loss of control in flight. Aeronaut. J. 2025, 129, 1723–1749. [Google Scholar] [CrossRef]
- Alharasees, O.; Kale, U. Human Factors and AI in UAV Systems: Enhancing Operational Efficiency Through AHP and Real-Time Physiological Monitoring. J. Intell. Robot. Syst. 2025, 111, 5. [Google Scholar] [CrossRef]
- Tanguy, N.; Tulechki, N.; Urieli, A.; Hermann, E.; Raynal, C. Natural Language Processing for Aviation Safety Reports: From Classification to Interactive Analysis. Comput. Ind. 2016, 78, 80–95. [Google Scholar] [CrossRef]
- Chandra, C.; Jing, X.; Bendarkar, M.V.; Sawant, K.; Elias, L.R.; Kirby, M.; Mavris, D.N. AviationBERT: A Preliminary Aviation-Specific Natural Language Model. In Proceedings of the AIAA AVIATION 2023 Forum, San Diego, CA, USA, 12–16 June 2023. [Google Scholar] [CrossRef]
- Andrade, S.R.; Walsh, H.S. SafeAeroBERT: Towards a Safety-Informed Aerospace-Specific Language Model. In Proceedings of the AIAA Aviation 2023 Forum, San Diego, CA, USA, 12–16 June 2023. [Google Scholar] [CrossRef]
- Wang, L.; Chou, J.; Zhou, X.; Tien, A.; Baumgartner, D.M. AviationGPT: A Large Language Model for the Aviation Domain. arXiv 2023, arXiv:2311.17686. [Google Scholar] [CrossRef]
- Yang, C.; Huang, C. Natural Language Processing (NLP) in Aviation Safety: Systematic Review of Research and Outlook into the Future. Aerospace 2023, 10, 600. [Google Scholar] [CrossRef]
- Nanyonga, A.; Joiner, K.; Turhan, U.; Wild, G. Applications of Natural Language Processing in Aviation Safety: A Review and Qualitative Analysis. arXiv 2025, arXiv:2501.06210. [Google Scholar] [CrossRef]
- Taranto, M.T. A Human Factors Analysis of USAF Remotely Piloted Aircraft Mishaps. Master’s Thesis, Naval Postgraduate School, Monterey, CA, USA, 2013. Available online: https://hdl.handle.net/10945/34751 (accessed on 7 October 2025).
- Light, T.; Hamilton, T.; Pfeifer, S. Trends in U.S. Air Force Aircraft Mishap Rates (1950–2018); Technical Report RRA257-1; RAND Corporation, Santa Monica, CA, USA, 2020. Available online: https://www.rand.org/pubs/research_reports/RRA257-1.html (accessed on 7 October 2025).
- Zibaei, E.; Borth, R. Building Causal Models for Finding Actual Causes of Unmanned Aerial Vehicle Failures. Front. Robot. AI 2024, 11, 1123762. [Google Scholar] [CrossRef] [PubMed]
- Editya, A.S.; Ahmad, T.; Studiawan, H. Visual Instruction Tuning for Drone Accident Forensics. HighTech Innov. J. 2024, 5, 870–884. [Google Scholar] [CrossRef]
- Clothier, R.A.; Williams, B.P.; Fulton, N.L. Structuring the Safety Case for Unmanned Aircraft System Operations in Non-Segregated Airspace. Saf. Sci. 2015, 79, 213–228. [Google Scholar] [CrossRef]
- Bhatt, U.; Xiang, A.; Sharma, S.; Weller, A.; Taly, A.; Jia, Y.; Ghosh, J.; Puri, R.; Moura, J.M.F.; Eckersley, P. Explainable Machine Learning in Deployment. In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency, Barcelona, Spain, 27–30 January 2020; pp. 648–657. [Google Scholar] [CrossRef]
- Bender, E.M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency, Virtual, 3–10 March 2021; pp. 610–623. [Google Scholar] [CrossRef]
- Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.J.; Madotto, A.; Fung, P. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. 2023, 55, 248. [Google Scholar] [CrossRef]
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Proceedings of the Advances in Neural Information Processing Systems: 36th Conference on Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA, 28 November–9 December 2022; Volume 35, pp. 24824–24837. Available online: https://proceedings.neurips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html (accessed on 7 October 2025).
- Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Narang, S.; Chowdhery, A.; Zhou, D. Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv 2022, arXiv:2203.11171. [Google Scholar] [CrossRef]
- Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large Language Models Are Zero-Shot Reasoners. In Proceedings of the Advances in Neural Information Processing Systems: 36th Conference on Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA, 28 November–9 December 2022; Volume 35, pp. 22199–22213. [Google Scholar]
- Zhang, Z.; Zhang, A.; Li, M.; Smola, A. Automatic Chain of Thought Prompting in Large Language Models. arXiv 2022, arXiv:2210.03493. [Google Scholar] [CrossRef]
- Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; et al. Large Language Models Encode Clinical Knowledge. Nature 2023, 620, 172–180. [Google Scholar] [CrossRef]
- Wu, Q.; Bansal, G.; Zhang, J.; Wu, Y.; Li, B.; Zhu, E.; Jiang, L.; Zhang, X.; Zhang, S.; Liu, J.; et al. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arXiv 2023, arXiv:2308.08155. [Google Scholar] [CrossRef]
- Park, J.S.; O’Brien, J.C.; Cai, C.J.; Morris, M.R.; Liang, P.; Bernstein, M.S. Generative Agents: Interactive Simulacra of Human Behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST ’23), San Francisco, CA, USA, 29 October–1 November 2023; ACM: New York, NY, USA, 2023; 22p. [Google Scholar] [CrossRef]
- Hong, S.; Zhuge, M.; Chen, J.; Zheng, X.; Cheng, Y.; Zhang, C.; Wang, J.; Wang, Z.; Yau, S.K.S.; Lin, Z.; et al. MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework. arXiv 2023, arXiv:2308.00352. [Google Scholar] [CrossRef]
- Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar] [CrossRef]
- Asai, A.; Wu, Z.; Wang, Y.; Sil, A.; Hajishirzi, H. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv 2023, arXiv:2310.11511. [Google Scholar] [CrossRef]
- Jiang, Z.; Xu, F.; Gao, L.; Sun, Z.; Liu, Q.; Dwivedi-Yu, J.; Yang, Y.; Callan, J.; Neubig, G. Active Retrieval Augmented Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), Singapore, 6–10 December 2023; pp. 7969–7992. [Google Scholar] [CrossRef]
- Edge, D.; Trinh, H.; Cheng, N.; Bradley, J.; Chao, A.; Mody, A.; Truitt, S.; Metropolitansky, D.; Ness, R.O.; Larson, J. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv 2024, arXiv:2404.16130. [Google Scholar] [CrossRef]
- Leveson, N.G. An Introduction to System Safety Engineering; MIT Press: Cambridge, MA, USA, 2023. [Google Scholar]
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2020), Virtual, 6–12 December 2020; Volume 33, pp. 9459–9474. [Google Scholar]
- National Transportation Safety Board. Aviation Accident Database & Synopses. 2025. Available online: https://www.ntsb.gov/Pages/AviationQueryV2.aspx (accessed on 7 October 2025).
- U.S. Government Accountability Office. Drones: FAA Should Improve Its Approach to Integrating Drones into the National Airspace System; Technical Report GAO-23-105189; U.S. Government Accountability Office: Washington, DC, USA, 2023. [Google Scholar]
- Braun, V.; Clarke, V. Using Thematic Analysis in Psychology. Qual. Res. Psychol. 2006, 3, 77–101. [Google Scholar] [CrossRef]
- Hasson, F.; Keeney, S.; McKenna, H. Research Guidelines for the Delphi Survey Technique. J. Adv. Nurs. 2000, 32, 1008–1015. [Google Scholar] [CrossRef]
- Diamond, I.R.; Grant, R.C.; Feldman, B.M.; Pencharz, P.B.; Ling, S.C.; Moore, A.M.; Wales, P.W. Defining Consensus: A Systematic Review Recommends Methodologic Criteria for Reporting of Delphi Studies. J. Clin. Epidemiol. 2014, 67, 401–409. [Google Scholar] [CrossRef]
- Landis, J.R.; Koch, G.G. The Measurement of Observer Agreement for Categorical Data. Biometrics 1977, 33, 159–174. [Google Scholar] [CrossRef] [PubMed]
- Lawshe, C.H. A Quantitative Approach to Content Validity. Pers. Psychol. 1975, 28, 563–575. [Google Scholar] [CrossRef]
- Leveson, N.G. Engineering a Safer World: Systems Thinking Applied to Safety; MIT Press: Cambridge, MA, USA, 2011. [Google Scholar]
- Pearl, J. Causality: Models, Reasoning, and Inference, 2nd ed.; Cambridge University Press: Cambridge, UK, 2009. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).