Article

A Multi-Head Attention-Based Transformer Model for Predicting Causes in Aviation Incidents

by Aziida Nanyonga 1, Hassan Wasswa 2, Keith Joiner 3, Ugur Turhan 4 and Graham Wild 4,*

1 School of Engineering and Technology, University of New South Wales, Canberra, ACT 2600, Australia
2 School of Systems and Computing, University of New South Wales, Canberra, ACT 2600, Australia
3 Capability Systems Centre, University of New South Wales, Canberra, ACT 2600, Australia
4 School of Science, University of New South Wales, Canberra, ACT 2612, Australia
* Author to whom correspondence should be addressed.
Modelling 2025, 6(2), 27; https://doi.org/10.3390/modelling6020027
Submission received: 15 February 2025 / Revised: 19 March 2025 / Accepted: 21 March 2025 / Published: 25 March 2025

Abstract:
The timely identification of probable causes in aviation incidents is crucial for averting future tragedies and safeguarding passengers. Typically, investigators rely on flight data recorders; however, delays in data retrieval or damage to the devices can impede progress. In such instances, experts resort to supplementary sources like eyewitness testimonies and radar data to construct analytical narratives. Delays in this process have tangible consequences, as evidenced by the Boeing 737 MAX accidents involving Lion Air and Ethiopian Airlines, where the same design flaw resulted in catastrophic outcomes. To streamline investigations, scholars advocate for natural language processing (NLP) and topic modelling methodologies, which organize pertinent aviation terms for rapid analysis. However, existing techniques lack a direct mechanism for deducing probable causes. To bridge this gap, this study trains and evaluates the performance of a transformer-based model in predicting the likely causes of aviation incidents based on long-input raw text analysis narratives. Unlike traditional models that classify incidents into predefined categories such as human error, weather conditions, or maintenance issues, the trained model infers and generates the likely cause as a human-like narrative, providing a more interpretable and contextually rich explanation. By training the model on comprehensive aviation incident investigation reports, such as those from the National Transportation Safety Board (NTSB), the proposed approach exhibits promising performance across key evaluation metrics, including BERTScore with precision (M = 0.749, SD = 0.109), recall (M = 0.772, SD = 0.101), and F1-score (M = 0.758, SD = 0.097); Bilingual Evaluation Understudy (BLEU) with (M = 0.727, SD = 0.33); Latent Semantic Analysis (LSA) similarity with (M = 0.696, SD = 0.152); and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) with precision, recall, and F-measure scores of (M = 0.666, SD = 0.217), (M = 0.610, SD = 0.211), and (M = 0.618, SD = 0.192) for rouge-1; (M = 0.488, SD = 0.264), (M = 0.448, SD = 0.257), and (M = 0.452, SD = 0.248) for rouge-2; and (M = 0.602, SD = 0.241), (M = 0.553, SD = 0.235), and (M = 0.5560, SD = 0.220) for rouge-L, respectively. This demonstrates its potential to expedite investigations by promptly identifying probable causes from analysis narratives, thus bolstering aviation safety protocols.

1. Introduction

Establishing the cause of an aviation incident or accident to prevent it from re-occurring in the future is the core goal of any aviation safety occurrence investigation and analysis. Hereafter, aviation accidents will be considered a subset of aviation incidents. Conventionally, whenever an investigation is deemed necessary in the event of an aviation incident or accident, the primary sources of information are usually the Cockpit Voice Recorder (CVR) and Flight Data Recorder (FDR) devices [1]. Data from these two devices are vital in giving an account of what was happening within the cockpit and the input to the aircraft received from the pilot, respectively, minutes before and at the time of the incident. However, retrieving these two devices can take months or even years, and in the worst case the devices become severely damaged during or after the incident, making the data irretrievable [2]. In such cases, where the data on the devices do not give conclusive findings or are not readily available for the investigation to start, the experts often divert their attention to other sources, which can include eyewitnesses, pilot reports, air traffic controllers, satellite images, radar information, damaged aircraft components, and weather station readings at the time of the incident [3]. This gathered information is often prepared and presented as a narrative describing the series of events and conditions under which the incident occurred. This information is then analysed by experts to establish the likely cause of the incident [4], allowing them to suggest possible measures that can deter such incidents from happening again.
However, this entire process is time-consuming, and in the event of a design flaw, until the cause is established and a preventative measure designed and implemented, the lives of passengers flying with such an aircraft model remain at risk. As an example, the flaw in the design of the Boeing 737 MAX’s Manoeuvring Characteristics Augmentation System (MCAS), which in certain circumstances counteracted the pilots’ input, caused two fatal accidents: the crash of Lion Air Flight JT610 [5], followed five months later by that of Ethiopian Airlines Flight 302 (ET-302). Both aircraft crashed a few minutes after take-off, killing all 189 and 157 people on board, respectively [5,6]. If the cause of the Lion Air accident had been established quickly and acted upon appropriately, the ET-302 [6] accident would likely have been avoided. With the aim of shortening aviation incident investigation time and allowing the quick establishment of the cause, researchers have proposed various natural language processing (NLP) and topic modelling-based approaches such as latent Dirichlet allocation (LDA), latent semantic analysis (LSA), and parallel latent Dirichlet allocation (PLDA) [7,8,9,10,11]. These proposed schemes analyse aviation terms and group those with related meanings, or those connected to a given phase of flight, field of aviation, flight condition, and/or cause, into related topics. Such approaches could help the investigation team establish the area of concentration and, consequently, lead to a quicker establishment of the causes.
While previous studies have demonstrated remarkable success in predicting accident severity and identifying likely causal factors through classification and term clustering techniques, such as topic modelling, the outputs of these models often lack interpretability and fail to provide contextually rich explanations of probable causes. This study investigates the potential of a transformer-based model to infer the likely cause of aviation incidents in a human-like narrative, thereby enhancing interpretability and contextual depth by generating explanations directly from the initial analysis narrative. Transformer models like bidirectional encoder representations from transformers (BERT) [12] and its variants [13,14,15,16,17,18] have demonstrated cutting-edge performance across various challenging NLP tasks. Since analysis narratives often contain long textual paragraphs, the researchers hypothesized that the resulting model would produce enhanced performance if it built upon recent studies on long-input transformers [19,20,21,22], which have revealed that increasing the transformer’s input length positively correlates with model performance.
The training methodology employed for the transformer model in this study adheres to the core principles of language translation transformers. However, it diverges in a key aspect: rather than using the masked input to the transformer’s decoder as the target language during training, it uses the target probable cause. When evaluated on the NTSB dataset, the model demonstrated the potential of transformers to enhance the efficiency of aviation incident investigations by generating probable causes based on the analysis narrative. This study makes a two-fold contribution:
  • A generative model based on a multi-head attention transformer is trained and evaluated to generate the probable cause of an aviation incident based on the raw textual analysis of event narratives preceding, during, or following an accident. Given that the training dataset comprises both long and short input narratives, the model effectively processes diverse input lengths, thereby expediting the investigation process and enhancing air transport safety.
  • Many aviation incident datasets have instances with analysis narratives but no corresponding entries of the probable cause. Consequently, a significant number of instances are removed during data preprocessing, thereby reducing the volume of training data that could otherwise improve model learning. This data reduction negatively impacts the model’s ability to generalize to new instances. By using the model trained in this work, the missing probable cause entries can be inferred from the available analysis narratives, ultimately enhancing model performance and improving generalization to unseen data.
The rest of this paper is organized as follows: Section 2 presents a review of prior related work followed by Section 3 where a detailed description of our approach is presented. In Section 4, the findings of this study are presented. In Section 5, a detailed discussion of the findings is presented, highlighting the contributions and limitations of our study. Finally, Section 6 gives concluding remarks, highlighting the direction of future work.

2. Related Work

The utilization of machine learning and deep learning methods and techniques in aviation analysis and prediction has garnered increasing attention from aviation safety researchers. This interest is driven by objectives such as expediting aviation incident investigations, promptly determining the causes of incidents for swift mitigation of future occurrences, predicting incidents, and extracting knowledge to enhance air transport safety. This section delves into key prior studies that have employed AI-based techniques in alignment with aviation safety.
Burnett et al. [23] trained four conventional ML classifiers, including decision trees, KNN, SVM, and ANN with backpropagation, for the prediction of aviation injuries and fatalities. The authors employed a 10-fold cross-validation training approach and looked at how factors like pilots’ accumulated flight hours and age impacted the rate of injuries and fatalities. Experimental results revealed the ANN to be superior for the task when evaluated on datasets sourced from the Federal Aviation Administration (FAA) between 1975 and 2002 inclusive.
Nanyonga et al. [24,25] utilized NLP and other AI techniques to analyze text narratives, aiming to determine aircraft damage levels from safety incidents. Four learning models (long short-term memory (LSTM), bidirectional LSTM (BLSTM), gated recurrent units (GRU), and simple recurrent neural networks (sRNN)), together with hybrid architectures including GRU+LSTM and sRNN+BLSTM+GRU, were assessed on 27,000 NTSB reports. Results indicated all models achieved over 87.9% accuracy, surpassing random guessing (25%) for a four-class problem. However, LSTM- and RNN-based models have inherent limitations in handling long-text dependencies due to their sequential nature, which could hinder performance on longer aviation narratives.
Another study [26] assessed the risk created by various anomalies in aviation events using a proposed hybrid model comprising an SVM and several neural networks. In the four-step method, all events were first categorized into five risk-level groups, followed by the application of an SVM model to determine the link between textual event synopses and the resulting consequences. Next, the hybrid model was trained to capture the correlations between contextual event attributes and risk-level groups. A fusion rule was then proposed to combine the outcomes of the two models and, finally, a stochastic-based decision tree was used to predict the risk level. The limitation of this approach lies in its dependence on a hybrid of traditional ML models, which may not capture complex relationships in textual data as effectively as deep learning models like transformers.
Zhang and Mahadevan [27] and Valdés et al. [28] deployed Bayesian inference-based techniques for aviation incident modelling and analysis. The study by Valdés et al. [28] aimed to forecast aircraft safety incidents by employing an inventive statistical method that utilized Bayesian inference and hierarchical structures to build learning models of varying complexities and goals. In contrast, Zhang and Mahadevan [27] focused on analysing commercial aviation accidents spanning the period between 1982 and 2006, as documented by the NTSB. This second study proposed a four-phase approach to build a Bayesian network capable of capturing the relationship between the sequence of events that led to the accidents. The methodology encompassed creating a graphical representation for visualizing aviation accident events and forming a Bayesian network representation by amalgamating the graphical representations of all accidents while accounting for the causal and dependent relationships between aircraft damage and personnel injury.
In study [29], two models—ResNet and simple RNN—were trained and evaluated to classify the phase of flight during which the incident happened. Various NLP-based techniques were sequentially deployed, including word tokenization; removal of punctuation, unwanted characters, and stopwords; lemmatization; and word2vec transformation of the unstructured textual analysis narratives extracted from the NTSB aviation incident investigation reports. The models recorded a classification accuracy of more than 68% on a seven-class classification problem.
In study [30], Nanyonga et al. carried out a comparative study of two topic modelling techniques, LDA and non-negative matrix factorization (NMF), on aviation accident reports. Using the coherence value for performance evaluation, the quality of the generated topics was assessed, with LDA displaying superior topic coherence and indicating its robustness in extracting semantic connections among words within topics. NMF, on the other hand, showcased exceptional performance in generating unique and detailed topics, facilitating a more targeted examination of particular aspects of aviation accidents.
The study in [31] showcased an automated text classification approach utilizing machine learning that could enhance analysts’ efficiency by accurately categorizing the “Occurrence” in aviation incident reports, thereby enabling more precise querying of reporting databases. Using a random forest algorithm to classify more than 45,000 textual reports, an accuracy of 80–93% was recorded based on the ICAO “Occurrence” category. The authors also conducted text cleaning that encompassed standard NLP techniques, including stemming and removal of irrelevant words and symbols such as stop words, punctuation characters, and other special symbols, and then deployed n-gram techniques (bi-grams, tri-grams, etc.) for feature extraction prior to passing the reports to the ML algorithm for classification.
Studies including [32,33,34,35] deployed NLP-based techniques including topic modelling and text classification for information extraction from and analysis of aviation incident reports and have reported competitive results regarding causal factor analysis like human factor analysis, aviation incident risk classification, aircraft damage classification, aviation report clustering and grouping, and many other AI-based tasks. While these studies offer valuable insights into the use of NLP for aviation safety, they often rely on conventional NLP techniques such as LDA or SVMs, which are limited in their ability to capture nuanced contextual relationships in lengthy incident narratives.
Recent studies have proposed innovative approaches to improving predictive models in aviation and engineering. Ning Shen et al. [36] developed a combined model for aero-engine life prediction, integrating long short-term memory (LSTM) networks with a multi-headed attention mechanism and an autoregressive integrated moving average (ARIMA) model. This hybrid approach enhances accuracy in predicting the remaining useful life (RUL) of engines by focusing on critical time features and leveraging the advantages of linear feature extraction.
Also, Zhaofei Li et al. [37] introduced the transformer–TCN self-attention network (TTSNet) for RUL prediction of aircraft engines, utilizing exponential smoothing, normalization, and multi-branch attention mechanisms to capture both global and local time-series features. Their model demonstrated superior performance compared to traditional methods, as evidenced by improved RMSE and score metrics across various datasets. Xiuxun Liu et al. [38] proposed an enhanced GNSS positioning error estimation model based on a multi-layer perceptron and attention mechanisms. Their approach significantly reduced positioning errors, outperforming state-of-the-art models like LSTM and CNN. Additionally, Huali Cai et al. [39] addressed the challenge of multi-label classification for customer complaints in the aviation industry. They introduced the MAG model, which combines BERT, attention mechanisms, and multi-channel feature extraction networks to improve classification accuracy. Their model’s ability to better extract text features and learn inter-feature relationships provides new insights into the optimization of service quality in aviation.
While previous NLP architectures such as LSTMs have been widely used for text generation tasks, they exhibit limitations when handling long-text dependencies due to their sequential nature. The multi-head attention mechanism in transformers allows for better parallelization and contextual understanding, making them well-suited for analyzing extensive aviation narratives. Additionally, long-input transformer models have demonstrated superior performance in processing lengthy text sequences [22,40], making them an optimal choice for our task.
One research gap revealed in our literature review concerns attention-based transformers. Despite the attention-based transformer models achieving outstanding performance on various NLP tasks, including machine translation [41,42,43], text summarization [44,45], text simplification [46,47], grammatical error correction [48,49], and question answering [50], little-to-no attention has been paid to their deployment in the field of aviation safety to establish the likely causes of an aviation incident given the raw text analysis narrative. The work in this study aims to close this knowledge gap by proposing and training a transformer-based model for such tasks.

3. Proposed Approach

3.1. Dataset

Several aviation and transport safety agencies, such as the Australian Transport Safety Bureau (ATSB), the Aviation Safety Reporting System (ASRS), and the NTSB, actively gather and release reports detailing aviation incident investigations. This study utilized aviation incident reports provided by the NTSB. These reports, along with accompanying metadata, are available on the NTSB’s website in a variety of formats, such as monthly published .pdf documents, .json files, or by querying individual reports through their online platform. A summarized version in .csv format can also be obtained. For this study, the researchers focused on .json files containing detailed incident investigations from the years 2001 to 2020. Importantly, only incidents whose investigations had been concluded were included, resulting in a dataset comprising 29,676 cases. From each report, the analysisNarrative and probableCause sections were extracted to facilitate the model training and validation processes. Additionally, a comprehensive statistical analysis was carried out to assess the distribution of text lengths within these fields. It was found that the average length of the analysisNarrative was 1116 words, while the probableCause field averaged 165 words, with standard deviations of 858.36 and 93.12, respectively. Further examination revealed that the shortest analysisNarrative entry contained only 4 words, while the longest reached 36,544 words. In comparison, the probableCause field ranged from 7 to 1600 words. Figure 1 and Figure 2 visually represent the distribution of text lengths for the analysisNarrative and probableCause fields, respectively.
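As an illustration of this extraction and length analysis, the following sketch (not the authors’ code) loads hypothetical NTSB .json report files and computes the word-count statistics described above; the directory name and JSON layout are assumptions, while the field names analysisNarrative and probableCause come from the reports.

```python
# Illustrative sketch (not the authors' code): load NTSB .json reports and
# reproduce the text-length statistics described above. The directory name and
# JSON layout are assumptions.
import glob
import json

import pandas as pd

records = []
for path in glob.glob("ntsb_reports/*.json"):  # hypothetical location of the .json files
    with open(path, "r", encoding="utf-8") as f:
        report = json.load(f)
    # keep only reports that contain both fields of interest
    if report.get("analysisNarrative") and report.get("probableCause"):
        records.append({
            "analysisNarrative": report["analysisNarrative"],
            "probableCause": report["probableCause"],
        })

df = pd.DataFrame(records)
for col in ["analysisNarrative", "probableCause"]:
    lengths = df[col].str.split().str.len()
    print(col, "mean:", lengths.mean(), "std:", lengths.std(),
          "min:", lengths.min(), "max:", lengths.max())
```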

3.2. Data Pre-Processing

Data pre-processing involved removing HTML tags and URLs and transforming wrongly encoded characters; that is, characters encoded with their ASCII equivalent codes were decoded to their natural language characters. Although several statistical methods, including the five-point summary-based multiplier method, the Z-score method, and extreme percentile methods, can be employed to establish the upper bound for outlier removal, given the highly skewed distribution of text length, the five-point summary-based multiplier method and the Z-score method were found to be inefficient for this study. Consequently, the extreme percentile method was utilized to determine the upper bound for out-of-range text narratives. The 99.9th percentile (P99.9) was applied, yielding upper bounds of 10,022.45 for analysisNarrative and 909.71 for probableCause. These values were subsequently rounded to 10,000 and 1000, respectively, and reports whose analysisNarrative entries were longer than 10,000 words, or whose probableCause entries were longer than 1000 words, were treated as outliers and discarded for this study.
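A minimal sketch of this extreme-percentile filter, assuming the extracted reports already sit in a pandas DataFrame df with the two text columns, might look as follows; the thresholds mirror the rounded bounds reported above.

```python
# Minimal sketch of the extreme-percentile outlier filter, assuming df is the
# DataFrame of extracted reports from the previous step.
narr_len = df["analysisNarrative"].str.split().str.len()
cause_len = df["probableCause"].str.split().str.len()

# 99.9th percentile upper bounds (the paper reports ~10,022 and ~910 words)
print(narr_len.quantile(0.999), cause_len.quantile(0.999))

# discard reports exceeding the rounded bounds of 10,000 and 1,000 words
df = df[(narr_len <= 10_000) & (cause_len <= 1_000)].reset_index(drop=True)
```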

3.3. The Transformer

The transformer architecture, introduced in study [40], represented a revolutionary advancement in the NLP domain, yielding remarkable outcomes. Departing from traditional RNN models, the transformer employs multi-head self-attention, enabling parallel processing and overcoming the limitations of sequential training inherent in conventional RNNs. This self-attention mechanism not only enhances computational efficiency but also captures intricate dependencies among various text components. As described by the authors, the attention process involves associating a given query (Q) with key (K)–value (V) pairs for sequence generation. Within this framework, Q, K, V, and the prediction are expressed as vectors. The resulting sequence is computed through a weighted summation of the V entries, with each value’s weight determined by passing a scaled dot-product of the Q and K vectors through a softmax function, as depicted in Equation (1). Figure 3 shows the architecture of the transformer model and the architectural components of its encoder, decoder, and output blocks.
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$   (1)
where:
  • $d_k$ is the projection dimension of the keys, K;
  • $T$ denotes the transpose of K, allowing matrix multiplication.
In order to facilitate concurrent processing, the multi-head self-attention mechanism utilizes several linear projections of Q, K, and V, each mapped to dimensions $d_k$, $d_k$, and $d_v$, respectively. These parallel operations generate outputs within the $d_v$-dimensional space, which are combined and mapped again to derive the ultimate V entries. This approach results in a model capable of simultaneously attending to information across many representational vector subspaces at various locations. The multi-head self-attention mechanism, featuring p heads, is defined as presented in Equation (2).
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_p)\,W_a$   (2)
where
  • $W_a \in \mathbb{R}^{p d_v \times d_{model}}$ represents a learnable weight matrix that is used to project the concatenated outputs of all attention heads into the desired output dimension.
  • $\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i)$
  • $d_{model}$ is the dimension of the input and output embeddings throughout the transformer model.
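To make Equations (1) and (2) concrete, the following PyTorch sketch implements scaled dot-product attention and a multi-head attention module along the lines of the standard formulation in [40]; it is an illustrative reconstruction, not the authors’ implementation, and the dimensions shown (d_model = 1024, 8 heads) simply mirror the configuration reported in Section 3.4.

```python
# Illustrative PyTorch reconstruction of Equations (1) and (2); not the authors' code.
import math

import torch
import torch.nn as nn


def scaled_dot_product_attention(Q, K, V):
    """Equation (1): softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V


class MultiHeadAttention(nn.Module):
    """Equation (2): p parallel heads, concatenated and projected by W_a."""

    def __init__(self, d_model=1024, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_a = nn.Linear(d_model, d_model)  # projection of the concatenated heads

    def forward(self, q, k, v):
        batch, seq_len, _ = q.shape

        def split(x):  # (batch, seq, d_model) -> (batch, heads, seq, d_head)
            return x.view(batch, -1, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))
        heads = scaled_dot_product_attention(q, k, v)      # per-head attention
        concat = heads.transpose(1, 2).reshape(batch, seq_len, -1)
        return self.w_a(concat)                            # MultiHead output
```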

3.4. Experimental Setup

Various studies, including [51,52,53], have demonstrated that training large transformer models has led to significant advancements in natural language processing (NLP) and computer vision. Specifically, these studies trained multi-head attention-based transformers with 8 attention heads and exhibited improved performance. Consequently, the model architecture comprised 8 attention heads, with an embedding dimension of 1024 [54,55] to accommodate long input sequences, and 8 encoder and 8 decoder layers to handle the inherent complexity of the long raw analysis narrative inputs. A dropout rate of 0.1 was applied at each batch normalization layer and feed-forward layer to enhance regularization and mitigate overfitting. Additionally, the inner-layer dimension of the feed-forward network was configured to 2048, and the vocabulary sizes for both the analysisNarrative and probableCause components were set to 100,000, determined using a word counter function. The model was trained on 90% of the dataset while the remaining 10% was used for testing, following study [56] in which this split ratio produced the best prediction results. Training was performed for 50 epochs using a learning rate of 0.001 with the Adam optimizer, betas set to (0.95, 0.96), epsilon set to $1 \times 10^{-10}$, a batch size of 64, and cross-entropy as the loss function.
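A hedged configuration sketch matching these reported hyperparameters (8 heads, embedding dimension 1024, 8 encoder and 8 decoder layers, feed-forward dimension 2048, dropout 0.1, Adam with the stated betas and learning rate) is shown below; it uses PyTorch’s built-in nn.Transformer for illustration and omits positional encodings, masking, and the training loop, so it should be read as an approximation of the setup rather than the authors’ actual script.

```python
# Hedged configuration sketch mirroring the reported hyperparameters; not the
# authors' actual training script.
import torch
import torch.nn as nn

VOCAB_SIZE = 100_000  # vocabulary size for both narrative and cause, per the paper
D_MODEL = 1024

src_embed = nn.Embedding(VOCAB_SIZE, D_MODEL)   # analysisNarrative token embeddings
tgt_embed = nn.Embedding(VOCAB_SIZE, D_MODEL)   # probableCause token embeddings
transformer = nn.Transformer(
    d_model=D_MODEL, nhead=8,
    num_encoder_layers=8, num_decoder_layers=8,
    dim_feedforward=2048, dropout=0.1, batch_first=True,
)
generator = nn.Linear(D_MODEL, VOCAB_SIZE)      # decoder states -> vocabulary logits

params = (list(src_embed.parameters()) + list(tgt_embed.parameters())
          + list(transformer.parameters()) + list(generator.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-3, betas=(0.95, 0.96), eps=1e-10)
criterion = nn.CrossEntropyLoss()
```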
All coding was implemented in Python 3.8.10 on a server with 256 cores and 512 GB of RAM. The libraries used include Pandas (1.5.0) for reading and managing dataframes, Numpy (1.22.4) for performing mathematical operations and categorical data transformations, and Matplotlib for performance visualization. The PyTorch framework was used for building the model, while libraries including sentence-bleu, score, and rouge-score were used for model performance scoring. All experiments were conducted in a Jupyter notebook on a Linux server with 256 CPU cores and 256 GB of RAM, running Ubuntu (kernel 5.4.0-169-generic).

3.5. Performance Metrics

To evaluate the quality of the generated probable causes, four metrics commonly used in natural language generation tasks, such as text summarization, machine translation, question answering, and grammatical error correction, are used in this study.

3.5.1. Bilingual Evaluation Understudy (BLEU)

BLEU [57] deploys an n-gram based evaluation metric approach that is extensively utilized in machine translation assessment. It is precision-centric and assesses the degree of overlap between n-grams from the target and generated texts. This overlap is insensitive to word position, except for n-gram term associations. However, BLEU imposes a brevity penalty when the generated text is substantially shorter than the reference text. In addition to machine translation, BLEU finds application in problems where the input and output use the same natural language, including grammatical error correction [58,59], summarization [60,61], and text simplification [47,62], which involves rewriting a sentence into one or more simpler sentences. The BLEU score can be computed using Equation (3) [57].
$BLEU = BP \cdot \exp\left(\sum_{i=1}^{N} w_i \ln p_i\right)$   (3)
where:
  • $BP$ is the brevity penalty, calculated using Equation (4);
  • $w_i$ is the weight of the order-$i$ n-gram precision;
  • $p_i$ is the modified n-gram precision score of order $i$;
  • $N$ is the maximum n-gram order to consider.
$BP = \exp\left(1 - \frac{l_p}{l_{r,avg}}\right)$   (4)
where:
  • $l_p$ is the length of the predicted cause;
  • $l_{r,avg}$ is the average length of the reference cause.
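As an illustration, a BLEU score of this form can be computed with NLTK’s sentence_bleu using the weight vector w = (0.1, 0.1, 0, 0) adopted in Section 4.2; the reference and candidate strings below are hypothetical, and the exact scoring library used by the authors is identified only as “sentence-bleu” in Section 3.4.

```python
# Hypothetical BLEU computation with NLTK's sentence_bleu; the strings are made up.
from nltk.translate.bleu_score import sentence_bleu

reference = "loss of engine power due to fuel exhaustion".split()
candidate = "a loss of engine power as a result of fuel exhaustion".split()

# weight vector as discussed in Section 4.2: only uni-gram and bi-gram precision
score = sentence_bleu([reference], candidate, weights=(0.1, 0.1, 0, 0))
print(f"BLEU: {score:.3f}")
```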

3.5.2. Recall Oriented Understudy for Gisting Evaluation (ROUGE)

ROUGE [63] applies a definition similar to that of BLEU. However, unlike BLEU which emphasizes precision, ROUGE’s emphasis is on recall. ROUGE comes in three main versions [64,65]: n-rouge, primarily examining n-gram overlap (such as 2-rouge and 1-rouge for 2-grams and 1-gram, respectively); L-rouge, which evaluates the Longest Common Text Sub-sequence; and s-rouge, emphasizing skip grams. Like BLEU, ROUGE finds application in both machine translation and in problems where the input and output use the same natural language, including summarizing [66,67,68], grammatical error correction [65,69], and text simplification [70,71,72], which involves rewriting a sentence into one or more simpler sentences. For each of rouge-1, rouge-2 and rouge-L, the precision, recall, and F-measure are calculated using Equations (5)–(7) [63].
$\mathrm{Precision} = \frac{Count_{m,n\text{-}gram\text{-}ap}}{Count_{n\text{-}gram\text{-}p}}$   (5)
$\mathrm{Recall} = \frac{Count_{m,n\text{-}gram\text{-}ap}}{Count_{n\text{-}gram\text{-}a}}$   (6)
$F1\text{-}Score = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$   (7)
where
  • $Count_{m,n\text{-}gram\text{-}ap}$ is the number of n-grams from the target probable cause matching the predicted probable cause.
  • $Count_{n\text{-}gram\text{-}p}$ is the count of n-grams in the predicted probable cause.
  • $Count_{n\text{-}gram\text{-}a}$ is the count of n-grams in the actual probable cause.
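For illustration, the rouge-score package named in Section 3.4 exposes these rouge-1, rouge-2, and rouge-L precision, recall, and F-measure values directly; the example strings below are hypothetical.

```python
# Hypothetical rouge-1/rouge-2/rouge-L scoring with the rouge-score package.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="loss of engine power due to fuel exhaustion",              # reference cause
    prediction="a loss of engine power as a result of fuel exhaustion",  # generated cause
)
for name, s in scores.items():
    print(name, round(s.precision, 3), round(s.recall, 3), round(s.fmeasure, 3))
```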

3.5.3. Latent Semantic Analysis (LSA)

LSA [73], presented in 1997 by Landauer and Dumais [74], calculates the semantic similarity between a reference sentence and the model’s generated sentence. It relies on pre-computed word co-occurrence counts from a large corpus. Employing the bag of words (BOW) approach, it treats word order as irrelevant. Unlike ROUGE and BLEU, LSA is lenient on variations in word choice, such as “hard” versus “difficult”. In essence, LSA encodes sentences or documents into vectors using a bag of words technique. These vectors enable the computation of similarity metrics, such as cosine similarity, to assess the likeness between generated and target texts. Like BLEU and ROUGE, LSA has seen application in measuring the output quality of various natural language generation models, including text summarization, grammatical error correction, translation, and text simplification [75,76,77,78,79]. The cosine similarity between sequences $s_1$ and $s_2$ can be obtained by converting the sequences to numeric vectors, $v_1$ and $v_2$, and then using Equation (8) for the similarity calculation [80].
$\mathrm{Similarity}(v_1, v_2) = \frac{\mathrm{dot}(v_1, v_2)}{\lVert v_1 \rVert \times \lVert v_2 \rVert}$   (8)
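Equation (8) reduces to a few lines of NumPy; the vectors below are random placeholders standing in for sentence embeddings.

```python
# Equation (8) in NumPy; v1 and v2 are placeholder embedding vectors.
import numpy as np

def cosine_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

v1, v2 = np.random.rand(512), np.random.rand(512)  # placeholders for real embeddings
print(cosine_similarity(v1, v2))
```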

3.5.4. BERTScore

The BERTScore [81] performance metric was introduced as an alternative to conventional evaluation metrics, including BLEU, LSA, and ROUGE. It is particularly advantageous for assessing the quality of text summarization by measuring the semantic similarity between a generated summary and the original text. Traditional evaluation metrics often fail to accurately match paraphrases, as they rely on surface-level text comparisons and may misjudge semantically equivalent expressions that differ in wording, resulting in inaccurate performance assessments. Additionally, n-gram-based models are limited in their ability to capture long-range dependencies and tend to penalize meaningful reordering of text, further compromising evaluation accuracy.
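For reference, BERTScore precision, recall, and F1 can be obtained with the bert-score package; the candidate and reference strings below are hypothetical, and whether this particular package was the authors’ scorer is not stated.

```python
# Hypothetical BERTScore computation with the bert-score package; the strings
# are made up, and the package downloads a pretrained model on first use.
from bert_score import score

candidates = ["a loss of engine power as a result of fuel exhaustion"]
references = ["loss of engine power due to fuel exhaustion"]

P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(P.mean().item(), R.mean().item(), F1.mean().item())
```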

4. Results

For model inference, instances from the test set were fed into the model, and it generated a probable cause for each analysis narrative. Example cases are shown in Figure 4, where random samples of analysis narratives from the test set are passed to the model. The model generated almost semantically perfect probable causes for each input narrative.

4.1. Model Performance Based on the BERTScore

The BERTScore was computed for each test sample, and the average and standard deviation of the BERT-precision ($BERT_P$), BERT-recall ($BERT_R$), and BERT-F1 ($BERT_{F1}$) were recorded as (M = 0.749, SD = 0.109), (M = 0.772, SD = 0.101), and (M = 0.758, SD = 0.097), respectively. A scatter distribution of the obtained BERTScore values between (probable cause, predicted probable cause) pairs for the first 1000 incident narratives is shown in Figure 5.

4.2. Model Performance Based on the BLEU Score

The BLEU Score was used to measure how closely the predicted probable cause matched the reference probable cause. For each pair of sentences, BLEU gives a value between 0 and 1, with 1 indicating a perfect match. The minimum n-gram order was set to 1 while N was set to 4 for this work. After a series of evaluations with various random samples of size 500 from the test set, in comparison with results from other metrics, the weight vector w was set to (0.1, 0.1, 0, 0).
For each instance in our test set, the BLEU score was computed, recording a mean score of 0.727 with a standard deviation of ±0.330. A scatter distribution of the obtained BLEU scores between the first 1000 (probable cause, predicted probable cause) pairs is shown in Figure 6.

4.3. Model Performance Based on the LSA Similarity Score

LSA similarity gives the semantic similarity between vector representations of the output probable cause and the target probable cause. It represents semantic rather than lexical similarity; a high similarity score implies that the sequences have closer meanings. As with the BLEU scores, the (probable_cause, predicted_probable_cause) pair was obtained for each instance in the test set.
Each component of the pair was then converted into its numeric vector representation using Google’s pretrained Universal Sentence Encoder version 4, the latest version at the time of writing this paper. Universal Sentence Encoder models were introduced by Google researchers in study [82]; vector embeddings of semantically similar sentences are placed close to each other, so their likeness can be assessed with cosine similarity. The pretrained Universal Sentence Encoder model used in this work can be downloaded from the TensorFlow Hub; it was accessed on 11 December 2020 at https://tfhub.dev/google/universal-sentence-encoder/4. Our model recorded a mean LSA similarity score of 0.697 with a standard deviation of ±0.153. A distribution of the obtained similarity scores is visualized in Figure 7.
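A sketch of this embedding-and-similarity step, assuming tensorflow and tensorflow_hub are installed and using the same TensorFlow Hub module referenced above, might look as follows; the two strings are placeholders for an observed and a predicted probable cause.

```python
# Sketch of the embedding-and-similarity step using the Universal Sentence Encoder v4.
import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

pair = ["observed probable cause text", "predicted probable cause text"]  # placeholders
v1, v2 = embed(pair).numpy()
similarity = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
print(similarity)
```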

4.4. Model Performance Based on the ROUGE Scores

For the ROUGE scores, this study considered n-rouge (Rouge-1, Rouge-2) and L-rouge (Rouge-L). These scores measure the overlap of n-grams between the candidate and reference sentences. Rouge-1 gives the score from unigrams and Rouge-2 from bi-grams, while Rouge-L gives the score from the longest common sub-sequence. Higher scores indicate better overlap between the sentences.

4.5. analysisNarrative Length vs. BLEU/LSA Scores

Further investigations were carried out on how the length of the input analysis narrative impacted the model’s output in terms of the BLEU and LSA similarity scores. The results revealed that the analysis narrative length had no direct correlation with the model’s BLEU score, as shown in Figure 8. The LSA similarity score, on the other hand, shows no correlation with the length of the input analysis narrative for shorter inputs; however, it tends to converge to the mean score as the length of the analysis narrative increases, as shown in Figure 9. This finding supports the researchers’ hypothesis that working with long input sequences would enhance the model’s predictive performance, and echoes the findings of prior studies on long-input transformers, including [19,20,21,22], which revealed that increasing the transformer’s input length positively correlates with model performance.

5. Discussion

Having the ability to predict the probable causes of an aviation incident can greatly expedite the investigation process. The results from this study revealed that a multi-head attention-based transformer model is a viable tool for solving this problem. However, although the model recorded commendable results across all metrics, by their formulation the LSA similarity score is more reliable than the BLEU and ROUGE metrics. This is because the model’s output and the reference sentence can comprise different sets of words for the same semantic content. Since the LSA similarity score computes the overall semantic similarity between the sentences, it will more likely produce a high score if the two sentences are semantically similar, and vice versa. The BLEU score, on the other hand, requires that the weight vector w for each n-gram be manually determined. This means that the final BLEU score greatly depends on the accuracy of the values of w, which requires human expertise and, if wrongly determined, can lead to misleading results. Also, the computation of the BLEU and ROUGE scores (uni-gram, bi-gram, etc.) depends on the overlap of words between the reference and predicted sentences, that is, the observed probable cause and the predicted probable cause in this study.
For instance, considering the output in the screenshot in Figure 10, the reference probable cause as given in the dataset is as follows:
The mechanic’s improper maintenance of the main transmission aft pinion nut and belt drive system, which resulted in the uncoupling of the tail rotor driveshaft and the subsequent loss of helicopter control.
While the model’s prediction, given the same analysis narrative, is as follows:
The failure of the main rotor drive belts due to a loss of belt tension on the main rotor drive system as a result of maintenance personnel’s failure to properly secure the nut and the helicopters main rotor drive belts.
Although the semantic meanings of the two narratives are close and would both draw the incident investigator’s attention to the same component and attribute the failure to the maintenance personnel’s not properly securing the nut and belt drive system, BLEU scores differed across different weight vector values. On the other hand, the ROUGE scores were Rouge-1: precision = 0.476, recall = 0.606, F-measure = 0.533; Rouge-2: precision = 0.146, recall = 0.188, F-measure = 0.164; and Rouge-L: precision = 0.310, recall = 0.394, F-measure = 0.347 as shown in Table 1.
As can be seen in Table 2, the results from the BLEU score largely depend on the values of the vector w. It is also clear that the score greatly degrades when w contains entries for the tri-gram and quad-gram, which correspond to the third and fourth entries of w, respectively. The value is also misleading for very small entries of the uni-gram and bi-gram, as seen when w is set to (0.01, 0.01, 0, 0).
After a performance comparison with the results for BERTScore, LSA score, and ROUGE, and considering the BLEU scores for the different weight vectors in Table 2, w = (0.1, 0.1, 0, 0) was chosen for BLEU scoring.
Generally, the recorded scores in the case of the ROUGE metrics are relatively more reliable for Rouge-1 and Rouge-L. Rouge-2 recorded poorer performance because the word sequence in the reference text does not always overlap with the word sequence in the model’s output. For the example output in Figure 10, the recorded ROUGE scores are poor in terms of precision, recall, and F-measure for all three n-gram variants used in this study, despite the semantic meaning being very similar. On the other hand, because LSA returns the semantic similarity between two text sequences, its output is considerably high (0.757) for this particular example, indicating that despite the discrepancies in the sets of words used, the semantic meaning is greatly similar.
Finally, the LSA Similarity score’s input length-model performance analysis indicated that training the model with long inputs can result in stable model performance as the score converged to the mean score with increasing input length (see Figure 9).
Examining research article titles for machine learning keywords in aviation safety shows that there has been significant concurrent research on the topic. Table 3 lists 12 research publications alongside the current study, one of which is by the present team. The table also maps the key features of this study to show its significance against current research. Three of the studies share none of our selected features, such as a transformer-based approach to predicting probable cause.

Limitations

Training a highly efficient transformer model necessitates a substantial amount of training data, which posed a significant limitation in this study. As a result, while the model generated correct predictions for the majority of “analysisNarrative” entries, it also produced instances of insufficient information and, in some cases, incorrect predictions. Table 4 presents examples where the model generated incomplete or inaccurate predictions in relation to the observed “probableCause” entries.
The second limitation is that the primary objective of this study was restricted to assessing the effectiveness of the standard transformer model architecture in predicting the probable cause of an aviation incident based on the raw-text incident analysis narratives, while the exploration of alternative models, such as Longformer, LongT5, BigBird, and BigBirdPegasus, among others, was deferred to future research.
Finally, the standard multi-head self-attention transformer architecture exhibits quadratic scaling with respect to the input size, n, with a time complexity of $O(n^2 d_{model})$. Consequently, both the training and inference times in this study increased quadratically with the input size. Furthermore, the large projection dimension $d_{model}$, combined with hardware constraints—specifically, the use of a CPU instead of a GPU—contributed to the model’s slow average inference time of 0.322 s per instance.

6. Conclusions

Quick identification of the potential cause of an aviation incident is crucial for preventing future tragedies. While flight data recorders are commonly used, delays or damage can obstruct their effectiveness. The Boeing 737 MAX accidents with Lion Air and Ethiopian Airlines highlight the impact of such delays. To improve investigation efficiency, this study trained a transformer-based model for predicting the probable cause of an aviation incident given an analysis narrative of the series of events, which can be collected from sources including eyewitnesses, radar systems, the air traffic controllers who were in charge of the flight under investigation, maintenance history/logs, etc. The model was trained on extensive NTSB aviation incident reports and accommodates both short and long input narratives.
Despite the approach demonstrating potential in expediting investigations and enhancing aviation safety—evidenced by its strong performance across evaluation metrics such as BERTScore, BLEU, ROUGE, and LSA—the model’s output can be further improved through training on a more comprehensive dataset. As a direction for future research, analysis narratives from additional aviation investigation bureaus, such as the ATSB, could be integrated with the NTSB narratives. Retraining the model on this expanded dataset is expected to enhance its predictive capabilities and overall performance.
While the evaluation primarily focused on quantitative performance metrics, a critical next step is to incorporate expert assessment of the model’s predictions. Their feedback is essential for validating the model’s applicability, ensuring the accuracy and usefulness of the generated probable cause statements, and aligning the model with real-world aviation safety analysis and decision-making processes. Furthermore, the model was designed to support aviation incident investigators during the preliminary analysis phase by generating probable causes from raw incident narratives, thereby enhancing the efficiency of the investigation process by providing quick insights. However, it is not intended for use during the data collection or final report preparation phases. Moreover, the deployment of AI models in safety-critical aviation systems requires adherence to established regulatory frameworks, such as those set by the FAA and the European Union Aviation Safety Agency (EASA). Future research is needed before use in operational contexts to further explore how the model aligns with existing regulatory standards, particularly in terms of transparency, interpretability, and reliability.

Author Contributions

A.N.: conceptualization, methodology, software, data curation, validation, writing—original draft preparation, H.W.: formal analysis, writing—original draft preparation, K.J.: writing, review, editing, and final draft, U.T.: writing, review, and editing, G.W.: visualization, supervision, final draft. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data analysed were from NTSB, which is publicly available on the NTSB website.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Vidović, A.; Franjić, A.; Štimac, I.; Ban, M.O. The importance of flight recorders in the aircraft accident investigation. Transp. Res. Procedia 2022, 64, 183–190. [Google Scholar] [CrossRef]
  2. Wild, G. Airbus A32x Versus Boeing 737 Safety Occurrences. IEEE Aerosp. Electron. Syst. Mag. 2023, 38, 4–12. [Google Scholar] [CrossRef]
  3. Johnson, C. A Handbook of Incident and Accident Reporting; Glasgow University Press: Glasgow, UK, 2003; Volume 115. [Google Scholar]
  4. Dong, T.; Yang, Q.; Ebadi, N.; Luo, X.R.; Rad, P. Identifying incident causal factors to improve aviation transportation safety: Proposing a deep learning approach. J. Adv. Transp. 2021, 2021, 5540046. [Google Scholar] [CrossRef]
  5. Levin, A.; Suhartono, H. Pilot Who Hitched a Ride Saved Lion Air 737 Day Before Deadly Crash. Bloomberg 2019, 19, 2019. [Google Scholar]
  6. Dahal, S. Letting go and saying goodbye: A Nepalese family’s decision, in the Ethiopian Airline crash ET-302. Forensic Sci. Res. 2022, 7, 383–384. [Google Scholar] [CrossRef] [PubMed]
  7. Nanyonga, A.; Wasswa, H.; Turhan, U.; Joiner, K.; Wild, G. Exploring Aviation Incident Narratives Using Topic Modeling and Clustering Techniques. In Proceedings of the 2024 IEEE Region 10 Symposium (TENSYMP), New Delhi, India, 27–29 September 2024; pp. 1–6. [Google Scholar]
  8. Ahmad, F.; de la Chica, S.; Butcher, K.; Sumner, T.; Martin, J.H. Towards automatic conceptual personalization tools. In Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries, Vancouver, BC, Canada, 18–23 June 2007; pp. 452–461. [Google Scholar]
  9. Jelodar, H.; Wang, Y.; Yuan, C.; Feng, X.; Jiang, X.; Li, Y.; Zhao, L. Latent Dirichlet allocation (LDA) and topic modeling: Models, applications, a survey. Multimed. Tools Appl. 2019, 78, 15169–15211. [Google Scholar] [CrossRef]
  10. Kuhn, K.D. Using structural topic modeling to identify latent topics and trends in aviation incident reports. Transp. Res. Part C Emerg. Technol. 2018, 87, 105–122. [Google Scholar] [CrossRef]
  11. Li, Z.; Zhang, H.; Wang, S.; Huang, F.; Li, Z.; Zhou, J. Exploit latent Dirichlet allocation for collaborative filtering. Front. Comput. Sci. 2018, 12, 571–581. [Google Scholar] [CrossRef]
  12. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  13. Nanyonga, A.; Wasswa, H.; Wild, G. Phase of flight classification in aviation safety using lstm, gru, and bilstm: A case study with asn dataset. In Proceedings of the 2023 International Conference on High Performance Big Data and Intelligent Systems (HDIS), Macau, China, 6–8 December 2023; pp. 24–28. [Google Scholar]
  14. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog. 2019, 1, 9. [Google Scholar]
  15. Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; Liu, Q. Tinybert: Distilling bert for natural language understanding. arXiv 2019, arXiv:1909.10351. [Google Scholar]
  16. Nanyonga, A.; Wasswa, H.; Wild, G. Aviation Safety Enhancement via NLP & Deep Learning: Classifying Flight Phases in ATSB Safety Reports. In Proceedings of the 2023 Global Conference on Information Technologies and Communications (GCITC), Bangalore, India, 1–3 December 2023; pp. 1–5. [Google Scholar]
  17. Liu, X.; Duh, K.; Gao, J. Stochastic answer networks for natural language inference. arXiv 2018, arXiv:1804.07888. [Google Scholar]
  18. Nanyonga, A.; Wasswa, H.; Wild, G. Comparative Study of Deep Learning Architectures for Textual Damage Level Classification. In Proceedings of the 2024 11th International Conference on Signal Processing and Integrated Networks (SPIN), Noida, India, 21–22 March 2024; pp. 421–426. [Google Scholar]
  19. Ainslie, J.; Ontanon, S.; Alberti, C.; Cvicek, V.; Fisher, Z.; Pham, P.; Ravula, A.; Sanghai, S.; Wang, Q.; Yang, L. ETC: Encoding long and structured inputs in transformers. arXiv 2020, arXiv:2004.08483. [Google Scholar]
  20. Zaheer, M.; Guruganesh, G.; Dubey, K.A.; Ainslie, J.; Alberti, C.; Ontanon, S.; Pham, P.; Ravula, A.; Wang, Q.; Yang, L.; et al. Big bird: Transformers for longer sequences. Adv. Neural Inf. Process. Syst. 2020, 33, 17283–17297. [Google Scholar]
  21. Katharopoulos, A.; Vyas, A.; Pappas, N.; Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In Proceedings of the International Conference on Machine Learning, PMLR, Vienna, Austria, 12–18 July 2020; pp. 5156–5165. [Google Scholar]
  22. Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The long-document transformer. arXiv 2020, arXiv:2004.05150. [Google Scholar]
  23. Burnett, R.A.; Si, D. Prediction of injuries and fatalities in aviation accidents through machine learning. In Proceedings of the International Conference on Compute and Data Analysis, Lakeland, FL, USA, 19–23 May 2017; pp. 60–68. [Google Scholar]
  24. Nanyonga, A.; Wasswa, H.; Turhan, U.; Molloy, O.; Wild, G. Sequential Classification of Aviation Safety Occurrences with Natural Language Processing. In Proceedings of the AIAA AVIATION 2023 Forum, San Diego, CA, USA, 12–16 June 2023; p. 4325. [Google Scholar]
  25. Nanyonga, A.; Joiner, K.; Turhan, U.; Wild, G. Applications of natural language processing in aviation safety: A review and qualitative analysis. In Proceedings of the AIAA SCITECH 2025 Forum, Orlando, FL, USA, 6–10 January 2025; p. 2153. [Google Scholar]
  26. Zhang, X.; Mahadevan, S. Ensemble machine learning models for aviation incident risk prediction. Decis. Support Syst. 2019, 116, 48–63. [Google Scholar]
  27. Zhang, X.; Mahadevan, S. Bayesian network modeling of accident investigation reports for aviation safety assessment. Reliab. Eng. Syst. Saf. 2021, 209, 107371. [Google Scholar]
  28. Valdés, R.M.A.; Comendador, V.F.G.; Sanz, L.P.; Sanz, A.R. Prediction of aircraft safety incidents using Bayesian inference and hierarchical structures. Saf. Sci. 2018, 104, 216–230. [Google Scholar]
  29. Nanyonga, A.; Wasswa, H.; Molloy, O.; Turhan, U.; Wild, G. Natural language processing and deep learning models to classify phase of flight in aviation safety occurrences. In Proceedings of the 2023 IEEE Region 10 Symposium (TENSYMP), Canberra, Australia, 6–8 September 2023; pp. 1–6. [Google Scholar]
  30. Nanyonga, A.; Wasswa, H.; Wild, G. Topic Modeling Analysis of Aviation Accident Reports: A Comparative Study between LDA and NMF Models. In Proceedings of the 2023 3rd International Conference on Smart Generation Computing, Communication and Networking (SMART GENCON), Bangalore, India, 29–31 December 2023; pp. 1–2. [Google Scholar]
  31. de Vries, V. Classification of aviation safety reports using machine learning. In Proceedings of the 2020 International Conference on Artificial Intelligence and Data Analytics for Air Transportation (AIDA-AT), Singapore, 3–4 February 2020; pp. 1–6. [Google Scholar]
  32. Buselli, I.; Oneto, L.; Dambra, C.; Gallego, C.V.; Martínez, M.G.; Smoker, A.; Martino, P.R. Natural Language Processing and Data-Driven Methods for Aviation Safety and Resilience: From Extant Knowledge to Potential Precursors. Open Research Europe. 2021. Available online: https://www.sesarju.eu/sites/default/files/documents/sid/2021/papers/SIDs_2021_paper_50.pdf (accessed on 24 March 2025).
  33. Tanguy, L.; Tulechki, N.; Urieli, A.; Hermann, E.; Raynal, C. Natural language processing for aviation safety reports: From classification to interactive analysis. Comput. Ind. 2016, 78, 80–95. [Google Scholar]
  34. Nanyonga, A.; Wasswa, H.; Joiner, K.; Turhan, U.; Wild, G. Explainable Supervised Learning Models for Aviation Predictions in Australia. Aerospace 2025, 12, 223. [Google Scholar] [CrossRef]
  35. Perboli, G.; Gajetti, M.; Fedorov, S.; Giudice, S.L. Natural Language Processing for the identification of Human factors in aviation accidents causes: An application to the SHEL methodology. Expert Syst. Appl. 2021, 186, 115694. [Google Scholar]
  36. Shen, N.; Gao, K.; Niu, T.; Li, Q.; Peng, R. Aero-Engine Life Prediction Based on ARIMA and LSTM with Multi-head Attention Mechanism. In Analytics Modeling in Reliability and Machine Learning and Its Applications; Springer: Berlin/Heidelberg, Germany, 2025; pp. 77–90. [Google Scholar]
  37. Li, Z.; Luo, S.; Liu, H.; Tang, C.; Miao, J. TTSNet: Transformer–Temporal Convolutional Network–Self-Attention with Feature Fusion for Prediction of Remaining Useful Life of Aircraft Engines. Sensors 2025, 25, 432. [Google Scholar] [CrossRef] [PubMed]
  38. Liu, X.; Tang, Z.; Wei, J. Multi-Layer Perceptron Model Integrating Multi-Head Attention and Gating Mechanism for Global Navigation Satellite System Positioning Error Estimation. Remote Sens. 2025, 17, 301. [Google Scholar] [CrossRef]
  39. Cai, H.; Shao, X.; Zhou, P.; Li, H. Multi-Label Classification of Complaint Texts: Civil Aviation Service Quality Case Study. Electronics 2025, 14, 434. [Google Scholar] [CrossRef]
  40. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  41. Wang, Q.; Li, B.; Xiao, T.; Zhu, J.; Li, C.; Wong, D.F.; Chao, L.S. Learning deep transformer models for machine translation. arXiv 2019, arXiv:1906.01787. [Google Scholar]
  42. Yao, S.; Wan, X. Multimodal transformer for multimodal machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 4346–4350. [Google Scholar]
  43. Raganato, A.; Tiedemann, J. An analysis of encoder representations in transformer-based machine translation. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, 1 November 2018; pp. 287–297. [Google Scholar]
  44. Khandelwal, U.; Clark, K.; Jurafsky, D.; Kaiser, L. Sample efficient text summarization using a single pre-trained transformer. arXiv 2019, arXiv:1905.08836. [Google Scholar]
  45. Liu, Y.; Lapata, M. Text summarization with pretrained encoders. arXiv 2019, arXiv:1908.08345. [Google Scholar]
  46. Sheang, K.C.; Saggion, H. Controllable sentence simplification with a unified text-to-text transfer transformer. In Proceedings of the 14th International Conference on Natural Language Generation (INLG), Aberdeen, UK, 20–24 September 2021; ACL (Association for Computational Linguistics): Aberdeen, UK, 2021. [Google Scholar]
  47. Alissa, S.; Wald, M. Text simplification using transformer and BERT. Comput. Mater. Contin. 2023, 75, 3479–3495. [Google Scholar]
  48. Alikaniotis, D.; Raheja, V. The unreasonable effectiveness of transformer language models in grammatical error correction. arXiv 2019, arXiv:1906.01733. [Google Scholar]
  49. Hossain, N.; Bijoy, M.H.; Islam, S.; Shatabda, S. Panini: A transformer-based grammatical error correction method for Bangla. Neural Comput. Appl. 2024, 36, 3463–3477. [Google Scholar]
  50. Shao, T.; Guo, Y.; Chen, H.; Hao, Z. Transformer-based neural network for answer selection in question answering. IEEE Access 2019, 7, 26146–26156. [Google Scholar]
  51. Cordonnier, J.B.; Loukas, A.; Jaggi, M. Multi-head attention: Collaborate instead of concatenate. arXiv 2020, arXiv:2006.16362. [Google Scholar]
  52. Zhang, X.; Shen, Y.; Huang, Z.; Zhou, J.; Rong, W.; Xiong, Z. Mixture of attention heads: Selecting attention heads per token. arXiv 2022, arXiv:2210.05144. [Google Scholar]
  53. Michel, P.; Levy, O.; Neubig, G. Are sixteen heads really better than one? In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  54. Liu, B.; Zheng, Q.; Wei, H.; Zhao, J.; Yu, H.; Zhou, Y.; Chao, F.; Ji, R. Deep hybrid transformer network for robust modulation classification in wireless communications. Knowl.-Based Syst. 2024, 300, 112191. [Google Scholar]
  55. Kedia, A.; Zaidi, M.A.; Khyalia, S.; Jung, J.; Goka, H.; Lee, H. Transformers get stable: An end-to-end signal propagation theory for language models. arXiv 2024, arXiv:2403.09635. [Google Scholar]
  56. Muraina, I. Ideal dataset splitting ratios in machine learning algorithms: General concerns for data scientists and data analysts. In Proceedings of the 7th International Mardin Artuklu Scientific Research Conference, Mardin, Turkey, 10–12 December 2021; pp. 496–504. [Google Scholar]
  57. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar]
  58. Park, C.; Yang, Y.; Lee, C.; Lim, H. Comparison of the evaluation metrics for neural grammatical error correction with overcorrection. IEEE Access 2020, 8, 106264–106272. [Google Scholar] [CrossRef]
  59. Min, J.H.; Jung, S.J.; Jung, S.H.; Yang, S.; Cho, J.S.; Kim, S.H. Grammatical error correction models for Korean language via pre-trained denoising. Quant. Bio-Sci. 2020, 39, 17–24. [Google Scholar]
  60. Yadav, A.K.; Singh, A.; Dhiman, M.; Vineet; Kaundal, R.; Verma, A.; Yadav, D. Extractive text summarization using deep learning approach. Int. J. Inf. Technol. 2022, 14, 2407–2415. [Google Scholar]
  61. Manojkumar, V.; Mathi, S.; Gao, X.Z. An experimental investigation on unsupervised text summarization for customer reviews. Procedia Comput. Sci. 2023, 218, 1692–1701. [Google Scholar]
  62. Van den Bercken, L.; Sips, R.J.; Lofi, C. Evaluating neural text simplification in the medical domain. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 3286–3292. [Google Scholar]
  63. Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
  64. Kryściński, W.; Paulus, R.; Xiong, C.; Socher, R. Improving abstraction in text summarization. arXiv 2018, arXiv:1808.07913. [Google Scholar]
  65. Jain, M.; Saha, S.; Bhattacharyya, P.; Chinnadurai, G.; Vatsa, M.K. Natural Answer Generation: From Factoid Answer to Full-length Answer using Grammar Correction. arXiv 2021, arXiv:2112.03849. [Google Scholar]
  66. Ng, J.P.; Abrecht, V. Better summarization evaluation with word embeddings for ROUGE. arXiv 2015, arXiv:1508.06034. [Google Scholar]
  67. Dorr, B.; Monz, C.; Schwartz, R.; Zajic, D. A methodology for extrinsic evaluation of text summarization: Does ROUGE correlate? In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 1–8. [Google Scholar]
  68. Barbella, M.; Tortora, G. Rouge Metric Evaluation for Text Summarization Techniques. 2022. Available online: https://ssrn.com/abstract=4120317 (accessed on 26 May 2022).
  69. Huang, J.; Jiang, Y. A DAE-based Approach for Improving the Grammaticality of Summaries. In Proceedings of the 2021 International Conference on Computers and Automation (CompAuto), Paris, France, 7–9 September 2021; pp. 50–53. [Google Scholar]
  70. Banerjee, S.; Kumar, N.; Madhavan, C.V. Text Simplification for Enhanced Readability. In Proceedings of the KDIR/KMIS, Vilamoura, Portugal, 19–22 September 2013; pp. 202–207. [Google Scholar]
  71. Zaman, F.; Shardlow, M.; Hassan, S.U.; Aljohani, N.R.; Nawaz, R. HTSS: A novel hybrid text summarisation and simplification architecture. Inf. Process. Manag. 2020, 57, 102351. [Google Scholar]
  72. Phatak, A.; Savage, D.W.; Ohle, R.; Smith, J.; Mago, V. Medical text simplification using reinforcement learning (teslea): Deep learning–based text simplification approach. JMIR Med. Inform. 2022, 10, e38095. [Google Scholar]
  73. Landauer, T.K.; Foltz, P.W.; Laham, D. An introduction to latent semantic analysis. Discourse Process. 1998, 25, 259–284. [Google Scholar]
  74. Landauer, T.K.; Dumais, S.T. A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychol. Rev. 1997, 104, 211–240. [Google Scholar] [CrossRef]
  75. Steinberger, J.; Jezek, K. Using latent semantic analysis in text summarization and summary evaluation. Proc. ISIM 2004, 4, 8. [Google Scholar]
  76. Ozsoy, M.G.; Alpaslan, F.N.; Cicekli, I. Text summarization using latent semantic analysis. J. Inf. Sci. 2011, 37, 405–417. [Google Scholar]
  77. Gong, Y.; Liu, X. Generic text summarization using relevance measure and latent semantic analysis. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, LA, USA, 9–13 September 2001; pp. 19–25. [Google Scholar]
  78. Hao, S.; Xu, Y.; Ke, D.; Su, K.; Peng, H. SCESS: A WFSA-based automated simplified chinese essay scoring system with incremental latent semantic analysis. Nat. Lang. Eng. 2016, 22, 291–319. [Google Scholar] [CrossRef]
  79. Vajjala, S.; Meurers, D. Readability assessment for text simplification: From analysing documents to identifying sentential simplifications. ITL-Int. J. Appl. Linguist. 2014, 165, 194–222. [Google Scholar]
  80. Salton, G.; Buckley, C. Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 1988, 24, 513–523. [Google Scholar] [CrossRef]
  81. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. Bertscore: Evaluating text generation with bert. arXiv 2019, arXiv:1904.09675. [Google Scholar]
  82. Cer, D.; Yang, Y.; Kong, S.Y.; Hua, N.; Limtiaco, N.; John, R.S.; Constant, N.; Guajardo-Cespedes, M.; Yuan, S.; Tar, C.; et al. Universal sentence encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, 4 November 2018; pp. 169–174. [Google Scholar]
  83. Nanyonga, A.; Wild, G. Impact of Dataset Size & Data Source on Aviation Safety Incident Prediction Models with Natural Language Processing. In Proceedings of the 2023 Global Conference on Information Technologies and Communications (GCITC), Bengaluru, India, 14–16 December 2023; pp. 1–7. [Google Scholar]
  84. Darveau, K.; Hannon, D.; Foster, C. A comparison of rule-based and machine learning models for classification of human factors aviation safety event reports. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, Online, 5–9 October 2020; SAGE Publications Sage: Los Angeles, CA, USA, 2020; Volume 64, pp. 129–133. [Google Scholar]
  85. Zhao, X.; Yan, H.; Liu, Y. Hierarchical Multilabel Classification for Fine-Level Event Extraction from Aviation Accident Reports. INFORMS J. Data Sci. 2024, 4, 1–99. [Google Scholar] [CrossRef]
  86. Xiong, S.H.; Wei, X.H.; Chen, Z.S.; Zhang, H.; Pedrycz, W.; Skibniewski, M.J. Identifying causes of aviation safety events using wW2V-tCNN with data augmentation. Int. J. Gen. Syst. 2025, 1–30. [Google Scholar] [CrossRef]
  87. Hou, Z.; Wang, H.; Yue, Y.; Xiong, M.; Che, C. A novel method for cause portrait of aviation unsafe events based on hierarchical multi-task convolutional neural network. Expert Syst. Appl. 2025, 270, 126466. [Google Scholar] [CrossRef]
  88. Ni, X.; Wang, H.; Chen, L.; Lin, R. Classification of aviation incident causes using LGBM with improved cross-validation. J. Syst. Eng. Electron. 2024, 35, 396–405. [Google Scholar] [CrossRef]
  89. Xiong, M.; Hou, Z.; Wang, H.; Che, C.; Luo, R. An aviation accidents prediction method based on MTCNN and Bayesian optimization. Knowl. Inf. Syst. 2024, 66, 6079–6100. [Google Scholar] [CrossRef]
  90. Liu, H.; Hu, M.; Yang, L. A new risk level identification model for aviation safety. Eng. Appl. Artif. Intell. 2024, 136, 108901. [Google Scholar] [CrossRef]
  91. Aizawa, A. An information-theoretic perspective of tf–idf measures. Inf. Process. Manag. 2003, 39, 45–65. [Google Scholar] [CrossRef]
  92. Xiong, M.; Wang, H.; Wong, Y.D.; Hou, Z. Enhancing aviation safety and mitigating accidents: A study on aviation safety hazard identification. Adv. Eng. Inform. 2024, 62, 102732. [Google Scholar] [CrossRef]
  93. Katragadda, S.R.; Tanikonda, A.; Pandey, B.K.; Peddinti, S.R. Machine Learning-Enhanced Root Cause Analysis for Rapid Incident Management in High-Complexity Systems. J. Sci. Technol. 2022, 3, 325–345. [Google Scholar]
Figure 1. Text length distribution of the analysisNarrative field entries.
Figure 2. Text length distribution of the probableCause field entries.
Figure 3. Transformer model architecture and a breakdown of its architectural components: (a) full transformer block architecture; (b) architectural components of the encoder block; (c) architectural components of the decoder block; (d) components of the transformer output block.
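For orientation, the encoder-decoder structure sketched in Figure 3 can be expressed with PyTorch's built-in transformer module. The snippet below is a minimal sketch only: the hyperparameters (model dimension, number of heads and layers, vocabulary size) are placeholders rather than the configuration used in this study, and positional encoding is omitted for brevity.

```python
# Minimal sketch of an encoder-decoder transformer with an output block,
# loosely mirroring Figure 3. All hyperparameters are illustrative placeholders.
import torch
import torch.nn as nn

class Seq2SeqTransformer(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        # Output block: linear projection to the vocabulary (softmax applied at inference).
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, tgt_ids):
        src = self.embed(src_ids)   # analysis-narrative tokens
        tgt = self.embed(tgt_ids)   # probable-cause tokens (shifted right)
        causal = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.transformer(src, tgt, tgt_mask=causal)
        return self.out(hidden)     # logits over the vocabulary

model = Seq2SeqTransformer()
logits = model(torch.randint(0, 32000, (2, 64)), torch.randint(0, 32000, (2, 16)))
print(logits.shape)  # (batch, target length, vocab size)
```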
Figure 4. Examples of analysis narratives with the corresponding probable causes as presented in the original reports, alongside the model’s predicted probable causes.
Figure 5. BERTScore values: (a) BERT_P, (b) BERT_R, and (c) BERT_F1.
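The per-instance BERT_P, BERT_R, and BERT_F1 values summarised in Figure 5 can be reproduced with an off-the-shelf implementation. The following is a minimal sketch using the open-source bert-score package and a hypothetical reference/prediction pair; it is not the authors' exact evaluation setup, and the underlying encoder selected by the package is an assumption.

```python
# Minimal sketch (not the authors' evaluation setup): BERTScore precision,
# recall, and F1 for a hypothetical reference/prediction pair.
from bert_score import score

references = ["The pilot's failure to maintain directional control of the aircraft on landing."]
predictions = ["The pilot's failure to maintain directional control during landing."]

# lang="en" selects the package's default English model.
P, R, F1 = score(predictions, references, lang="en", verbose=False)
print(f"BERT_P={P.mean().item():.3f}  BERT_R={R.mean().item():.3f}  BERT_F1={F1.mean().item():.3f}")
```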
Figure 6. BLEU scores for the first 1000 test instances.
Figure 7. LSA-similarity scores for the first 1000 test instances.
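The LSA similarity plotted in Figure 7 can be approximated by projecting TF-IDF representations of the reference and predicted probable causes into a low-rank latent space and taking the cosine similarity. The following is a minimal sketch with scikit-learn; the sentence pairs and the number of SVD components are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch: LSA similarity via TF-IDF + truncated SVD + cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

references = [
    "The pilot's failure to maintain directional control of the aircraft on landing.",
    "The pilot's improper fuel planning, which resulted in fuel exhaustion.",
]
predictions = [
    "The pilot's failure to maintain directional control during landing.",
    "Fuel exhaustion due to the pilot's inadequate preflight fuel planning.",
]

# Fit TF-IDF and the latent space on the pooled texts, then compare each
# reference/prediction pair in the reduced space.
X = TfidfVectorizer(stop_words="english").fit_transform(references + predictions)
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

n = len(references)
for i in range(n):
    sim = cosine_similarity(Z[i:i + 1], Z[n + i:n + i + 1])[0, 0]
    print(f"pair {i + 1}: LSA similarity = {sim:.3f}")
```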
Figure 8. Impact of analysisNarrative length on the model’s BLEU score.
Figure 9. Impact of analysisNarrative length on the model’s LSA similarity score.
Figure 10. Reference model output screenshot for the discussion of the BLEU, ROUGE, and LSA similarity scores: the model’s output uses a slightly different word set from the reference probable cause.
Table 1. ROUGE results: precision, recall, and F-measure from Rouge-1, Rouge-2, and Rouge-L.

Metric     Precision (Mean / Stddev)   Recall (Mean / Stddev)   F-Measure (Mean / Stddev)
rouge-1    0.666 / 0.217               0.610 / 0.211            0.618 / 0.192
rouge-2    0.488 / 0.264               0.448 / 0.257            0.452 / 0.248
rouge-L    0.602 / 0.241               0.553 / 0.235            0.560 / 0.220
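As a point of reference, per-instance Rouge-1/2/L precision, recall, and F-measure can be obtained with an off-the-shelf implementation. The following is a minimal sketch, assuming the open-source rouge Python package and a hypothetical reference/prediction pair rather than the authors' actual evaluation script.

```python
# Minimal sketch (not the authors' evaluation script): per-instance
# Rouge-1/2/L precision, recall, and F-measure.
from rouge import Rouge

reference = "The pilot's failure to maintain directional control of the aircraft on landing."
prediction = "The pilot's failure to maintain directional control during landing."

scores = Rouge().get_scores(prediction, reference)[0]  # dict keyed by 'rouge-1', 'rouge-2', 'rouge-l'
for metric in ("rouge-1", "rouge-2", "rouge-l"):
    vals = scores[metric]
    print(f"{metric}: precision={vals['p']:.3f} recall={vals['r']:.3f} f-measure={vals['f']:.3f}")
```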
Table 2. BLEU scores for various weight vector values.

Weight Vector               BLEU Score
[0.1, 0.1, 0.1, 0.1]        8.67 × 10^-32
[0.01, 0.01, 0.01, 0.01]    7.83 × 10^-4
[0.25, 0.25, 0, 0]          0.459
[0.1, 0.1, 0, 0]            0.732
[0.01, 0.01, 0, 0]          0.969
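The sensitivity of BLEU to the choice of n-gram weight vector shown in Table 2 can be illustrated with NLTK's sentence_bleu. The snippet below is a minimal sketch using a hypothetical reference/prediction pair and a smoothing method chosen for illustration; it is not the authors' evaluation code, so the absolute values will not match the table.

```python
# Minimal sketch: how the n-gram weight vector changes sentence-level BLEU.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the pilot's failure to maintain directional control of the aircraft on landing".split()]
prediction = "the pilot's failure to maintain directional control during landing".split()

smooth = SmoothingFunction().method1  # avoids zero scores when higher-order n-grams do not match
for weights in [(0.1, 0.1, 0.1, 0.1),
                (0.01, 0.01, 0.01, 0.01),
                (0.25, 0.25, 0.0, 0.0),
                (0.1, 0.1, 0.0, 0.0),
                (0.01, 0.01, 0.0, 0.0)]:
    bleu = sentence_bleu(reference, prediction, weights=weights, smoothing_function=smooth)
    print(weights, f"{bleu:.6f}")
```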
Table 3. Comparison of NLP models for predicting probable causes in aviation incidents.

Authors                   NTSB Dataset   NLP   Transformer-Based Approach   BERT Score   ROUGE Score   BLEU Score   LSA Similarity
Nanyonga et al. [83]      Y              Y     N                            N            N             N            N
Darveau et al. [84]       N              Y     N                            N            N             N            N
Zhao et al. [85]          Y              Y     N                            N            N             N            N
Xiong et al. [86]         N              Y     N                            N            N             N            N
Hou et al. [87]           N              N     N                            N            N             N            N
Ni et al. [88]            N              N     N                            N            N             N            N
Xiong et al. [89]         Y              N     N                            N            N             N            N
Liu et al. [90]           Y              N     N                            N            N             N            N
Aizawa [91]               N              N     N                            N            N             N            N
Xiong et al. [92]         N              N     N                            N            N             N            N
Perboli et al. [35]       N              Y     N                            N            N             N            N
Katragadda et al. [93]    N              Y     N                            N            N             N            N
Dong et al. [4]           N              Y     N                            N            N             N            N
Our study                 Y              Y     Y                            Y            Y             Y            Y
Table 4. Examples where the model produced insufficient information or incorrect predictions.

Observed “probableCause”: The pilot’s inadequate compensation for the crosswind condition and failure to maintain directional control of the aircraft on landing.
Predicted “probableCause”: The pilot’s failure to maintain directional control during landing.

Observed “probableCause”: The incapacitation of the pilot during high-altitude cruise flight for undetermined reasons.
Predicted “probableCause”: The pilot’s intentional flight into terrain as a result of his impairment due to alcohol consumption.

Observed “probableCause”: The pilot’s failure to monitor the balloon’s altitude, which resulted in the balloon impacting an airport rotating beacon tower.
Predicted “probableCause”: The pilot’s failure to maintain clearance from a powerline during landing.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
