Article

Does the Choice of Topic Modeling Technique Impact the Interpretation of Aviation Incident Reports? A Methodological Assessment

1 School of Engineering and Technology, University of New South Wales, Canberra, ACT 2600, Australia
2 Capability Systems Centre, University of New South Wales, Canberra, ACT 2610, Australia
3 School of Science, University of New South Wales, Canberra, ACT 2612, Australia
* Author to whom correspondence should be addressed.
Technologies 2025, 13(5), 209; https://doi.org/10.3390/technologies13050209
Submission received: 27 March 2025 / Revised: 7 May 2025 / Accepted: 14 May 2025 / Published: 19 May 2025
(This article belongs to the Special Issue Aviation Science and Technology Applications)

Abstract

This study presents a comparative analysis of four topic modeling techniques—Latent Dirichlet Allocation (LDA), BERTopic (built on Bidirectional Encoder Representations from Transformers, BERT), Probabilistic Latent Semantic Analysis (pLSA), and Non-negative Matrix Factorization (NMF)—applied to aviation safety reports from the ATSB dataset spanning 2013–2023. The evaluation focuses on coherence, interpretability, generalization, computational efficiency, and scalability. The results indicate that NMF achieves the highest coherence score (0.7987), demonstrating its effectiveness in extracting well-defined topics from structured narratives. pLSA performs competitively (coherence: 0.7634) but lacks the scalability of NMF. LDA and BERTopic, while effective in generalization (perplexity: −6.471 and −4.638, respectively), struggle with coherence due to their probabilistic nature and reliance on contextual embeddings, respectively. A preliminary expert review by two aviation safety specialists found that topics generated by the NMF model were interpretable and aligned well with domain knowledge, reinforcing its suitability for aviation safety analysis. Future research should explore hybrid modeling approaches and real-time applications to further enhance aviation safety analysis. The study contributes to advancing automated safety monitoring in the aviation industry by identifying the most appropriate topic modeling techniques.

1. Introduction

Aviation safety is a critical concern for the global air transport industry, with incident and accident reports playing a crucial role in understanding and mitigating risks [1]. The Australian Transport Safety Bureau (ATSB) is an Australian organization dedicated to aviation safety investigations, generating a vast repository of textual incident reports that contain valuable insights into operational hazards, human factors, and system failures [2,3]. Analyzing these reports manually can be time-consuming, error-prone, and subject to interpretational biases [4]. As textual datasets continue to expand, natural language processing (NLP) techniques, particularly topic modeling, have emerged as powerful tools for extracting latent patterns from unstructured text to facilitate data-driven decision-making [5].
Topic modeling techniques uncover hidden structures in large corpora by grouping semantically-related words into coherent topics. Various algorithms have been developed to achieve this, with the most widely used approaches being Latent Dirichlet Allocation (LDA) [5], Non-Negative Matrix Factorization (NMF) [6], and Probabilistic Latent Semantic Analysis (pLSA) [7]. More recently, Bidirectional Encoder Representations from Transformers (BERT) has introduced transformer-based topic modeling, leveraging deep learning to enhance topic coherence and contextual understanding [8]. Despite the increasing adoption of these models, there remains a lack of comparative studies focused on aviation safety narratives, especially on understanding how the model choice affects the reliability and interpretability of extracted safety themes. Furthermore, limited research exists on evaluating the trade-offs between traditional probabilistic methods and modern transformer-based approaches within domain-specific corpora. This research gap underscores the academic need for a systematic domain-focused investigation into topic modelling techniques for aviation safety. The motivation for this study stems from the growing volume of unstructured textual reports in aviation safety databases like ATSB, where the narratives contain rich underutilized insights that, if properly modelled, could significantly enhance situational awareness, safety interventions, and risk assessment strategies.
The primary research question addressed in this work is:
Does the choice of topic modeling technique influence the interpretability and reliability of extracted aviation safety themes?
To answer this research question, we conduct a comparative analysis of four widely used topic modeling techniques: (1) LDA, a probabilistic generative model that assumes documents are mixtures of topics, with each topic represented by a distribution over words; (2) NMF, a matrix factorization technique that decomposes term-document matrices [9] into lower-rank representations, producing more interpretable topics [10], and which has been employed in various studies as an alternative to LDA for extracting topics from text data; (3) pLSA, a statistical model similar to LDA but without a Dirichlet prior, making it more susceptible to overfitting; and (4) BERTopic, a transformer-based model that uses contextual word embeddings and clustering to generate high-quality topics dynamically. Each of these techniques is described in greater depth in Section 3.2.
By applying these models to ATSB incident reports, we aim to evaluate: (a) Topic coherence and relevance—the quality and interpretability of topics generated by each model, (b) Performance metrics—assessing topic coherence scores, i.e., Coherence Score (C_v) to quantify model effectiveness, and (c) Practical implications for aviation safety analysis—identifying which models provide the most actionable insights for safety investigators, policymakers, and regulatory bodies.
The findings of this research are expected to contribute to both NLP and aviation safety analytics in several ways. First, by systematically evaluating topic modeling techniques on a domain-specific dataset, this study bridges the gap between theoretical advancements in NLP and practical applications in aviation risk assessment [11,12,13]. Second, the results will guide aviation safety analysts in selecting the most appropriate topic modeling approach for text-driven risk assessment. Third, our study highlights the advantages and limitations of traditional probabilistic models versus modern transformer-based models, offering insights into how deep learning enhances the interpretability of aviation narratives.
The paper is structured as follows: Section 2 presents a comprehensive literature review on topic modeling applications in aviation safety and accident investigations. Section 3 outlines the methodology, including dataset preprocessing, model selection, and evaluation metrics. Section 4 presents the results, comparing the topic distributions, coherence scores, and interpretability of each model. Section 5 provides a detailed discussion of the findings, including the strengths and limitations of each method. Finally, Section 6 concludes with key insights and suggestions for future research directions.

2. Related Work

The application of topic modeling techniques in aviation safety research has gained significant attention, given the availability of large-scale incident reports coupled with the growing need for efficient data-driven risk assessment. This section reviews the existing literature on the role of NLP and topic modeling in aviation safety analysis, comparative studies on topic modeling techniques, and recent advancements in deep learning-based approaches to topic modeling.
The use of NLP in aviation safety has expanded considerably in recent years, particularly in automating the analysis of safety reports to extract actionable insights [14]. As previously noted, traditional manual approaches to analyzing aviation incident narratives can be time-consuming, inconsistent, and susceptible to human bias [4,15,16]. To address these limitations, topic modeling techniques have been employed to uncover latent patterns in unstructured textual data, providing a more systematic, scalable, and objective means of identifying key risk factors in aviation operations [17]. A summary of selected studies that have applied topic modeling techniques in aviation safety research is presented in Table 1.
A substantial body of research has investigated the effectiveness of topic modeling techniques in aviation safety. For example, Luo and Shi, in 2019, identify topics related to accident causation and contributing factors within aviation accident reports, effectively showcasing the extraction of meaningful topics [13]. Early studies from 2016 to 2021 focused on LDA as a method for structuring aviation safety narratives, enabling researchers to categorize incident reports into meaningful topics [18,19]. For example, Robinson, in 2019, harnessed LDA to extract topics from aviation safety reports and identify emerging safety concerns [20]. Similarly, Ahadh et al. delved into topic modelling to categorize narratives in aviation accident reports, furnishing a structured representation of accident data [19]. Kuhn [21] made use of text-mining techniques to analyze aviation accident reports, with a focus on identifying significant terms and phrases, thereby establishing the groundwork for the application of computational methods in accident report analysis. Zhong et al. introduced a framework in 2020 that seamlessly melded text mining and machine learning, enabling the automated classification of accident reports into categories based on contributing factors. Their approach served as a testament to the potential for automating key aspects of accident analysis, consequently enhancing efficiency and consistency [22].
Several more recent comparative studies have evaluated different topic modeling techniques across various domains, including social media, tweets, healthcare, and finance. For example, a study by Krishnan [23] conducted an extensive analysis of multiple topic modeling methods, including LSA, LDA, NMF, and others, applied to customer reviews. This study also evaluated BERTopic, which is a topic modeling technique developed around 2022 [8] that leverages BERT embeddings and class-based Term Frequency-Inverse Document Frequency (TF-IDF) to create dense clusters, allowing for easily interpretable topics whilst keeping important words in the topic descriptions. Their findings highlighted the relative strengths of each approach in identifying key themes. Further recent research integrated BERT-based embeddings with traditional topic modeling methods, such as LSA, LDA, and the Hierarchical Dirichlet Process (HDP), proposing a hybrid model called HDP-BERT, which demonstrated improved coherence scores [24]. Similarly, another recent study examined BERTopic, NMF, and LDA in social media content analysis, utilizing coherence measures such as C_V and U_MASS [25] to assess model effectiveness, which underscored BERTopic’s superior performance in extracting meaningful topics from unstructured text [26].
In 2024, Kaur and Wallace conducted a qualitative evaluation of topic modeling techniques by engaging researchers in the social sciences, revealing that BERTopic was preferred for its ability to generate coherent and interpretable topics, particularly in online community discussions [27]. Similarly, a 2023 study applied LDA, NMF, and BERTopic to academic text analysis, categorizing scholarly articles into key themes such as sustainability, healthcare, and engineering education, thereby reinforcing the robustness of topic modeling for research applications [28].
Abuzayed and Al-Khalifa [29] assessed the performance of BERTopic using Arabic pre-trained language models, comparing it against LDA and NMF. Their findings demonstrated that transformer-based embeddings could enhance topic coherence across languages, a result that aligns with another study, which evaluated LDA, NMF, and BERTopic on Serbian literary texts [30]. In these studies, NMF yielded the best coherence results; however, BERTopic excelled in topic diversity, emphasizing the importance of dataset characteristics in model selection.
In 2022, Rose et al. [31] presented a case on the application of structural topic modeling, specifically LDA, to aviation accident reports. Their research demonstrated the feasibility of utilizing topic modeling to uncover latent themes within accident narratives. Moreover, it highlighted the potential for automating certain aspects of the analysis process, thereby increasing efficiency. Their research emphasized the critical importance of selecting an appropriate topic modelling technique tailored to specific domains and datasets, accentuating the need for a nuanced approach to topic modelling in aviation safety analysis [31]. Our previous work has extensively explored topic modeling in aviation safety using different datasets and methodologies [11,32,33]. For instance, one study analyzed aviation safety narratives from the Socrata dataset using LDA, NMF, and pLSA, identifying key themes such as pilot error and mechanical failures [34]. Another study applied multiple topic modeling and clustering techniques to NTSB aviation incident narratives, comparing LDA, NMF, pLSA, LSA, and K-means clustering, demonstrating that LDA had the highest coherence while clustering techniques provided additional insights into incident characteristics [32]. More recently, a comparative analysis of traditional topic modeling techniques on the ATSB dataset was conducted, providing an initial exploration of topic modeling techniques in the Australian aviation context [34]. However, no prior research has applied BERTopic to this dataset. This study builds upon our previous work by incorporating BERTopic and comparing its performance against traditional topic modeling approaches. BERTopic [8] has gained attention for its ability to capture contextual relationships using transformer-based embeddings (e.g., BERT, RoBERTa) [35,36,37]. Unlike traditional approaches, BERTopic clusters semantically similar embeddings before applying topic reduction techniques, resulting in more meaningful and interpretable topics [8]. Early research on BERTopic has demonstrated its superiority in handling short-text datasets [29]. This gap in the literature motivates our study.
Table 1. Summary of a subset of the reviewed literature on topic modelling in aviation accident research.
| Authors (Year) | Topic Modeling Method(s) | Data Source(s) | Application Domain | Key Findings/Contributions |
|---|---|---|---|---|
| Xing et al. [38] | Structural Topic Model (STM), TF-IDF, Word Co-occurrence Network (WCN) | National Transportation Safety Board (NTSB) accident reports | Aviation safety analysis | STM provided granular topic partitioning; WCN identified key risk factors like "inspection of equipment" and "take off" |
| Nanyonga et al. [32] | LDA, NMF, LSA, pLSA, K-means clustering | NTSB aviation incident narratives | Comparative NLP analysis in aviation safety | LDA achieved highest coherence; clustering revealed thematic commonalities in incident narratives |
| Liu et al. [39] | LDA, BERT-based Semantic Network (BSN) | Air Traffic Control (ATC) incident reports | Air traffic control risk analysis | Identified 17 risk topics; human factors and operational procedures were prominent; BSN highlighted inter-topic correlations |
| Kuhn, K.D. [21] | STM | ASRS reports | Aviation incident trend analysis | STM uncovered issues like fuel pump and landing gear problems; highlighted specific approach path concerns at SFO |
| Rose et al. [31] | STM | ASRS and NTSB reports | Aviation safety data analysis | STM effectively identified themes within technical datasets; performance improved with specific corpora |
| Xu et al. [40] | Text classification, BERTopic | Chinese civil aviation safety oversight reports | Safety oversight automation | Proposed method improved classification accuracy; reduced manual workload in analyzing oversight reports |
| Luo and Shi [13] | lda2vec | ASRS reports | Aviation safety report analysis | Unsupervised approach identified latent topics with higher interpretability; reduced reliance on manual labeling |
| Robinson, S.D. [20] | LDA | ASRS reports | Temporal trend analysis in aviation safety | Identified temporal trends in safety concerns; demonstrated effectiveness of NLP in sensemaking |
| Ahadh et al. [19] | Semi-supervised keyword extraction, topic modeling | ASRS and Pipeline and Hazardous Materials Safety Administration (PHMSA) reports | Cross-domain accident analysis | Achieved 80% classification accuracy; method effective with limited manual intervention |
While previous studies have applied topic modeling to aviation safety reports, there is no comprehensive comparative analysis of how different models impact the interpretability and reliability of extracted aviation safety themes. By integrating deep learning-based embeddings, BERTopic offers unique advantages in aviation safety research: (a) capturing complex semantic structures in aviation narratives (e.g., distinguishing between pilot errors and ATC miscommunications), (b) handling domain-specific terminology more effectively than conventional bag-of-words-based models, and (c) dynamically adjusting topic granularity, allowing analysts to zoom in on specific safety concerns. Despite these advantages, BERTopic also presents challenges, including higher computational costs, the need for pre-trained language models, and difficulty in fine-tuning for domain-specific datasets. These trade-offs need to be evaluated carefully when applying BERTopic to aviation safety narratives.
In summary, by addressing these gaps, this research aims to enhance the methodological understanding of topic modeling applications in aviation safety, ultimately contributing to improved data-driven risk assessment and accident prevention strategies in the aviation industry. This study selects four topic modeling techniques based on their foundational differences and representation of both classical and modern paradigms in topic modeling. LDA, NMF, and pLSA have been widely adopted as classical approaches due to their probabilistic and matrix factorization-based frameworks. In contrast, BERTopic represents a contemporary deep learning-based method that leverages transformer-based contextual embeddings to generate coherent topics. Including these models gives a comprehensive evaluation spanning traditional statistical techniques and cutting-edge neural approaches.

3. Materials and Methods

For this study, the dataset was obtained directly from ATSB investigation authorities, covering 10 years (2013–2023) and 53,275 records. The dataset contained structured and unstructured textual information, including summaries of aviation safety occurrences and classifications of injury levels. The focus was on text narratives describing the incidents and injury severity levels. Previous studies have relied on publicly available ATSB summaries, whereas this research utilized officially sourced records, ensuring a more comprehensive and nuanced dataset. The methodological architecture adopted in this study to facilitate topic extraction is illustrated in Figure 1, which summarizes the workflow from data acquisition to model evaluation.

3.1. Data Collection and Preprocessing

Given the unstructured nature of aviation safety reports, data cleaning and preprocessing were necessary to ensure accurate and meaningful topic modeling results. The raw text data often contained redundancies, inconsistencies, and non-informative elements, such as special characters, numerical values, and excessive stopwords, which could distort topic extraction. The preprocessing workflow included text normalization techniques, such as lowercasing, tokenization, stopword removal, and lemmatization, to standardize and refine the textual content [41]. Additionally, duplicate records and incomplete entries were identified and removed to maintain dataset integrity. To ensure consistency across topic modeling techniques, the same preprocessed dataset was used for all models. This systematic approach to data preparation allowed for a fair and robust comparative analysis. To enhance understanding, Table 2 presents a selection of anonymized sample records extracted from the ATSB database. These records include key attributes such as occurrence reference, date, location, phase of flight, injury level, damage level, and summary descriptions, providing insight into the structure and richness of the data.
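To make these steps concrete, the following is a minimal sketch of the preprocessing pipeline described above, using NLTK stopwords and SpaCy lemmatization. The input file name and the `Summary` column name are assumptions for illustration; the actual ATSB export fields may differ.

```python
# Minimal preprocessing sketch (assumed file/column names; not the exact ATSB schema).
import re
import pandas as pd
import spacy
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)                  # ensure the stopword list is available
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
stop_words = set(stopwords.words("english"))

def preprocess(text):
    """Lowercase, strip non-alphabetic characters, lemmatize, and drop stopwords."""
    text = re.sub(r"[^a-z\s]", " ", str(text).lower())
    return [tok.lemma_ for tok in nlp(text)
            if tok.lemma_ not in stop_words and len(tok.lemma_) > 2]

df = pd.read_csv("atsb_occurrences.csv")                # hypothetical file name
df = df.dropna(subset=["Summary"]).drop_duplicates(subset=["Summary"])
tokens = df["Summary"].apply(preprocess).tolist()       # shared input for all four models
```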

3.2. Topic Modeling Techniques

Following data preprocessing, the cleaned textual data were transformed into numerical representations suitable for topic modeling. This transformation was performed using Term Frequency-Inverse Document Frequency (TF-IDF) and embeddings from pre-trained transformer models as described by Devlin [35]. The study employed four topic modeling techniques:
1. LDA
LDA [5] is a generative probabilistic model used for topic modelling in large text corpora. It operates as a three-level hierarchical Bayesian model, where each document is represented as a mixture of topics, and each topic is a distribution over words. Unlike traditional clustering methods, LDA assumes that documents share multiple topics to varying extents, making it a suitable approach for uncovering hidden thematic structures in aviation safety narratives. Since the number of topics is not inherently predefined, hyperparameter tuning was performed to determine the optimal number of topics (K) as well as the α (alpha) and β (beta) parameters, which control topic sparsity and word distribution, respectively.
A grid search approach was used to find the most coherent topic distribution, with K ranging from 2 to 15 in increments of 1, following the approach described in [42]. For each iteration, coherence scores, measuring the semantic interpretability of extracted topics, were computed. The highest coherence score of 0.43 was achieved at K = 14, with α = 0.91 and β = 0.91 for a symmetric topic distribution. After fitting the LDA model, topic visualization was conducted using the Python (v3.8.10) library pyLDAvis (v3.3.1), which generates an inter-topic distance map to explore the relationships between extracted topics, as shown in Figure 2. These maps allowed for better interpretive analysis of aviation safety themes, highlighting prevalent risks and operational issues.
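A sketch of this grid search with Gensim is shown below, assuming `tokens` is the list of preprocessed token lists from Section 3.1. For brevity it sweeps only K and fixes α and β (Gensim's `eta`) at the reported optima of 0.91; the full search described above also varied those hyperparameters.

```python
# Hedged sketch of the coherence-driven grid search over K = 2..15 for LDA.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

dictionary = Dictionary(tokens)
corpus = [dictionary.doc2bow(doc) for doc in tokens]

best_k, best_cv, best_model = None, -1.0, None
for k in range(2, 16):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   alpha=0.91, eta=0.91, passes=10, random_state=42)
    cv = CoherenceModel(model=lda, texts=tokens, dictionary=dictionary,
                        coherence="c_v").get_coherence()
    if cv > best_cv:
        best_k, best_cv, best_model = k, cv, lda

print(f"Best K = {best_k}, C_v = {best_cv:.3f}")        # reported optimum: K = 14, C_v ≈ 0.43
```

The fitted model can then be passed to pyLDAvis (e.g., `pyLDAvis.gensim_models.prepare(best_model, corpus, dictionary)`) to produce an inter-topic distance map of the kind shown in Figure 2.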
2. NMF
NMF [43] is a linear-algebraic decompositional technique that factorizes a term-document matrix (A) into two non-negative matrices: a terms-topics matrix (W) and a topics-documents matrix (H), as shown in Figure 3. This factorization represents A as the product of W and H, where W captures the relationships between terms and topics and H captures the relationships between topics and documents. Unlike LDA, which is probabilistic, NMF relies on matrix decomposition and is particularly effective for sparse and high-dimensional text data. The key advantage of NMF is its deterministic nature, which ensures reproducibility and interpretability when analyzing aviation safety narratives. For this study, TF-IDF transformation was applied to the dataset before implementing NMF. The optimal number of topics was determined by empirically evaluating coherence scores across a range of K-values during hyperparameter tuning, following prior research [44,45]. After assessing the trade-off between coherence, interpretability, and model stability, K = 10 was selected as the most suitable number of topics. This configuration allowed the aviation safety dataset to be meaningfully categorized into ten dominant themes that were subsequently analyzed to uncover critical safety concerns and recurring operational risks in aviation incidents.
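The sketch below illustrates the TF-IDF plus NMF configuration with K = 10 using scikit-learn; the vectorizer pruning thresholds are illustrative assumptions. Note that scikit-learn factorizes a document-term matrix, so the roles of W and H are transposed relative to the term-document formulation in Figure 3.

```python
# NMF sketch with scikit-learn; K = 10 per the tuning described above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

texts = [" ".join(doc) for doc in tokens]               # rejoin preprocessed tokens
vectorizer = TfidfVectorizer(max_df=0.95, min_df=5)     # illustrative pruning thresholds
A = vectorizer.fit_transform(texts)                     # document-term TF-IDF matrix

nmf = NMF(n_components=10, init="nndsvd", max_iter=400, random_state=42)
W = nmf.fit_transform(A)                                # document-topic weights
H = nmf.components_                                     # topic-term weights

terms = vectorizer.get_feature_names_out()
for t, row in enumerate(H):
    top_terms = [terms[i] for i in row.argsort()[-10:][::-1]]
    print(f"Topic {t}: {', '.join(top_terms)}")
```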
3. pLSA
pLSA [7] is an alternative probabilistic model for topic extraction that was implemented using the Gensim library, a widely used Python package for topic modeling [46]. The process began with constructing a document-term matrix, where each row represented an aviation safety report, and each column corresponded to a unique term. This matrix was stored in a Gensim corpus object, with the vocabulary defined in a dictionary. The model was then trained using the Expectation–Maximization (EM) algorithm. Key parameters included the number of topics (num_topics), the number of EM iterations (iterations), and the model’s convergence threshold (minimum_probability). The output consisted of topic-word distributions and document-topic distributions, which provided insight into dominant themes across the safety reports. A key distinction between pLSA and LDA is that pLSA does not assume a Dirichlet prior over topic distributions, potentially leading to a more flexible topic allocation. However, unlike LDA, pLSA does not generalize well to unseen documents, making it more suitable for retrospective analysis rather than predictive modeling. The extracted topics provided an in-depth understanding of key risk factors in aviation safety incidents, reinforcing insights derived from other topic modeling techniques.
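Because Gensim does not ship a dedicated pLSA class, the sketch below illustrates the underlying Expectation–Maximization updates directly in NumPy on a small dense document-term count matrix. It is a didactic approximation of the workflow described above rather than the exact implementation used; the full ATSB corpus would require a sparse or online formulation.

```python
# pLSA via EM on a dense document-term count matrix X (didactic sketch only).
import numpy as np

def plsa(X, num_topics=10, iterations=50, seed=42):
    rng = np.random.default_rng(seed)
    n_docs, n_terms = X.shape
    p_z_d = rng.random((n_docs, num_topics)); p_z_d /= p_z_d.sum(1, keepdims=True)
    p_w_z = rng.random((num_topics, n_terms)); p_w_z /= p_w_z.sum(1, keepdims=True)
    for _ in range(iterations):
        # E-step: posterior P(z | d, w) for every document-term pair
        post = p_z_d[:, :, None] * p_w_z[None, :, :]            # shape (docs, topics, terms)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate topic-word and document-topic distributions
        weighted = X[:, None, :] * post                          # n(d, w) * P(z | d, w)
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z                                          # document-topic, topic-word
```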
4. BERTopic
BERTopic is a transformer-based approach that leverages pre-trained language models (such as BERT or RoBERTa) to generate high-dimensional semantic embeddings of textual data [8]. These embeddings capture contextual meaning, improving the accuracy of topic extraction compared to conventional bag-of-words models. To cluster similar narratives, the Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) algorithm was applied to the embeddings, following recent research [47]. This clustering allowed for the automatic determination of the number of topics, reducing the reliance on manual parameter tuning. Unlike LDA and NMF, which require a predefined K, BERTopic dynamically identifies topic clusters based on the semantic structure of the dataset. Dynamic topic representation was applied to enhance interpretability, where the most relevant words within each cluster were identified to create coherent topic labels. Figure 4 shows that once the initial set of topics is identified, an automated topic reduction process can then be applied. In the inter-topic distance map, each circle represents a topic, where the size reflects the frequency of topic occurrence, and the color helps differentiate between clusters. The D1 and D2 axes represent the two dimensions obtained from UMAP, used for visualizing topic similarity based on their embeddings. This approach enabled the detection of nuanced patterns in the aviation incident narratives.
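A minimal BERTopic configuration matching this description is sketched below; the sentence-transformer model name and the UMAP/HDBSCAN settings are illustrative assumptions, and the raw (unprocessed) narratives are passed in because BERTopic computes its own contextual embeddings.

```python
# BERTopic sketch with UMAP reduction, HDBSCAN clustering, and automatic topic reduction.
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN

umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine", random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=50, metric="euclidean",
                        cluster_selection_method="eom", prediction_data=True)

topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2",    # assumed embedding backbone
                       umap_model=umap_model,
                       hdbscan_model=hdbscan_model,
                       nr_topics="auto",                       # automatic topic reduction
                       calculate_probabilities=True)

raw_narratives = df["Summary"].tolist()                        # unprocessed incident summaries
topics, probs = topic_model.fit_transform(raw_narratives)

print(topic_model.get_topic_info().head(10))                   # topic sizes and labels
fig = topic_model.visualize_topics()                           # inter-topic distance map (cf. Figure 4)
```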

3.3. Model Evaluation

The effectiveness of the topic modeling approaches was assessed through both quantitative metrics and qualitative validation, ensuring the extracted topics were mathematically sound and contextually relevant. Quantitative evaluation involved four key metrics, as shown in Figure 1. The coherence score measures the interpretability of topics by analyzing word co-occurrence, where a higher score indicates more semantically meaningful topics. Perplexity, applied to probabilistic models like LDA and pLSA, assesses how well the model predicted unseen data, with lower values indicating better generalization. Topic diversity evaluates the uniqueness of discovered topics by quantifying the overlap of top words across topics, ensuring the extracted topics are distinct and informative. Additionally, reconstruction error, mainly used in NMF, measures how closely the factorized matrices reproduce the original term-document matrix, where a lower error implies a better topic representation. In addition to these numerical assessments, a comprehensive qualitative evaluation was conducted to assess the practical relevance and thematic accuracy of the identified topics. Two domain experts in aviation safety reviewed 10% of the extracted topics to determine their relevance to actual aviation incidents, ensuring alignment with known patterns in safety reporting. To further improve interpretability, the consistency of topic labeling was assessed, verifying whether the manually assigned topic names accurately reflected their thematic content. This iterative review process helped filter out ambiguous topics, refine the topic sets, and enhance the reliability of insights derived from the models.
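The quantitative metrics can be computed as sketched below, reusing the fitted models and the `tokens`/`dictionary` objects from the previous sketches. The topic-diversity measure shown (the proportion of unique words among all topics' top-N words) is one common formulation and is an assumption about the exact variant used.

```python
# Sketch of the quantitative evaluation: C_v coherence, topic diversity,
# and the NMF reconstruction error exposed by scikit-learn.
from gensim.models import CoherenceModel

def coherence_cv(topic_words, texts, dictionary):
    """C_v coherence for a list of topics, each given as its list of top words."""
    return CoherenceModel(topics=topic_words, texts=texts,
                          dictionary=dictionary, coherence="c_v").get_coherence()

def topic_diversity(topic_words, top_n=10):
    """Share of unique words among the top-N words of all topics (1.0 = fully distinct)."""
    all_words = [w for topic in topic_words for w in topic[:top_n]]
    return len(set(all_words)) / len(all_words)

# Example with the NMF topics extracted earlier (top-10 words per topic).
nmf_topics = [[terms[i] for i in row.argsort()[-10:][::-1]] for row in H]
print("NMF C_v:", coherence_cv(nmf_topics, tokens, dictionary))
print("NMF topic diversity:", topic_diversity(nmf_topics))
print("NMF reconstruction error:", nmf.reconstruction_err_)
```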

3.4. Implementation Framework

The experiments were conducted using Python 3.8.10 in a Jupyter Notebook environment (v6.5.1) on a Linux server with 256 CPU cores and 256 GB RAM, running Ubuntu (Linux kernel 5.4.0-169-generic). Key libraries included NLTK (v3.7) and SpaCy (v3.4.1) for text preprocessing tasks such as tokenization, stopword removal, lemmatization, and punctuation filtering, while Scikit-learn (v1.6.1) handled TF-IDF vectorization and topic modeling techniques like NMF and pLSA. Gensim (v4.3.0) was used for LDA topic modeling, and BERTopic (v0.16.4) enabled transformer-based topic extraction, leveraging UMAP (v0.5.7) for dimensionality reduction and HDBSCAN (v0.8.40) for clustering. TQDM (v4.64.1) was employed to visualize the progress of long-running tasks. Each model was trained using optimized hyperparameters identified through grid search or built-in tuning mechanisms. LDA and pLSA employed Bayesian inference and Expectation–Maximization (EM) algorithms, while NMF relied on non-negative matrix factorization for topic decomposition. BERTopic used built-in optimization strategies such as automatic topic merging, reducing the need for manual tuning. Probabilistic models (LDA and pLSA) benefited from inference-based approaches [48], whereas matrix-based methods like NMF adopted deterministic factorization techniques [49].

4. Results

This section presents the results of the topic modeling experiments conducted on the ATSB dataset. The findings are categorized into (i) topic coherence and perplexity analysis, (ii) interpretability assessment, (iii) model comparison, and (iv) case studies of extracted topics.

4.1. Coherence Score and Perplexity

The coherence scores and perplexity values for each model are summarized in Figure 5 and Table 3. The coherence score was higher for pLSA (0.763) and NMF (0.798), indicating that these models were better at capturing meaningful and coherent topics. On the other hand, pLSA, LDA, and BERTopic achieved better perplexity scores (−4.62, −6.4, and −4.63, respectively), signifying that these models were more effective at predicting the distribution of words in the dataset, and thus provided a better statistical fit. Overall, pLSA and NMF excelled in coherence, while pLSA, LDA, and BERTopic performed better in perplexity, highlighting a trade-off between interpretability and statistical fit.

4.2. Interpretability Assessment

The following figures present the topic distribution and dominant themes for each model:

4.2.1. Topic Distribution

Figure 6 illustrates the topic distribution for the pLSA model, where Topic 4 was most closely associated with the document distribution, indicating it was the dominant theme within the dataset. This skewed distribution may reflect the prevalence of a genuine dominant theme in aviation safety narratives, such as frequently occurring events like bird strikes or landing gear failures, an outcome not uncommon in domain-specific corpora where certain types of incidents are disproportionately reported. Figure 7 displays the topic distribution for the NMF model, where Topic 4 again appeared prominently, but Topics 1, 7, and 8 also showed strong associations with the document distribution, suggesting that NMF captured additional latent themes not as strongly represented in pLSA. Figure 8 presents the topic distribution for the LDA model, which exhibited a broader spread across topics, with Topic 7 being the most prominent. This more even distribution suggests that LDA was able to capture a wider range of themes. The y-axis in Figure 6, Figure 7 and Figure 8 represents the proportion of documents assigned to each topic, based on normalized topic-document distributions where values for each document sum to one.

4.2.2. Topic Wordcloud

Figure 9, Figure 10, Figure 11 and Figure 12 present the word clouds for pLSA, NMF, LDA, and BERTopic, respectively. In these word clouds, larger words indicate the most frequently chosen terms, while smaller words represent less frequent ones. These visualizations provide a clear overview of the most dominant and least dominant words within each model’s topics, further assisting in the interpretability of the extracted themes.

4.2.3. Topic Word Scores

Figure 13, Figure 14 and Figure 15 present the topic word scores for NMF, BERTopic, and pLSA, respectively, offering a deeper understanding of how each model assigns relevance to specific words within each topic. These word scores indicate the strength with which certain words contribute to the definition of each topic. By examining these scores, we can gain insight into the discriminative power of each model in identifying and distinguishing topics based on word importance. This analysis highlights the varying degrees of topic specificity and the role that individual words play in defining the overall themes of the dataset.

4.3. Model Comparison

4.3.1. Top 10 Words per Model

Table 4 provides a comparative analysis of the key terms extracted by LDA, BERTopic, pLSA, and NMF, demonstrating how each method identifies and groups words into topics. Across all models, certain themes emerge, such as Bird Strikes and Landing Issues, Approach and Airspace Separation, and Engine and Mechanical Failures, underscoring the prevalence and relevance of these topics across aviation incident reports.
LDA and pLSA, both probabilistic, are adept at revealing broad thematic groupings grounded in word co-occurrence patterns. Their outputs tend to include general aviation terminology, offering a comprehensive overview of safety concerns. BERTopic, leveraging contextualized embeddings from transformer models in conjunction with class-based TF-IDF, demonstrates an enhanced capacity for capturing nuanced and domain-specific information. This is exemplified by the appearance of rare or specific terms such as butcherbird, parrot, and Cocos, reflecting incidents that are geographically or contextually unique. In contrast, NMF excels in identifying technical and procedural vocabulary terms, such as GPWS (Ground Proximity Warning System), inspection, and return, which are indicative of its strength in highlighting structured operational patterns.
Figure 16, Figure 17, Figure 18 and Figure 19 visually complement Table 4 by displaying the highest-weighted word from each of the 10 topics generated by LDA, NMF, pLSA, and BERTopic, respectively. These bar charts highlight the dominant term in each topic, providing a concise overview of the thematic focus across models.

4.3.2. Model Strengths and Limitations

Table 5 highlights the trade-offs, as reinforced in this research, between interpretability, computational efficiency, and scalability for each topic modeling technique. LDA produced interpretable topics with distinct word clusters, but it required manual tuning and struggled with overlapping topics. BERTopic excelled in capturing contextual meaning and fine-grained topics, yet its high computational cost and reliance on transformers made it less efficient for resource-constrained applications. pLSA, known for its ability to uncover latent structures, suffered from overfitting and lacked a prior over topic distributions, making it less stable. NMF produced coherent topics with less randomness, yet it required data normalization and was sensitive to noise, limiting its flexibility. These technique strengths and weaknesses are consistent with those identified in other topic modeling studies [50]. The table illustrates that each model has unique strengths, making the selection process highly dependent on the specific dataset and analytical goals.

4.3.3. Model Evaluation

Table 6 evaluates each model on six key aspects: interpretability, granularity, scalability, topic coherence, computational cost, and flexibility. LDA offered high interpretability but required manual tuning for optimal performance, whereas BERTopic provided the best granularity and topic separation due to its dynamic reduction capabilities. pLSA, while moderately interpretable, was less scalable and computationally expensive, making it less practical for our large dataset. NMF, on the other hand, achieved a balance between interpretability and coherence, though it required preprocessed and normalized input data to perform optimally. BERTopic emerged as the most flexible and scalable model but at the cost of higher computational demand, whereas LDA and NMF offered a practical middle ground for general topic modeling needs. The choice of model by each aviation safety authority will ultimately depend on whether the focus is on interpretability, computational efficiency, or handling complex large-scale datasets.

5. Discussion

The results presented in Figure 5 and Table 3 provide a detailed comparison of four prominent topic modeling techniques, LDA, BERTopic, pLSA, and NMF, applied to the ATSB dataset. This dataset, consisting of short structured texts derived from aviation safety reports, presents a unique challenge in topic modeling, as it demands methods that can effectively capture meaningful patterns from succinct and well-structured narratives. As such, it is necessary to evaluate each model’s performance in terms of coherence, interpretability, computational efficiency, and scalability to determine the most suitable approach.
In terms of coherence, NMF appeared as the most effective model, achieving the highest coherence score of 0.7987. NMF is known to produce well-defined and distinct topics, which align well with the structured nature of the ATSB reports. In contrast, LDA yielded a significantly lower coherence score of 0.4394. This poorer coherence may be attributed to LDA’s reliance on probabilistic distributions to model topics, which, although effective in general topic modeling tasks, struggles with the more deterministic word associations typical of aviation safety texts. Notably, BERTopic achieved an even lower coherence score (0.264), indicating difficulties in generating coherent topics for this specific dataset. While transformer-based models excel at capturing contextual relationships in text, they appear less effective when applied to the structured but less contextually fluid nature of aviation safety narratives. pLSA achieved a coherence score of 0.7634, reflecting its moderate ability to uncover latent semantic structures, though it still fell short of NMF’s performance. This suggests that while pLSA is useful for smaller datasets, its performance may not match the coherence levels achieved by NMF, particularly in more structured domains like aviation safety reports. These performance differences align with findings in the literature that highlight the inherent strengths and limitations of these models in different contexts. For example, a study noted that LDA tends to struggle in domains requiring more fine-grained topic differentiation [5]. Similarly, another study emphasized that while transformer-based models such as BERTopic excel at capturing complex contextual relationships, they can struggle with generating coherent topics when the dataset is highly structured [51].
In terms of generalization, LDA and BERTopic outperformed both pLSA and NMF, achieving perplexity scores of −6.471 and −4.638, respectively. These scores suggest that both models are better equipped to generalize than pLSA (−4.6237), which demonstrated a slightly lower ability to predict new data. NMF, however, exhibited a positive perplexity score of 2.0739, indicating a propensity for overfitting. This overfitting is likely due to NMF’s deterministic nature and its sensitivity to noise in the data, as it is heavily dependent on the initial data representation [10,43].
These observations support the broader consensus in the literature that probabilistic models, such as LDA, are more robust in terms of generalization across diverse datasets, while deterministic models like NMF may trade generalizability for higher coherence within specific datasets [52,53]. Thus, while NMF may excel in coherence, its performance in generalization to new unseen data appears less reliable [43].
A closer examination of the topic structure (Figure 6) reveals a highly skewed topic distribution, with a single dominant topic accounting for the majority of document assignments. This pattern, observed in the pLSA model, may reflect the prevalence of an authentic dominant theme within aviation safety narratives, such as frequently reported occurrences like bird strikes or landing gear failures. Such thematic concentration is common in domain-specific corpora where certain incident types are disproportionately represented. However, this distribution may also point to potential shortcomings in topic separation or a tendency toward overfitting, particularly in models like pLSA that lack regularization constraints. This dual interpretation highlights the importance of domain-informed validation to discern whether the observed dominance is a genuine data characteristic or an artifact of the modelling approach. Although skewed distributions may capture real-world frequencies, they can also obscure the underlying thematic diversity of the dataset. Accordingly, careful interpretation supported by additional evaluation metrics and expert input is essential to ensure the reliability and contextual relevance of the extracted topics.
When considering the strengths and limitations of each model, LDA offers the advantage of producing interpretable topics with distinct word clusters, making it a popular choice for general topic modeling [54]. However, its reliance on manual tuning for the number of topics and its struggles with overlapping topics make it less effective in contexts requiring fine-grained differentiation, as demonstrated by its performance on the ATSB dataset. Additionally, the probabilistic nature of LDA can hinder its ability to generate coherent topics from structured datasets like those found in aviation safety reports.
BERTopic presents a more dynamic approach to topic modeling. It captures contextual nuances effectively, making it highly suitable for analyzing evolving topics or extracting fine-grained insights from large datasets [8]. However, its high computational cost, sensitivity to hyperparameter tuning, and lower coherence score on structured datasets like the ATSB corpus limit its practical utility for large-scale real-time applications.
pLSA was able to uncover latent structures within the ATSB data. However, its lack of a probabilistic prior and its susceptibility to overfitting, particularly in this complex dataset, restrict its generalization capabilities [55]. In contrast, NMF produced coherent topics with distinct word associations, making it effective for the structured ATSB reports. However, its positive perplexity score suggests limitations in its ability to generalize to unseen data, pointing to a potential issue with overfitting [56].
In evaluating the suitability of these models for the ATSB dataset, NMF emerges as the most effective technique for generating coherent and interpretable topics, making it well-suited for structured aviation safety narratives. The clear interpretable topics produced by NMF are highly relevant for document clustering tasks, where clarity and coherence are paramount. BERTopic, while offering dynamic capabilities for analyzing evolving topics, did not yield adequate coherence scores in this context. LDA, despite its strengths in general topic modeling, struggled with coherence, requiring extensive parameter tuning for optimal performance on the ATSB dataset. pLSA, although valuable for uncovering latent structures, was less effective in terms of scalability and generalization.
The feedback from the expert review indicated that topics generated by the NMF model, in particular, reflected recognizable themes within aviation safety, suggesting its practical applicability. While limited in scope, this expert input provided valuable insight into the interpretive value of the different models and highlighted the importance of domain-informed evaluation in topic modeling applications. Future work could explore hybrid approaches that combine the strengths of deep learning-based embeddings and traditional matrix factorization methods to enhance both topic coherence and interpretability, further advancing the utility of topic modeling in aviation safety and beyond.

6. Conclusions

This study provided a comparative analysis of four topic modeling techniques applied to the ATSB dataset, demonstrating that each model has distinct strengths and limitations in the context of aviation safety report analysis. The results indicate that NMF outperforms the other models in terms of coherence and interpretability, making it the most suitable for extracting meaningful topics from the structured narratives of the ATSB dataset. The clarity and distinctiveness of the topics generated by NMF are highly valuable for tasks such as document clustering, where coherence is crucial. In contrast, while LDA and BERTopic excel in generalization and perplexity, their lower coherence scores indicate challenges in producing interpretable topics from short-text datasets like the ATSB reports.
pLSA, while useful in uncovering latent structures, was less effective in terms of scalability and generalization, making it less reliable for such large datasets. BERTopic’s transformer-based approach, though dynamic and context-sensitive, faced challenges in capturing the structured nature of the dataset and incurred a high computational cost. Despite its flexibility, the performance of BERTopic in structured domains like aviation safety was limited in comparison to NMF.
Ultimately, NMF’s superior coherence and interpretability make it the most suitable choice for the ATSB dataset, particularly for generating distinct and relevant topics for aviation safety analysis. However, future work should explore hybrid approaches that combine the advantages of different models to enhance topic coherence and interpretability. Additionally, exploring real-time applications of topic modeling in aviation safety monitoring could support proactive risk management, the early detection of safety threats, and targeted interventions.
The insights derived from topic modeling can aid aviation safety analysts in understanding recurring themes in incident reports, inform the development of focused safety recommendations, and guide the efficient allocation of resources toward mitigating high-risk areas. By leveraging advanced NLP techniques, further research can improve the automation of safety analysis, ultimately contributing to enhanced aviation safety outcomes.
This study contributes to the growing body of the literature on topic modeling, reinforcing the importance of selecting the most appropriate technique based on the dataset’s characteristics and the research goals. The practical implications of this work extend to safety policy development, operational decision-making, and the strategic prioritization of safety interventions in the aviation sector. By demonstrating the potential of topic modeling to reveal meaningful insights into aviation safety, this work paves the way for further research and practical applications in aviation safety analysis, with implications for both policy development and operational improvements.

Author Contributions

A.N.: conceptualization, methodology, software, data curation, validation, writing—original draft preparation, formal analysis; K.J.: validation, writing—review and editing; U.T.: writing—review and editing; G.W.: data collection, supervision, final draft. All authors have read and agreed to the published version of the manuscript.

Funding

This research received funding from the Tuition Fee Scholarship at UNSW.

Data Availability Statement

The data analyzed were from ATSB and are available under a Creative Commons Attribution 3.0 Australia license.

Acknowledgments

We would like to express our sincere gratitude to the ATSB authorities for providing the ATSB dataset, which was instrumental in conducting this research.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
| Abbreviation | Definition |
|---|---|
| ATSB | Australian Transport Safety Bureau |
| ASN | Aviation Safety Network |
| BERTopic | Bidirectional Encoder Representations from Transformers Topic Modeling |
| LDA | Latent Dirichlet Allocation |
| ML | Machine Learning |
| NMF | Non-Negative Matrix Factorization |
| NLP | Natural Language Processing |
| pLSA | Probabilistic Latent Semantic Analysis |

References

  1. Wild, G. Airbus A32x Versus Boeing 737 Safety Occurrences. IEEE Aerosp. Electron. Syst. Mag. 2023, 38, 4–12. [Google Scholar] [CrossRef]
  2. Australian Transport Safety Bureau. Investigation Report; Australian Transport Safety Bureau: Canberra, ACT, Australia, 1999. [Google Scholar]
  3. Nanyonga, A.; Wasswa, H.; Wild, G. Phase of Flight Classification in Aviation Safety Using LSTM, GRU, and BiLSTM: A Case Study with ASN Dataset. In Proceedings of the 2023 International Conference on High Performance Big Data and Intelligent Systems (HDIS), Macau, China, 6–8 December 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 24–28. [Google Scholar]
  4. Nanyonga, A.; Wasswa, H.; Wild, G. Aviation Safety Enhancement via NLP & Deep Learning: Classifying Flight Phases in ATSB Safety Reports. In Proceedings of the 2023 Global Conference on Information Technologies and Communications (GCITC), Bangalore, India, 1–3 December 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar]
  5. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022. [Google Scholar]
  6. Lee, D.; Seung, H.S. Algorithms for non-negative matrix factorization. In Proceedings of the NIPS’00: Proceedings of the 14th International Conference on Neural Information Processing Systems, NeurIPS 2000, Denver, CO, USA, 27 November–2 December 2000. [Google Scholar]
  7. Hofmann, T. Probabilistic Latent Semantic Analysis; UAI: Rio de Janeiro, Brazil, 1999; Volume 99, pp. 289–296. [Google Scholar]
  8. Grootendorst, M.J. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv 2022, arXiv:2203.05794. [Google Scholar]
  9. Stoltz, D.S.; Taylor, M.A. text2map: R tools for text matrices. J. Open Source Softw. 2022, 7, 3741. [Google Scholar] [CrossRef]
  10. Lee, D.D.; Seung, H.S. Learning the parts of objects by non-negative matrix factorization. Nature 1999, 401, 788–791. [Google Scholar] [CrossRef] [PubMed]
  11. Nanyonga, A.; Joiner, K.; Turhan, U.; Wild, G. Applications of natural language processing in aviation safety: A review and qualitative analysis. In Proceedings of the AIAA SCITECH 2025 Forum, Orlando, FL, USA, 6–10 January 2025; p. 2153. [Google Scholar]
  12. Yang, C.; Huang, C.J.A. Natural language processing (NLP) in aviation safety: Systematic review of research and outlook into the future. Aerospace 2023, 10, 600. [Google Scholar] [CrossRef]
  13. Luo, Y.; Shi, H. Using lda2vec topic modeling to identify latent topics in aviation safety reports. In Proceedings of the 2019 IEEE/ACIS 18th International Conference on Computer and Information Science (ICIS), Beijing, China, 17–19 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 518–523. [Google Scholar]
  14. Xu, J.; Li, T. Application of multimodal NLP instruction combined with speech recognition in oral english practice. Mob. Inf. Syst. 2022, 2022, 2262696. [Google Scholar] [CrossRef]
  15. Ricketts, J.; Barry, D.; Guo, W.; Pelham, J.J.S. A scoping literature review of natural language processing application to safety occurrence reports. Safety 2023, 9, 22. [Google Scholar] [CrossRef]
  16. Morais, C.; Yung, K.L.; Johnson, K.; Moura, R.; Beer, M.; Patelli, E. Identification of human errors and influencing factors: A machine learning approach. Saf. Sci. 2022, 146, 105528. [Google Scholar] [CrossRef]
  17. Jiao, Y.; Dong, J.; Han, J.; Sun, H. Classification and causes identification of Chinese civil aviation incident reports. Appl. Sci. 2022, 12, 10765. [Google Scholar] [CrossRef]
  18. Robinson, S.D. Visual representation of safety narratives. Saf. Sci. 2016, 88, 123–128. [Google Scholar] [CrossRef]
  19. Ahadh, A.; Binish, G.V.; Srinivasan, R. Text mining of accident reports using semi-supervised keyword extraction and topic modeling. Process. Saf. Environ. Prot. 2021, 155, 455–465. [Google Scholar] [CrossRef]
  20. Robinson, S.D. Temporal topic modeling applied to aviation safety reports: A subject matter expert review. Saf. Sci. 2019, 116, 275–286. [Google Scholar] [CrossRef]
  21. Kuhn, K.D. Using structural topic modeling to identify latent topics and trends in aviation incident reports. Transp. Res. Part C Emerg. Technol. 2018, 87, 105–122. [Google Scholar] [CrossRef]
  22. Zhong, B.; Pan, X.; Love, P.E.; Sun, J.; Tao, C. Hazard analysis: A deep learning and text mining framework for accident prevention. Adv. Eng. Inform. 2020, 46, 101152. [Google Scholar] [CrossRef]
  23. Krishnan, A. Exploring the power of topic modeling techniques in analyzing customer reviews: A comparative analysis. arXiv 2023, arXiv:2308.11520. [Google Scholar]
  24. Datchanamoorthy, K.J. Text mining: Clustering using bert and probabilistic topic modeling. Soc. Inform. J. 2023, 2, 1–13. [Google Scholar] [CrossRef]
  25. Bellaouar, S.; Bellaouar, M.M.; Ghada, I.E. Topic modeling: Comparison of LSA and LDA on scientific publications. In Proceedings of the 2021 4th International Conference on Data Storage and Data Engineering, Barcelona, Spain, 18–20 February 2021; pp. 59–64. [Google Scholar]
  26. Nanayakkara, A.C.; Thennakoon, G.J. Enhancing Social Media Content Analysis with Advanced Topic Modeling Techniques: A Comparative Study. Int. J. Adv. ICT Emerg. Reg. 2024, 17, 40–47. [Google Scholar] [CrossRef]
  27. Kaur, A.; Wallace, J.R. Moving Beyond LDA: A Comparison of Unsupervised Topic Modelling Techniques for Qualitative Data Analysis of Online Communities. arXiv 2024, arXiv:2412.14486. [Google Scholar]
  28. Bagheri, R.; Entezarian, N.; Sharifi, M.H. Topic Modeling on System Thinking Themes Using Latent Dirichlet Allocation, Non-Negative Matrix Factorization and BER Topic. J. Syst. Think. Pract. (JSTINP) 2023, 2, 33–56. [Google Scholar]
  29. Abuzayed, A.; Al-Khalifa, H.J. BERT for Arabic topic modeling: An experimental study on BERTopic technique. Procedia Comput. Sci. 2021, 189, 191–194. [Google Scholar] [CrossRef]
  30. Mihajlov, T.; Nešić, M.I.; Stanković, R.; Kitanović, O. Topic Modeling of the SrpELTeC Corpus: A Comparison of NMF, LDA, and BERTopic. In Proceedings of the 2024 19th Conference on Computer Science and Intelligence Systems (FedCSIS), Belgrade, Serbia, 8–11 September 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 649–653. [Google Scholar]
  31. Rose, R.L.; Puranik, T.G.; Mavris, D.N.; Rao, A.H.J.R.E.; Safety, S. Application of structural topic modeling to aviation safety data. Reliab. Eng. Syst. Saf. 2022, 224, 108522. [Google Scholar] [CrossRef]
  32. Nanyonga, A.; Wasswa, H.; Turhan, U.; Joiner, K.; Wild, G. Exploring Aviation Incident Narratives Using Topic Modeling and Clustering Techniques. In Proceedings of the 2024 IEEE Region 10 Symposium (TENSYMP), New Delhi, India, 27–29 September 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6. [Google Scholar]
  33. Nanyonga, A.; Wasswa, H.; Wild, G. Topic Modeling Analysis of Aviation Accident Reports: A Comparative Study between LDA and NMF Models. In Proceedings of the 2023 3rd International Conference on Smart Generation Computing, Communication and Networking (SMART GENCON), Bangalore, India, 29–31 December 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–2. [Google Scholar]
  34. Nanyonga, A.; Wasswa, H.; Turhan, U.; Joiner, K.; Wild, G. Comparative Analysis of Topic Modeling Techniques on ATSB Text Narratives Using Natural Language Processing. In Proceedings of the 2024 3rd International Conference for Innovation in Technology (INOCON), Bangalore, India, 1–3 March 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–7. [Google Scholar]
  35. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019. [Google Scholar]
  36. Mbaye, S.; Walsh, H.S.; Jones, G.; Davies, M. BERT-based Topic Modeling and Information Retrieval to Support Fishbone Diagramming for Safe Integration of Unmanned Aircraft Systems in Wildfire Response. In Proceedings of the 2023 IEEE/AIAA 42nd Digital Avionics Systems Conference (DASC), Barcelona, Spain, 1–5 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–7. [Google Scholar]
  37. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  38. Xing, Y.; Wu, Y.; Zhang, S.; Wang, L.; Cui, H.; Jia, B.; Wang, H. Discovering latent themes in aviation safety reports using text mining and network analytics. Int. J. Transp. Sci. Technol. 2024, 16, 292–316. [Google Scholar] [CrossRef]
  39. Liu, W.; Zhang, H.; Shi, Z.; Wang, Y.; Chang, J.; Zhang, J. Risk topics discovery and trend analysis in air traffic control operations—air traffic control incident reports from 2000 to 2022. Sustainability 2023, 15, 12065. [Google Scholar] [CrossRef]
  40. Xu, Y.; Gan, Z.; Guo, R.; Wang, X.; Shi, K.; Ma, P. Hazard Analysis for Massive Civil Aviation Safety Oversight Reports Using Text Classification and Topic Modeling. Aerospace 2024, 11, 837. [Google Scholar] [CrossRef]
  41. Paul, S.; Purkaystha, B.S.; Das, P.J. NLP tools used in civil aviation: A survey. Int. J. Adv. Res. Comput. Sci. 2018, 9, 109–114. [Google Scholar] [CrossRef]
  42. Blair, S.J.; Bi, Y.; Mulvenna, M.D. Aggregated topic models for increasing social media topic coherence. Appl. Intell. 2020, 50, 138–156. [Google Scholar] [CrossRef]
  43. Wang, Y.-X.; Zhang, Y.-J. Nonnegative matrix factorization: A comprehensive review. IEEE Trans. Knowl. Data Eng. 2012, 25, 1336–1353. [Google Scholar] [CrossRef]
  44. Eggert, J.; Korner, E. Sparse coding and NMF. In Proceedings of the 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat No 04CH37541), Budapest, Hungary, 25–29 July 2004; IEEE: Piscataway, NJ, USA, 2004; pp. 2529–2533. [Google Scholar]
  45. Song, H.A.; Lee, S.-Y. Hierarchical representation using NMF. In Proceedings of the Neural Information Processing: 20th International Conference, ICONIP 2013, Daegu, Republic of Korea, 3–7 November 2013; Proceedings, Part I 20. Springer: Berlin/Heidelberg, Germany, 2013; pp. 466–473. [Google Scholar]
  46. Tijare, P.; Rani, P.J. Exploring popular topic models. J. Phys. Conf. Ser. 2020, 1706, 012171. [Google Scholar] [CrossRef]
  47. Galli, C.; Colangelo, M.T.; Meleti, M.; Guizzardi, S.; Calciolari, E. Topic Analysis of the Literature Reveals. Big Data Cogn. Comput. 2024, 9, 7. [Google Scholar] [CrossRef]
  48. Vorontsov, K.; Potapenko, A. Additive regularization of topic models. Mach. Learn. 2015, 101, 303–323. [Google Scholar] [CrossRef]
  49. Wang, R.-S.; Zhang, S.; Wang, Y.; Zhang, X.-S.; Chen, L. Clustering complex networks and biological networks by nonnegative matrix factorization with various similarity measures. Neurocomputing 2008, 72, 134–141. [Google Scholar] [CrossRef]
  50. Egger, R.; Yu, J. A topic modeling comparison between LDA, NMF, Top2Vec, and BERTopic to demystify Twitter posts. Front. Sociol. 2022, 7, 886498. [Google Scholar] [CrossRef]
  51. Mbaye, S.; Walsh, H.S.; Davies, M.; Infeld, S.I.; Jones, G. From BERTopic to SysML: Informing Model-Based Failure Analysis with Natural Language Processing for Complex Aerospace Systems. In Proceedings of the AIAA SCITECH 2024 Forum, Orlando, FL, USA, 8–12 January 2024; p. 2700. [Google Scholar]
  52. Griffiths, T.L.; Steyvers, M. Finding scientific topics. Proc. Natl. Acad. Sci. USA 2004, 101 (Suppl. 1), 5228–5235. [Google Scholar] [CrossRef] [PubMed]
  53. Shastry, P.; Prakash, C. Comparative analysis of LDA, LSA and NMF topic modelling for web data. AIP Conf. Proc. 2023, 2901, 060006. [Google Scholar]
  54. Blei, D.M.; Lafferty, J.D. A correlated topic model of science. Ann. Appl. Stat. 2007, 1, 17–35. [Google Scholar] [CrossRef]
  55. Bosch, A.; Zisserman, A.; Munoz, X. Scene classification via pLSA. In Proceedings of the Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; Proceedings, Part IV 9. Springer: Berlin/Heidelberg, Germany, 2006; pp. 517–530. [Google Scholar]
  56. Gaussier, E.; Goutte, C. Relation between PLSA and NMF and implications. In Proceedings of the 28th annual international ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, 15 August 2005; pp. 601–602. [Google Scholar]
Figure 1. Methodological architecture.
Figure 2. A visualization of the inter-topic distances generated using pyLDAvis.
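For readers wishing to reproduce a map like Figure 2, the sketch below shows one common way an inter-topic distance visualization can be generated with pyLDAvis. The variable names (lda_model, corpus, dictionary) are assumptions standing in for a trained gensim LDA model and its inputs, not the authors' exact code.

```python
# Minimal sketch, assuming `lda_model`, `corpus`, and `dictionary` come from
# an earlier gensim training step (hypothetical names).
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

panel = gensimvis.prepare(lda_model, corpus, dictionary)  # projects topics into 2-D
pyLDAvis.save_html(panel, "lda_intertopic_map.html")      # interactive page like Figure 2
```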
Figure 3. Intuition of NMF.
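Figure 3 depicts the factorization underlying NMF: a non-negative document-term matrix V is approximated as the product of a document-topic matrix W and a topic-term matrix H, i.e., V ≈ WH. The following is a minimal, illustrative sketch with scikit-learn; the toy documents and parameters are hypothetical, not the study's configuration.

```python
# Illustrative NMF factorization: V (documents x terms) ~= W (documents x topics) @ H (topics x terms)
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["during the descent the aircraft was struck by lightning",
        "during take off the aircraft struck a bird"]
V = TfidfVectorizer().fit_transform(docs)          # non-negative TF-IDF matrix
model = NMF(n_components=2, init="nndsvd", random_state=0)
W = model.fit_transform(V)                         # document-topic weights
H = model.components_                              # topic-term weights
```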
Figure 4. BERTopic’s interactive inter-topic distance map.
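Interactive maps like Figure 4 are a built-in output of the BERTopic library. The sketch below is a hedged example of how such a map can be produced; `narratives` is assumed to be a list of ATSB summary strings, and the parameters are illustrative rather than the study's settings.

```python
# Minimal BERTopic sketch; `narratives` is a hypothetical list of report summaries.
from bertopic import BERTopic

topic_model = BERTopic(language="english", calculate_probabilities=False)
topics, _ = topic_model.fit_transform(narratives)
fig = topic_model.visualize_topics()               # interactive inter-topic distance map
fig.write_html("bertopic_intertopic_map.html")
```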
Figure 5. Coherence scores and perplexity for each model.
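For context, the sketch below shows how coherence and perplexity of the kind plotted in Figure 5 are commonly computed with gensim. The names `lda_model`, `texts`, `dictionary`, and `corpus` are assumed to exist from model training; this is not necessarily the authors' exact evaluation code. The negative perplexity values reported for some models are consistent with a log-scale measure such as gensim's log_perplexity.

```python
# Sketch of the two evaluation metrics, assuming gensim objects from training.
from gensim.models import CoherenceModel

coherence = CoherenceModel(model=lda_model, texts=texts,
                           dictionary=dictionary, coherence="c_v").get_coherence()
log_perplexity = lda_model.log_perplexity(corpus)   # per-word log-likelihood bound
print(f"coherence={coherence:.4f}, perplexity={log_perplexity:.4f}")
```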
Figure 6. Topic distribution for the pLSA model.
Figure 7. Topic distribution for the NMF model.
Figure 8. Topic distribution for the LDA model.
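Topic distributions such as those in Figures 6–8 can be derived by assigning each report to its most heavily weighted topic. The sketch below assumes a document-topic matrix `W`, such as the one produced in the NMF sketch above; it is an illustration, not the authors' plotting code.

```python
# Count how many narratives fall under each dominant topic and plot the distribution.
import numpy as np
import matplotlib.pyplot as plt

dominant = W.argmax(axis=1)                         # most weighted topic per document
topics, counts = np.unique(dominant, return_counts=True)
plt.bar(topics, counts)
plt.xlabel("Topic")
plt.ylabel("Number of reports")
plt.show()
```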
Figure 9. Wordcloud for pLSA.
Figure 10. Wordcloud for NMF.
Figure 11. Wordcloud for LDA.
Figure 12. Wordcloud for BERTopic.
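Word clouds like Figures 9–12 are typically rendered from each topic's term weights. The following illustrative sketch uses the wordcloud package; `topic_terms` is a hypothetical word-to-weight mapping, not the study's actual output.

```python
# Render a word cloud from a topic's term weights (hypothetical example values).
from wordcloud import WordCloud
import matplotlib.pyplot as plt

topic_terms = {"bird": 0.31, "struck": 0.27, "aircraft": 0.22, "approach": 0.11, "runway": 0.09}
wc = WordCloud(width=800, height=400, background_color="white").generate_from_frequencies(topic_terms)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```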
Figure 13. Topic word scores for NMF.
Figure 14. Topic word scores for BERTopic.
Figure 15. Topic word scores for pLSA.
Figure 16. Top words for each topic chosen by LDA.
Figure 17. Top words for each topic chosen by NMF.
Figure 18. Top words for each topic chosen by pLSA.
Figure 19. Top words for each topic chosen by the BERTopic model.
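Lists of top words per topic, as shown in Figures 16–19, can be read directly off a fitted model's components. A minimal sketch for scikit-learn models follows; `nmf_model` and `tfidf_vectorizer` are assumed to come from an earlier fit and are illustrative names only.

```python
# Extract the n most heavily weighted terms per topic from a fitted sklearn model.
import numpy as np

def top_words(model, vectorizer, n_top=10):
    terms = np.array(vectorizer.get_feature_names_out())
    return [terms[comp.argsort()[::-1][:n_top]].tolist() for comp in model.components_]

for i, words in enumerate(top_words(nmf_model, tfidf_vectorizer)):
    print(f"Topic {i}: {', '.join(words)}")
```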
Table 2. Sample records from the ATSB dataset.
Ref. | Date/Time | Location | State | Phase of Flight | Summary | Injury Level | Departure/Destination
OA2013-00142 | 1/1/2013 12:01 a.m. | near Port Hedland Aerodrome | WA | Descent | During the descent, the aircraft was struck by lightning… | Nil | Perth [YPPH] → Port Hedland [YPPD]
OA2013-00167 | 1/1/2013 1:00 a.m. | near Williamtown Aerodrome | NSW | Initial Climb | During the initial climb, the aircraft encountered windshear… | Unknown | Williamtown [YWLM] → Melbourne [YMML]
OA2013-00196 | 1/1/2013 1:00 a.m. | Bali International Airport | Other | Standing | Passenger declared undeclared fireworks on board… | Nil | Bali [WADD] → Perth [YPPH]
OA2013-00053 | 1/1/2013 8:40 a.m. | Groote Eylandt Aerodrome | NT | Take-off | During take-off, the aircraft struck a bird… | Nil | Groote Eylandt [YGTE] → Cairns [YBCS]
OA2013-00087 | 1/5/2013 12:01 a.m. | Toowoomba Aerodrome | QLD | Unknown | During a runway inspection, ground staff retrieved a bird carcass… | Nil | -
OA2013-00045 | 1/5/2013 6:55 a.m. | Darwin Aerodrome | NT | Initial Climb | During the initial climb, the aircraft struck a … | Nil | Darwin [YPDN] → Dili [WPDL]
OA2013-00248 | 1/5/2013 8:00 a.m. | Isisford (ALA) | QLD | Landing | Aircraft bounced on one wheel in crosswind landing. Gear detached during go-around and subsequent landing caused substantial damage… | Substantial | Isisford [YISF] → Isisford [YISF]
OA2013-00055 | 1/5/2013 8:30 a.m. | Ballina/Byron Gateway Aerodrome | NSW | Approach | During final approach, the aircraft struck a swallow… | Nil | Sydney [YSSY] → Ballina/Byron [YBNA]
OA2013-00245 | 1/5/2013 8:30 a.m. | Sydney Aerodrome | NSW | Standing | Ground staff observed smoke from the APU; engineers identified oil leak as the source… | Nil | Sydney [YSSY]
OA2013-00067 | 1/5/2013 8:35 a.m. | Perth Aerodrome | WA | Unknown | During a runway inspection, the safety officer retrieved a kestrel carcass… | Nil | -
OA2013-00051 | 1/5/2013 9:30 a.m. | Parafield Aerodrome | SA | Take-off | During take-off run, the aircraft struck a magpie. | Nil | Parafield [YPPF] → Parafield [YPPF]
OA2013-00634 | 1/5/2013 10:15 a.m. | Perth Aerodrome | WA | Initial Climb | Crew received pitot static system warnings and returned. Engineering found no faults. | Nil | Perth [YPPH] → Sydney [YSSY]
OA2013-00233 | 1/5/2013 10:40 a.m. | near Karratha Aerodrome | WA | Cruise | Crew received GPU warning during cruise and returned. Inspection found GPU door not latched correctly. | Nil | Karratha [YPKA] → Karratha [YPKA]
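Records like those in Table 2 would typically be reduced to their narrative Summary field before topic modeling. The sketch below is a hypothetical preprocessing step; the file name and column label are assumptions for illustration, not the ATSB's or the authors' actual schema.

```python
# Hypothetical loading and light cleaning of occurrence summaries with pandas.
import re
import pandas as pd

df = pd.read_csv("atsb_occurrences_2013_2023.csv")          # assumed export file name
summaries = (df["Summary"].dropna()
               .str.lower()
               .apply(lambda s: re.sub(r"[^a-z\s]", " ", s))  # drop punctuation and digits
               .str.split())                                  # simple whitespace tokens
```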
Table 3. Performance summary for each model.
Model | Coherence Score | Perplexity
pLSA | 0.7634 | −4.6237
LDA | 0.4394 | −6.471
NMF | 0.7987 | 2.0739
BERTopic | 0.264 | −4.638
Table 4. The top 10 words chosen by each model, along with the corresponding theme.
LDA Model | BERTopic Model | pLSA Model | NMF Model | Theme | Topic No.
fumes, detected, failed, cabin, cruise, crew, descent, engineering, inspection, source | bird, landing, struck, butcherbird, parrot, turkey, aircraft, bundey, during, pygmy | aircraft, struck, landing, take, bird, approach, runway, multiple, taxi, magpie, entered, without | bird, struck, aircraft, approach, climb, kite, initial, takeoff, landing, run | Bird Strikes and Landing Issues | 0
aircraft, crew, separation, runway, ATC, approach, resulting, observed, Cessna, loss | bird, approach, struck, pale, fantail, durig, frigatebird, stilt, aircraft, blackbird | damage, aircraft, resulting, minor, landing, pilot, sustained, collided, terrain, substantial, operations, control | approach, missed, windshear, conducted, encountered, crew, aircraft, final, flap, ft | Approach and Airspace Separation | 1
take, rejected, crew, swallow, rough, martin, fairy, WINDSHEAR, lapwing, masked | entered, taxi, without, clearance, runway, duty, runways, strip, transmission, comply | approach, crew, aircraft, missed, conducted, encountered, flap, windshear, final, runway, PA28, turbulence | strike, evidence, occurred, determined, birdstrike, flight, post, detected, inspection, pre | Clearance and Taxiway Incidents | 2
engine, inspection, flight, detected, climb, post, fuel, routine, revealed, determined | initial, cocos, ngukurr, durign, durring, climb, bird, maryborough, denpasar, parrot | received, landing, crew, gear, alert, approach, GPWS, indication, unsafe, climb, failed | landing, struck, aircraft, multiple, magpie, roll, gear, bat, galah, swallow | Engine and Mechanical Failures | 3
pilot, aircraft, flight, helicopter, terrain, control, damage, sustained, increase, collided | windscreen, cracked, shattered, window, pane, windshield, outer, arcing, layer, heating | pilot, ft, aircraft, flight, Passing, helicopter, runway, observed, VH, circuit, registered, normal | retrieved, officer, safety, carcass, runway, routine, inspection, fox, flying, magpie | Pilot Operations and Mid-Air Collisions | 4
RPA, operations, aircraft, aerial, ATC, normal, resulting, collided, door, communications | takeoff, dave, fortescue, bird, forrest, nadi, swan, turkey, lilydale, winged | crew, inspection, engineering, detected, revealed, returned, replaced, fumes, Engineers, engine, aircraft, climb | engine, failed, cruise, crew, returned, engineering, revealed, climb, inspection, gear | RPA and ATC Operations | 5
approach, aircraft, crew, encountered, alert, missed, received, conducted, GPWS, clearance | pre, birdstrike, evidence, strike, could, occurred, determined, deteremined, flight, bridstrike | inspection, safety, officer, flight, runway, retrieved, post, carcass, routine, detected, determined, could | resulting, damage, minor, encountered, aircraft, turbulence, pilot, substantial, sustained, collided | Approach and Safety Warnings | 6
aircraft, struck, landing, bird, damage, minor, resulting, approach, runway, multiple | final, stint, bird, edinburgh, necked, raven, approach, struck, feet, red | fuel, flight, issue, aircraft, destroyed, investigation, pre, crew, balloon, due, pitot, Jandakot | fumes, cabin, detected, source, descent, cockpit, engineering, did, reveal, inspection | Landing and Fuel System Issues | 7
crew, received, landing, gear, approach, engineering, returned, replaced, Engineers, aircraft | climbing, turn, bank, angle, gpws, alert, anlge, received, recieved, crew | engine, crew, pilot, RPA, aircraft, observed, radio, failed, TCAS, cruise, RA, received | gpws, received, alert, bank, angle, crew, climb, warning, climbing, turn | GPWS and Flight Alerts | 8
runway, safety, officer, inspection, retrieved, carcass, forced, Australian, partial, determine | plover, spur, winged, landing, struck, aircraft, minor, damage, resulting | separation, ATC, aircraft, resulting, crew, runway, clearance, Cessna, loss, track, without, controller | runway, clearance, entered, taxi, aircraft, atc, airspace, controlled, incorrect, separation | Runway Incursions | 9
Table 5. Strengths and limitations of each model.
Model | Strengths | Limitations
LDA | Provides interpretable topics; generates distinct word clusters for each topic; works well for short and long texts. | Requires manual tuning of the number of topics; struggles with overlapping topics; topics can be less coherent for complex datasets.
BERTopic | Uses word embeddings, making it context-aware; provides dynamic topic reduction; can handle large datasets efficiently; offers visualizations (e.g., topic evolution, similarity graphs). | Computationally expensive due to transformers; requires fine-tuning of hyperparameters; less interpretable than LDA.
pLSA | Good for small datasets; finds latent structures in data; works well with document similarity tasks. | Suffers from overfitting; does not generalize well to new data; lacks a probabilistic prior, leading to instability.
NMF | Produces coherent topics; works well for short documents; more deterministic (less randomness). | Requires normalized data; less flexible for diverse document structures; can be sensitive to noise in data.
Table 6. Evaluation of each model on six key aspects: interpretability, granularity, scalability, topic coherence, computational cost, and flexibility.
Aspect | LDA (Latent Dirichlet Allocation) | BERTopic (Bidirectional Encoder Representations for Topics) | pLSA (Probabilistic Latent Semantic Analysis) | NMF (Non-Negative Matrix Factorization)
Interpretability | Topics are easy to interpret | Less interpretable, depends on embeddings | Moderate, lacks probabilistic priors | Interpretable, but depends on preprocessing
Granularity | May mix topics if not well-tuned | Fine-grained topic separation | Decent granularity, but can mix topics | Creates distinct topics, good separation
Scalability | Scales well, but is slow on large data | Scales well, but computationally expensive | Struggles with large datasets | Scales well but is sensitive to noise
Topic Coherence | Good but requires tuning | Leverages contextual embeddings | Can generate less coherent topics | Produces clear topics with distinct words
Computational Cost | Moderate, but increases with more topics | High due to transformer embeddings | Computationally expensive, not scalable | Moderate, needs matrix factorization
Flexibility | Requires parameter tuning for coherence | Highly flexible, allows dynamic topic modeling | Less flexible, predefined number of topics | Requires non-negative constraints
Best Use Case | General topic modeling, balanced performance | Complex text, fine-grained topics, embeddings-based | Small datasets, early-stage analysis | Text data with clear structure, document clustering
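One practical consequence of the properties summarized in Tables 5 and 6 is the choice of input representation: LDA is usually fit on raw term counts, whereas NMF benefits from normalized TF-IDF features. The sketch below illustrates that distinction under the assumption that `summaries` holds tokenized narratives from the earlier preprocessing sketch; the thresholds and topic count are illustrative, not the study's settings.

```python
# Contrast of typical input representations for LDA (counts) and NMF (TF-IDF).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

docs = [" ".join(tokens) for tokens in summaries]              # rejoin tokens into strings
counts = CountVectorizer(max_df=0.95, min_df=2).fit_transform(docs)
tfidf = TfidfVectorizer(max_df=0.95, min_df=2).fit_transform(docs)

lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(counts)
nmf = NMF(n_components=10, init="nndsvd", random_state=0).fit(tfidf)
```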