Semantic Topic Modeling of Aviation Safety Reports: A Comparative Analysis Using BERTopic and PLSA

by
Aziida Nanyonga
1,
Keith Joiner
2,
Ugur Turhan
3 and
Graham Wild
3,*
1
School of Engineering and Technology, University of New South Wales, Canberra, ACT 2600, Australia
2
Capability Systems Centre, University of New South Wales, Canberra, ACT 2610, Australia
3
School of Science, University of New South Wales, Canberra, ACT 2612, Australia
*
Author to whom correspondence should be addressed.
Aerospace 2025, 12(6), 551; https://doi.org/10.3390/aerospace12060551
Submission received: 17 May 2025 / Revised: 14 June 2025 / Accepted: 16 June 2025 / Published: 16 June 2025
(This article belongs to the Section Air Traffic and Transportation)

Abstract
Aviation safety analysis increasingly relies on extracting actionable insights from narrative incident reports to support risk identification and improve operational safety. Topic modeling techniques such as Probabilistic Latent Semantic Analysis (pLSA) and BERTopic offer automated methods to uncover latent themes in unstructured safety narratives. This study evaluates the effectiveness of each model in generating coherent, interpretable, and semantically meaningful topics for aviation safety practitioners and researchers. We assess model performance using both quantitative metrics (topic coherence scores) and qualitative evaluations of topic relevance. The findings show that while pLSA provides a solid probabilistic framework, BERTopic, leveraging transformer-based embeddings and HDBSCAN clustering, produces more nuanced, context-aware topic groupings, albeit with increased computational demands and tuning complexity. These results highlight the respective strengths and trade-offs of traditional versus modern topic modeling approaches in aviation safety analysis. This work advances the application of natural language processing (NLP) in aviation by demonstrating how topic modeling can support risk assessment, inform policy, and enhance safety outcomes.

1. Introduction

The analysis of aviation safety reports plays a vital role in identifying recurring hazards, understanding contributory factors, and implementing corrective actions to enhance flight safety [1]. These reports, often prepared by pilots, air traffic controllers, and safety investigators, contain unstructured textual narratives that describe the sequence of events, environmental conditions, and operational decisions leading to aviation incidents and accidents. Such qualitative data, while rich in insight, is challenging to analyze at scale using traditional manual methods. As global aviation activity continues to rise, the accumulation of safety data has become vast and complex, making it necessary to adopt advanced computational methods for processing and interpreting these narratives [2].
In recent years, topic modeling has emerged as a valuable text mining approach for exploring hidden themes and semantic structures within large corpora. By automatically discovering latent topics in textual data, topic modeling allows safety analysts and researchers to group similar terms and uncover prevalent issues across incident reports. This contributes to enhanced situational awareness, data-driven risk assessment, and more informed policy development. Classic topic modeling approaches such as Latent Dirichlet Allocation (LDA) and Probabilistic Latent Semantic Analysis (pLSA) have been widely adopted in domains such as healthcare, legal analysis, and social sciences [3,4,5]. In aviation, these methods have been used to explore patterns in flight safety narratives, pilot reports, and accident databases [6,7].
pLSA, introduced by Hofmann (1999), models the probability of a word given a document through a latent class model that assumes each document is a mixture of topics and each topic is a distribution over words [8]. While pLSA was foundational in demonstrating the potential of probabilistic models for document clustering, it suffers from several limitations. These include its tendency to overfit, lack of a generative model for new documents, and difficulty scaling to large corpora. Additionally, pLSA’s bag-of-words assumption fails to capture semantic relationships and contextual meanings, which are especially important in safety-critical narratives where subtle language nuances can signify distinct operational risks.
To address these shortcomings, modern topic modeling approaches have begun leveraging recent advances in deep learning and language modeling. One such model is BERTopic, which integrates transformer-based embeddings (e.g., BERT) with clustering algorithms such as Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) and dimensionality reduction techniques like Uniform Manifold Approximation and Projection (UMAP). By generating dense vector representations of text that retain contextual information, BERTopic enables dynamic topic extraction with higher semantic accuracy and interpretability [9,10]. This contextual awareness is critical in domains like aviation safety, where reports may include domain-specific jargon, evolving operational conditions, and intricate incident sequences. For instance, BERTopic can distinguish between topics like “engine flameout during takeoff” and “engine malfunction during cruise” due to its ability to capture the context surrounding keywords.
To address this gap, the present study evaluates and compares the effectiveness of two topic modeling techniques, pLSA and BERTopic, in extracting interpretable and semantically coherent topics from aviation safety narratives. The overarching goal is to guide aviation safety analysts, data scientists, and policymakers in selecting the most suitable method for analyzing unstructured, narrative-based safety data. Using a proprietary dataset obtained from the Aviation Safety Network (ASN), which includes 4282 aviation incident reports categorized by damage levels, both models are applied to uncover latent themes related to mechanical failures, pilot misjudgments, and other operational scenarios. This study assesses model performance using quantitative metrics such as topic coherence (via the C_v score) and qualitative expert validation to determine topic relevance and interpretability. In doing so, this work contributes to aviation safety research by demonstrating the potential of advanced, context-aware NLP techniques to support automated risk detection and enhance data-driven decision-making within safety management systems (SMSs).
The remainder of the paper is organized as follows: Section 2 presents a review of relevant studies in topic modeling, particularly in the context of aviation safety. Section 3 details the methodology, including dataset characteristics and evaluation criteria. Section 4 discusses the experimental results, while Section 5 provides a detailed analysis. Finally, Section 6 concludes with insights and future research directions.

2. Related Work

Topic modeling has become a critical technique in natural language processing (NLP) for uncovering latent semantic structures within large collections of textual data. Over the years, various methods have been proposed to perform topic modeling, each offering unique strengths and weaknesses. This section provides an overview of key research in topic modeling, focusing on two influential approaches: pLSA and BERTopic, along with their applications in aviation safety and other domains.
pLSA is one of the seminal probabilistic approaches to topic modeling, introduced by Hofmann (1999) [8]. pLSA models the co-occurrence of terms and documents through the introduction of a latent variable that represents topics. The underlying assumption of pLSA is that each document is a mixture of topics, where each topic is characterized by a probability distribution over words. This probabilistic framework enables the extraction of hidden thematic structures from large datasets, facilitating the analysis of textual data in diverse domains [11].
Despite its foundational role in topic modeling, pLSA exhibits several limitations. One of the primary drawbacks is its tendency to overfit, especially when applied to smaller datasets, due to its reliance on a fixed generative model for new documents [12,13]. Additionally, pLSA employs the bag-of-words (BoW) model, which simplifies word relationships and fails to capture the semantic context of words, a crucial shortcoming when dealing with complex domains like aviation safety. This limitation results in topics that may lack meaningful semantic relationships, making pLSA less suitable for tasks where context and domain-specific knowledge are critical.
Despite these challenges, pLSA has been successfully applied in various fields, such as bioinformatics, where it has been used to analyze gene sequence data [14,15], and information retrieval [16], where it has been used to extract meaningful themes from large-scale document collections. Its contributions to the development of topic modeling have shaped subsequent advancements in this area.
In contrast to pLSA, BERTopic represents a more recent advancement in topic modeling that incorporates transformer-based embeddings combined with clustering techniques to generate dynamic and context-aware topics. BERTopic leverages sentence embeddings from transformer models like BERT (Bidirectional Encoder Representations from Transformers), which capture semantic relationships between words and phrases [17,18,19]. This enables BERTopic to generate more coherent, interpretable, and domain-relevant topics compared to traditional methods such as pLSA and Latent Dirichlet Allocation (LDA) [20].
BERTopic’s ability to use BERT embeddings allows it to capture contextual relationships that are important for understanding complex and evolving language. This is especially beneficial in domains like aviation safety, where terminology and incident descriptions often involve intricate details, such as “mechanical failure,” “pilot misjudgment,” or “airspace incursion” [9]. Additionally, BERTopic employs UMAP or HDBSCAN for dimensionality reduction and clustering, which allows it to handle high-dimensional data effectively and uncover meaningful topics even in noisy or sparse datasets [21].
Studies have demonstrated BERTopic’s superiority in domains requiring nuanced understanding, such as healthcare, where it has been used to identify themes from clinical notes [22], finance, where it has been applied to analyze financial documents [23], and social media analysis, where it has been used to extract insights from large volumes of user-generated content [24]. BERTopic’s ability to model the evolving nature of language makes it particularly suitable for complex domains like aviation safety, where the language used in safety reports can change over time as new technologies and practices emerge.
The application of topic modeling in aviation safety is an emerging field. Aviation safety reports, particularly those produced by ASN, contain valuable unstructured data that detail incidents and accidents, providing insights into systemic safety issues. These reports often describe critical events such as mechanical failures, human errors, and operational deficiencies, all of which are essential for improving safety protocols and preventing future occurrences [25].
Earlier studies have employed methods like LDA to extract themes from aviation safety data. LDA, a generative model that assumes documents are mixtures of topics and topics are mixtures of words, has been widely used in various text mining applications. However, its reliance on simplistic text representations often leads to less coherent topics and limits its ability to capture domain-specific nuances. For example, LDA has been used to model aviation safety reports, but it frequently generates topics that lack clarity and interpretability due to its oversimplified assumptions [26]. Similarly, pLSA has been applied in early studies of aviation safety reports but has faced challenges in scalability and fails to account for complex relationships within the text, particularly when interpreting industry-specific terminology like “aeronautical hazards” or “pilot deviation” [27].
Recent advancements, including the introduction of BERTopic, have shown considerable promise in aviation safety research. For instance, in 2022, Grootendorst demonstrated the effectiveness of BERTopic in analyzing complex datasets, including safety and risk analysis reports. Compared to traditional methods like pLSA, BERTopic provides better scalability, coherence, and interpretability, making it more suitable for analyzing the detailed and often lengthy narratives found in aviation safety reports [28]. The ability of BERTopic to adapt to the evolving language and context of aviation safety reports makes it an ideal tool for identifying emerging safety issues, monitoring trends, and categorizing risks.
While both pLSA and BERTopic have demonstrated success in different domains, few studies have directly compared these two methods, especially in the context of aviation safety. A comparative study by Ibraimoh et al. [29] highlighted the superiority of BERTopic over pLSA in terms of coherence and interpretability when applied to datasets such as Stack Overflow posts. Similarly, research has pointed out that transformer-based models like BERT outperform traditional probabilistic methods when applied to complex textual data, such as customer support documents and incident reports [6,30].
However, there remains a gap in the literature regarding direct comparisons between pLSA and BERTopic in aviation safety contexts. The few available studies that investigate topic modeling in aviation safety focus on traditional models and do not fully explore the advantages of transformer-based techniques like BERTopic in this domain. This study seeks to fill this gap by providing a detailed comparison of pLSA and BERTopic in extracting relevant and coherent topics from aviation safety reports, particularly focusing on the ASN dataset.
Building upon previous studies, this research aims to compare the effectiveness of pLSA and BERTopic in extracting meaningful and coherent topics from aviation safety reports. This study emphasizes the relevance, coherence, and domain-specific applicability of the models when applied to the ASN dataset, which contains detailed and categorized reports on aviation incidents. This comparison is expected to offer new insights into the strengths and limitations of both approaches in the context of aviation safety data, providing valuable contributions to the growing body of literature on NLP and topic modeling in aviation research.

3. Materials and Methods

This section provides a comprehensive outline of the methodology employed to evaluate and compare the performance of pLSA and BERTopic in extracting meaningful topics from aviation safety reports. The process consists of data collection and preprocessing, followed by the implementation of the two topic modeling techniques. The evaluation of their effectiveness is conducted using several performance metrics, as depicted in Figure 1.

3.1. Data Collection

Aviation incident and accident investigation reports are rich sources of information that detail the nature and causes of aviation safety events. Various organizations publish these reports, including ASN, the Australian Transport Safety Bureau (ATSB), the Aviation Safety Reporting System (ASRS), and the National Transportation Safety Board (NTSB). For this study, the focus was placed specifically on the ASN aviation incident and accident investigation reports. The dataset utilized in this study covers reports from 2013 to 2022, spanning a decade of aviation safety events. The data was sourced directly from the ASN website, and the dataset consists of 4875 records. The dataset includes detailed narratives of incidents and accidents, categorized by damage levels (e.g., Damaged beyond repair, Missing, Substantial, Destroyed, None, Minor, Unknown). After data cleaning and preprocessing, a refined dataset of 4282 records was created. This dataset, which serves as the core data for the analysis, includes the “Narrative” and “Damage Level” fields, which capture the textual content of the reports and the severity of damage to the aircraft. These two fields provide rich, unstructured data that is ideal for topic modeling techniques such as pLSA and BERTopic.

3.2. Data Processing

Before applying topic modeling techniques, the raw text data required preprocessing to ensure consistency and quality for analysis. The preprocessing involved several key stages to clean and transform the data into a format suitable for topic modeling. The first step was tokenization, where the text was split into individual words or tokens using the Natural Language Toolkit (NLTK). Tokenization is a crucial step in NLP that prepares the text for further analysis by breaking it down into manageable units. All text was then converted to lowercase to standardize the tokens and remove any case sensitivity that could affect the consistency of the topic modeling process. Commonly used words such as “the,” “and,” “or,” and “of,” which carry little semantic value, were removed using the NLTK stopword list; removing these stopwords ensures that the models focus on the more meaningful terms in the text. Next, the text was lemmatized using the WordNetLemmatizer from NLTK, which reduces words to their base or root forms. For example, words like “running” and “ran” were reduced to the base form “run,” improving consistency and ensuring that variations of the same word are treated as a single token. Finally, all special characters, punctuation marks, numbers, and other non-alphabetic symbols were filtered out so that the dataset contained only meaningful textual content. After these preprocessing steps, the text data formed a clean and uniform corpus. This ensured that both pLSA and BERTopic received identical input, making a fair and valid comparison between the two topic modeling approaches possible.
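As a rough illustration, the pipeline above can be sketched as follows. This is a minimal stand-in: the stopword list and the lemmatizer below are toy versions of the NLTK resources used in the study, included only to show the shape of the transformation.

```python
import re

# Toy stand-ins for NLTK's stopword list and WordNetLemmatizer,
# used here only to illustrate the pipeline shape.
STOPWORDS = {"the", "and", "or", "of", "a", "an", "to", "in", "on", "was", "were"}
IRREGULAR = {"ran": "run", "running": "run", "flew": "fly"}

def lemmatize(token):
    # Crude lemmatizer: dictionary lookup first, then simple suffix stripping.
    if token in IRREGULAR:
        return IRREGULAR[token]
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(narrative):
    # 1) lowercase, 2) keep alphabetic tokens only (drops digits/punctuation),
    # 3) remove stopwords, 4) lemmatize.
    tokens = re.findall(r"[a-z]+", narrative.lower())
    return [lemmatize(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The aircraft was running off Runway 27 and flew into terrain."))
```

In the study itself, NLTK's `word_tokenize`, its full English stopword list, and `WordNetLemmatizer` would replace these toy components.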

3.3. Topic Modeling Procedure

Once the data was preprocessed, it was ready for topic modeling. Both pLSA and BERTopic were applied independently to extract latent topics from the dataset, as outlined in the methodology framework (Figure 1). For pLSA, the process began with the creation of a document–term matrix (DTM), a sparse matrix in which each row represents a document and each column represents a unique word in the corpus. The values in the matrix represent the frequency of terms in each document. Model fitting was carried out using the Expectation Maximization (EM) algorithm, which iteratively estimates the topic–word and document–topic distributions. These distributions were used to identify the topics in the corpus, each represented as a probabilistic distribution over words. The resulting topics were then evaluated for coherence and relevance. BERTopic, on the other hand, uses transformer-based models to generate high-dimensional embeddings. The text was first converted into embeddings using pre-trained models like BERT, which represent the semantic content of the documents in a high-dimensional vector space. These embeddings were then clustered using the Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) algorithm. This clustering method groups embeddings that are close together in the vector space, which typically correspond to documents that are thematically related. Once the clusters were formed, dynamic topic representation was used to label each cluster with interpretable topics. This process involves extracting the most representative words from each cluster and assigning them as labels that best describe the topic. These two topic modeling techniques, pLSA and BERTopic, were thus implemented independently to extract meaningful topics from the aviation safety reports dataset. BERTopic also generates an Intertopic Distance Map as part of its output.
In this visualization, each circle represents a distinct topic identified by the model, with the size of each circle indicating the relative frequency of that topic within the dataset. The map is created using Uniform Manifold Approximation and Projection (UMAP), which reduces the high-dimensional topic embeddings into two dimensions (D1 and D2) to enable intuitive visual interpretation.

3.4. Evaluation Metrics

To evaluate and compare the performance of pLSA and BERTopic, a combination of quantitative and qualitative metrics was employed. Topic coherence, a widely used statistical metric, was used to assess semantic similarity among top words within each topic. Specifically, the C_v coherence metric from Gensim’s Coherence Model was adopted, as it has been shown to correlate well with human interpretability. Higher coherence values typically indicate more semantically meaningful and interpretable topics.
In addition to coherence, interpretability and thematic relevance were assessed via expert evaluation. Three independent aviation domain experts, each with over five years of experience in aviation safety analysis or investigation, reviewed automatically generated topics and manually assigned human-readable labels. These experts also evaluated whether the topics meaningfully represented distinct patterns in the data. Finally, scalability was evaluated by tracking training time and memory usage during model execution [31], particularly given the 4282 narrative records in the dataset. This metric offered practical insights into the models’ suitability for deployment in real-world aviation contexts.
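The scalability measurement described above (training time and memory usage) can be approximated with Python's standard library alone. The following model-agnostic wrapper is a sketch of that idea, with a dummy workload standing in for an actual pLSA or BERTopic fit.

```python
import time
import tracemalloc

def profile_fit(fit_fn, *args, **kwargs):
    """Record wall-clock time and peak memory for a model-fitting call.

    `fit_fn` is any callable that trains a topic model; the wrapper
    itself knows nothing about the model being fitted.
    """
    tracemalloc.start()
    start = time.perf_counter()
    result = fit_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak / 1024 ** 2  # result, seconds, peak MiB

# Dummy workload standing in for a real model fit.
result, seconds, peak_mib = profile_fit(lambda n: sum(range(n)), 100_000)
print(f"trained in {seconds:.3f}s, peak memory {peak_mib:.2f} MiB")
```

Note that `tracemalloc` only traces Python-level allocations; GPU memory used by transformer embeddings would need a separate tool.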

3.5. Experimental Setup

All experiments were implemented in Python 3.10 in a computing environment with an Intel i7 processor, 32 GB of RAM, and an NVIDIA GPU, which accelerated the computation of the transformer-based embeddings used in the BERTopic model. The key libraries used were BERTopic 0.13.0, Gensim 4.3.1, and NLTK 3.8.0.
For the pLSA model, the number of topics (k = 10) was selected after a series of preliminary runs where coherence and perplexity scores were monitored across different values of k. We observed diminishing returns in coherence improvement beyond 10 topics, with increased overlap between topics, which reduced interpretability. Thus, k = 10 was chosen as the optimal trade-off between statistical quality and semantic clarity.
In the BERTopic model, we used the default BERT-based transformer embeddings and adjusted the min_cluster_size to 15 to prevent topic fragmentation. We also tuned nr_topics to approximate the same number of topics as pLSA (i.e., 10), to enable a fair and consistent comparison across both models. These parameters were selected based on iterative runs optimizing topic coherence and topic diversity, while observing the granularity of output topics in domain-relevant contexts. Importantly, the impact of parameter choices was evident during tuning. For instance, reducing the number of topics led to overly broad themes, while higher values resulted in fragmentation and reduced topic coherence. Similarly, in BERTopic, a low min_cluster_size generated redundant micro-topics, while a high value merged distinct safety themes. Thus, the final parameter values represent a balance, tuned empirically, to maximize interpretability and statistical performance within the aviation safety domain. These empirically informed settings ensured that both models were evaluated under conditions that optimized their respective strengths while maintaining fairness and alignment in comparison.

4. Results

This section presents the findings from the application of two topic modeling techniques, pLSA and BERTopic, on a curated dataset obtained from the ASN. The analysis aimed to identify latent themes within narrative descriptions of aviation incidents and accidents, categorized by aircraft damage severity.

4.1. Model Performance Metrics

The performance of the pLSA and BERTopic models was evaluated using multiple topic modeling metrics to provide a more comprehensive assessment. Initially, two widely accepted indicators, topic coherence and perplexity, were employed. Topic coherence, computed using the UMass coherence measure, evaluates the degree of semantic similarity between high-probability words within a topic. Perplexity, on the other hand, assesses the model’s ability to predict unseen data, with lower values indicating better generalization. The pLSA model achieved a coherence score of 0.7634 and a perplexity of −4.6237, reflecting strong statistical consistency and semantically meaningful topic clusters. In contrast, the BERTopic model obtained a lower UMass coherence score of 0.531 but a better perplexity of −5.5377, suggesting improved predictive performance. Despite its lower coherence, BERTopic demonstrated greater interpretability due to its integration of transformer-based embeddings and dimensionality reduction via UMAP, which enables dynamic topic refinement and clearer visualization. To further strengthen the evaluation, we introduced the Topic Diversity metric, which measures the proportion of unique words across the top n words of all topics. This helps assess the lexical distinctiveness of topics and mitigate redundancy. BERTopic achieved a higher topic diversity score of 0.89, compared to 0.71 for pLSA, indicating that it produced more lexically diverse and distinguishable topics.
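The Topic Diversity metric is straightforward to compute. The following sketch, using made-up topic word lists, illustrates the unique-words-over-total calculation described above.

```python
def topic_diversity(topics, top_n=10):
    """Fraction of unique words among the top-n words of every topic.

    `topics` is a list of word lists, each ordered by within-topic weight.
    A score of 1.0 means no word is shared between topics; lower values
    indicate lexical overlap and potential redundancy.
    """
    top_words = [w for topic in topics for w in topic[:top_n]]
    return len(set(top_words)) / len(top_words)

# Toy example: two topics sharing the word "engine".
topics = [
    ["engine", "failure", "fuel", "shutdown", "vibration"],
    ["runway", "landing", "gear", "engine", "touchdown"],
]
print(topic_diversity(topics, top_n=5))  # 9 unique words / 10 -> 0.9
```

Applied to the top-10 word lists of all extracted topics, this calculation yields the 0.89 (BERTopic) and 0.71 (pLSA) scores reported above.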

4.2. Comparative Topic Analysis

4.2.1. Topic Words and Thematic Labels

Table 1 illustrates the top 10 words for each topic derived by both models, along with assigned thematic labels based on manual inspection and cross-validation by aviation domain experts. Their expertise ensured that the extracted topics aligned with real-world operational contexts and safety concerns, improving the reliability of thematic interpretation. While both models extracted topics related to aviation incidents, BERTopic consistently yielded more semantically cohesive and domain-relevant themes (e.g., Bird Strike Investigations, Helicopter Operations, Engine Failures), as confirmed by expert reviewers. In contrast, pLSA occasionally grouped semantically disjointed terms (e.g., “aircraft,” “drugs,” “tug”) under a single topic, which complicated thematic labeling. Figure 2 and Figure 3 further illustrate the top terms per topic for BERTopic and pLSA, respectively.

4.2.2. Visualization and Interpretability

Figure 4 presents a visual representation of the semantic relationships between the topics identified by the BERTopic model. Each topic is depicted as a circle, with its size reflecting the frequency of that topic across the narrative dataset. The horizontal (D1) and vertical (D2) axes represent the two-dimensional space generated by UMAP, a dimensionality reduction technique used to preserve semantic similarity among high-dimensional topic embeddings. In this visualization, topics that appear closer together are more semantically related, while those positioned farther apart are contextually distinct.
This visualization is particularly helpful for identifying clusters of related topics and detecting outliers. For example, a tightly grouped set of topics might indicate a coherent theme, such as in-flight mechanical issues, whereas an isolated topic could point to a distinct narrative theme. Furthermore, the model’s topic reduction capability allows for the merging of overlapping topics, refining the thematic structure without human intervention. These features make the intertopic map an intuitive tool for understanding topic relationships. In contrast, Figure 5 and Figure 6 present the top words for each topic chosen by BERTopic and pLSA, respectively, highlighting the relative importance of each word within its assigned topic. Although pLSA offers statistically tight clusters, BERTopic’s visualization facilitates better exploration and thematic understanding, especially for non-technical users.

4.3. Word Clouds and Topic Distribution

Beyond semantic clustering, this study further employed word cloud visualizations to explore the lexical distribution and salience of topic terms, as shown in Figure 7 and Figure 8. These offer a visual summary of the most prominent terms across topics. BERTopic’s cloud displayed more differentiated and semantically tight terms, further supporting its interpretability advantage. Figure 9 shows the topic distribution from the pLSA model, from which topic dominance and overlap can be inferred; the most prominent words were concentrated in topic 2. BERTopic’s distribution, though not shown here, displayed a more balanced and non-redundant topic separation.

4.4. Evaluation of Model Properties

Table 2 summarizes the observed strengths and weaknesses of the two models. While BERTopic demonstrated flexibility in topic resolution and clarity in thematic visualization, pLSA excelled in computational efficiency and statistical coherence.

4.5. Ablation Study

To further understand the individual contributions of each modeling component, an ablation study was conducted on both pLSA and BERTopic models. The study systematically varied preprocessing steps, topic count, and dimensionality reduction techniques to evaluate their effect on performance metrics and interpretability. Interpretability outcomes from each variant were qualitatively assessed by aviation experts, reinforcing the practical relevance of model adjustments.
For the pLSA model, removing stopword filtering and lemmatization led to a 7.4% drop in coherence, affirming the importance of these preprocessing techniques. Additionally, varying the number of topics from 5 to 20 revealed an optimal range between 9 and 12 topics, beyond which topics became redundant or overly fragmented. Similarly, the use of TF-IDF versus raw term frequency showed negligible impact on perplexity but reduced coherence by 0.06.
In contrast, BERTopic’s performance was more sensitive to changes in the embedding model. When Sentence-BERT embeddings were replaced with TF-IDF vectors, coherence dropped from 0.531 to 0.411. The choice of dimensionality reduction method was also crucial. Using PCA instead of UMAP reduced visual clarity and interpretability in the intertopic distance map, reinforcing the effectiveness of UMAP for semantic clustering [21]. Adjusting the min_topic_size hyperparameter revealed that larger values improved coherence slightly but led to loss of granularity and omitted minority topics, which are essential in aviation safety data.
The higher alignment of BERTopic’s outputs with known aviation patterns, as recognized by expert reviewers, underscores its practical applicability in real-world aviation safety investigations. These findings highlight the nuanced trade-offs between algorithm complexity, interpretability, and statistical performance, reinforcing the need to balance quantitative metrics with domain relevance.

5. Discussion

The integration of domain experts in the manual evaluation process was instrumental in validating the practical relevance and clarity of the extracted topics. Their feedback substantiated the coherence and operational significance of themes such as Bird Strike Investigations and Helicopter Operations, which may not have emerged as salient from a purely statistical standpoint. The results revealed notable differences between the two topic modeling approaches evaluated. While pLSA demonstrated superior quantitative performance, exhibiting higher coherence and lower perplexity, manual inspection revealed limitations in the semantic cohesiveness of its topics. This discrepancy highlights the limitation of relying solely on statistical measures when interpretability and domain relevance are paramount, particularly in high-stakes fields such as aviation safety.
In contrast, BERTopic, despite exhibiting marginally lower coherence scores, produced thematically cohesive and contextually rich topics, including Engine Failures and Fuel Management Issues, which more directly reflect aviation-specific operational phenomena. This enhanced semantic quality can be attributed to BERTopic’s use of transformer-based embeddings in combination with class-based TF-IDF (c-TF-IDF), which facilitates the extraction of context-aware keyword groupings [28]. The model’s architecture enables it to capture intricate relationships among terms that are otherwise lost in traditional bag-of-words-based models such as pLSA. Furthermore, BERTopic’s provision of interactive visualizations, such as intertopic distance maps, enhances its applicability for aviation safety analysts who benefit from interpretable and intuitive representations of clustered incident data.
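The c-TF-IDF weighting can be sketched in a few lines. This is a simplified rendering of Grootendorst's formulation [28], scoring each term as its within-class frequency scaled by log(1 + A / f(t)), where A is the average number of words per class and f(t) is the term's frequency across all classes; the toy corpus is an illustrative assumption:

```python
# Simplified sketch of class-based TF-IDF (c-TF-IDF) following
# Grootendorst's formulation: tf(t, c) * log(1 + A / f(t)).
# The toy corpus in the test usage is an illustrative assumption.
from collections import Counter
from math import log


def c_tf_idf(class_docs):
    """class_docs maps a topic label to the concatenated text of the
    documents assigned to that topic; returns {label: {term: score}}."""
    class_tf = {c: Counter(text.lower().split()) for c, text in class_docs.items()}
    total_tf = Counter()  # f(t): term frequency across all classes
    for tf in class_tf.values():
        total_tf.update(tf)
    avg_words = sum(total_tf.values()) / len(class_tf)  # A: mean words per class
    scores = {}
    for c, tf in class_tf.items():
        n_c = sum(tf.values())
        scores[c] = {t: (f / n_c) * log(1 + avg_words / total_tf[t])
                     for t, f in tf.items()}
    return scores
```

On a toy pair of classes, the distinguishing word of each class receives the top score while words shared across classes (e.g., "runway") are down-weighted, which is why the extracted keyword lists read as class-specific.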
Despite these advantages, the computational simplicity and stronger coherence metrics of pLSA suggest that it may remain a useful tool in scenarios where computational efficiency or model transparency is prioritized. Accordingly, the findings of this study support the potential merit of a hybrid modeling framework that leverages the statistical robustness of pLSA alongside the semantic depth and interpretability of BERTopic. Such an approach may be especially beneficial for aviation safety teams aiming to strike a balance between algorithmic precision and actionable interpretability.
In comparison with the prior literature, our results align with findings by Grootendorst [28], who demonstrated BERTopic’s efficacy in producing coherent and human-interpretable topics, particularly in safety-critical and domain-specific contexts. Similarly, Lau et al. [32] and Blei et al. [33] underscored the limitations of probabilistic models such as LDA and pLSA in capturing nuanced semantic structures without contextual embeddings. The consistency between our results and these prior studies adds credibility to the present evaluation and substantiates the argument for adopting transformer-based approaches in aviation text analytics.
The implications for aviation safety analysis are substantial. The capacity of BERTopic to detect domain-relevant patterns, such as maintenance irregularities and adverse weather impacts, offers safety analysts a powerful tool for extracting latent insights from unstructured narrative reports. Furthermore, the model’s visual interpretability facilitates multidisciplinary engagement across engineering, human factors, and operational safety domains. Ultimately, the integration of context-aware topic modeling into safety management systems (SMSs) can enable more informed risk assessments, proactive safety interventions, and enhanced decision-making processes.
Therefore, while both models demonstrate utility in different operational contexts, the superior interpretability and domain relevance of BERTopic underscore its value as a central component in modern aviation safety analytics. When complemented by the statistical strengths of traditional models such as pLSA, a hybridized approach offers a promising avenue for extracting actionable knowledge from narrative safety data.

Limitations

Several limitations must be acknowledged in this study. First, the analysis is based solely on aviation safety narratives from the ASN, which may limit the generalizability of the findings. While ASN reports are rich in narrative detail, relying exclusively on this source could introduce dataset-specific biases, such as structural patterns, reporting practices, or terminology unique to ASN submissions. To address this, future work will validate the current findings by using additional datasets from other safety boards, including the ATSB and the NTSB. This cross-dataset validation will help assess the robustness and transferability of the topic modeling outcomes across different regional and institutional reporting styles.
Second, topic labels were manually interpreted, which may introduce subjective bias. However, this was mitigated through cross-validation by multiple domain experts to ensure consistency and domain relevance. Additionally, while coherence and perplexity are standard topic modeling metrics, they may not fully capture aviation-specific topic relevance [34].
Third, BERTopic’s reliance on transformer-based embeddings introduces a computational burden, potentially limiting its use in low-resource or real-time environments. Conversely, the pLSA model assumes topic independence and lacks semantic depth due to its non-use of contextual word embeddings. Lastly, this study did not explore downstream applications such as incident classification or risk prediction, which would offer further insight into the practical value of the generated topics.

6. Conclusions

This study conducted a comparative evaluation of two prominent topic modeling techniques, BERTopic and pLSA, applied to aviation safety narratives. The findings indicate that BERTopic consistently outperformed pLSA across multiple dimensions. Quantitative analysis showed that BERTopic achieved greater topic diversity, while maintaining competitive perplexity values. From a qualitative standpoint, domain experts rated BERTopic-generated topics as more interpretable, actionable, and aligned with real-world aviation safety concerns. In contrast, while pLSA demonstrated strong performance on statistical metrics, it often failed to capture semantically coherent themes without contextual embedding. These results highlight the superior capability of transformer-based models for extracting meaningful insights from unstructured aviation texts.
The practical implications of this research are significant. BERTopic’s ability to uncover latent safety themes with high interpretability makes it a valuable tool for proactive safety management, regulatory oversight, and decision support systems. Its visualization capabilities, such as intertopic distance maps, further enhance its utility for non-technical stakeholders.
Future research could expand this work by applying the approach to multilingual or cross-jurisdictional datasets, integrating topic modeling with predictive analytics, or exploring its role in real-time risk classification. Moreover, evaluating newer models, such as Top2Vec or embedding-enhanced LDA variants, and conducting cross-lingual benchmarking across global aviation databases offer promising directions for further innovation [35].
Finally, we envision extending this framework to support incident causality analysis, where topic modeling can help identify underlying systemic factors and precursors to accidents. Such insights are essential for building robust, data-driven aviation safety strategies at both organizational and policy levels.

Author Contributions

A.N.: conceptualization, methodology, software, data curation, validation, writing—original draft preparation, and formal analysis. K.J.: validation, writing—review and editing. U.T.: writing—review and editing. G.W.: data collection, supervision, and final draft. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported with funding from the UNSW Canberra faculty Tuition Fee Scholarship (TFS) scheme.

Data Availability Statement

The data that support the findings of this study are publicly available from the Aviation Safety Network (ASN) at https://aviation-safety.net/ (accessed on 25 December 2024). The dataset includes unstructured narrative reports and associated damage level classifications.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ASN	Aviation Safety Network
ATSB	Australian Transport Safety Bureau
BERTopic	Bidirectional Encoder Representations from Transformers Topic Modeling
DL	Deep Learning
HDBSCAN	Hierarchical Density-Based Spatial Clustering of Applications with Noise
ML	Machine Learning
NLP	Natural Language Processing
NTSB	National Transportation Safety Board
pLSA	Probabilistic Latent Semantic Analysis
TF-IDF	Term Frequency-Inverse Document Frequency
UMAP	Uniform Manifold Approximation and Projection

References

1. Nanyonga, A.; Wild, G. Impact of Dataset Size & Data Source on Aviation Safety Incident Prediction Models with Natural Language Processing. In Proceedings of the 2023 Global Conference on Information Technologies and Communications (GCITC), Bengaluru, India, 1–3 December 2023; IEEE: New York, NY, USA, 2023; pp. 1–7.
2. Nanyonga, A.; Joiner, K.; Turhan, U.; Wild, G. Applications of natural language processing in aviation safety: A review and qualitative analysis. In Proceedings of the AIAA SCITECH 2025 Forum, Orlando, FL, USA, 6–10 January 2025; p. 2153.
3. Gupta, A.; Fatima, H.J.N. Topic modeling in healthcare: A survey study. NeuroQuantology 2022, 20, 6214–6221.
4. Rawat, A.J.; Ghildiyal, S.; Dixit, A.K. Topic Modeling Techniques for Document Clustering and Analysis of Judicial Judgements. Int. J. Eng. Trends Technol. 2022, 70, 163–169.
5. Apishev, M.; Koltcov, S.; Koltsova, O.; Nikolenko, S.; Vorontsov, K. Additive regularization for topic modeling in sociological studies of user-generated texts. In Advances in Computational Intelligence: 15th Mexican International Conference on Artificial Intelligence, MICAI 2016, Cancún, Mexico, October 23–28, 2016, Proceedings, Part I; Springer: Berlin/Heidelberg, Germany, 2017; pp. 169–184.
6. Axelborn, H.; Berggren, J. Topic Modeling for Customer Insights: A Comparative Analysis of LDA and BERTopic in Categorizing Customer Calls. Master’s Thesis, Umeå University, Umeå, Sweden, 2023.
7. Nanyonga, A.; Joiner, K.; Turhan, U.; Wild, G. Does the Choice of Topic Modeling Technique Impact the Interpretation of Aviation Incident Reports? A Methodological Assessment. Technologies 2025, 13, 209.
8. Hofmann, T. Probabilistic latent semantic analysis. In Proceedings of the UAI, Stockholm, Sweden, 30 July–1 August 1999; pp. 289–296.
9. Mu, Y.; Dong, C.; Bontcheva, K.; Song, X. Large Language Models Offer an Alternative to the Traditional Approach of Topic Modelling. arXiv 2024, arXiv:2403.16248.
10. dos Santos, J.A.; Syed, T.I.; Naldi, M.C.; Campello, R.J.; Sander, J. Hierarchical density-based clustering using MapReduce. IEEE Trans. Big Data 2019, 7, 102–114.
11. Masseroli, M.; Chicco, D.; Pinoli, P. Probabilistic latent semantic analysis for prediction of gene ontology annotations. In Proceedings of the 2012 International Joint Conference on Neural Networks (IJCNN), Brisbane, Australia, 10–15 June 2012; IEEE: New York, NY, USA, 2012; pp. 1–8.
12. Wahabzada, M.; Kersting, K. Larger residuals, less work: Active document scheduling for latent Dirichlet allocation. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2011, Athens, Greece, 5–9 September 2011, Proceedings, Part III; Springer: Berlin/Heidelberg, Germany, 2011; pp. 475–490.
13. Nanyonga, A.; Wild, G. Analyzing Aviation Safety Narratives with LDA, NMF and PLSA: A Case Study Using Socrata Datasets. arXiv 2025, arXiv:2501.01690.
14. Rusakovica, J.; Hallinan, J.; Wipat, A.; Zuliani, P.J. Probabilistic latent semantic analysis applied to whole bacterial genomes identifies common genomic features. J. Integr. Bioinform. 2014, 11, 93–105.
15. La Rosa, M.; Fiannaca, A.; Rizzo, R.; Urso, A. Probabilistic topic modeling for the analysis and classification of genomic sequences. BMC Bioinform. 2015, 16, S2.
16. Dumais, S.T. LSA and information retrieval: Getting back to basics. In Handbook of Latent Semantic Analysis; Psychology Press: London, UK, 2007; pp. 305–334.
17. Albanese, N.C. Topic Modeling with LSA, pLSA, LDA, NMF, BERTopic, Top2Vec: A Comparison. Towards Data Sci. 2022, 19.
18. Xu, S.; Wang, Y.; Cheng, X.; Yang, Q. Thematic Identification Analysis of Equipment Quality Problems Based on the BERTopic Model. In Proceedings of the 2024 6th Management Science Informatization and Economic Innovation Development Conference (MSIEID 2024), Guangzhou, China, 6–8 December 2024; Atlantis Press: Dordrecht, The Netherlands, 2025; pp. 484–491.
19. Sibitenda, H.; Diattara, A.; Traore, A.; Hu, R.; Zhang, D.; Rundensteiner, E.; Ba, C. Extracting Semantic Topics about Development in Africa from Social Media. IEEE Access 2024, 12, 142343–142359.
20. Nanyonga, A.; Wasswa, H.; Turhan, U.; Joiner, K.; Wild, G. Comparative Analysis of Topic Modeling Techniques on ATSB Text Narratives Using Natural Language Processing. In Proceedings of the 2024 3rd International Conference for Innovation in Technology (INOCON), Bangalore, India, 1–3 March 2024; IEEE: New York, NY, USA, 2024; pp. 1–7.
21. McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv 2018, arXiv:1802.03426.
22. Kim, Y.; Kim, H. An Analysis of Research Trends on the Metaverse Using BERTopic Modeling. Int. J. Contents 2023, 19, 61–72.
23. Chen, W.; Rabhi, F.; Liao, W.; Al-Qudah, I. Leveraging state-of-the-art topic modeling for news impact analysis on financial markets: A comparative study. Electronics 2023, 12, 2605.
24. Egger, R.; Yu, J. A topic modeling comparison between LDA, NMF, Top2Vec, and BERTopic to demystify Twitter posts. Front. Sociol. 2022, 7, 886498.
25. Nanyonga, A.; Wasswa, H.; Turhan, U.; Joiner, K.; Wild, G. Exploring Aviation Incident Narratives Using Topic Modeling and Clustering Techniques. In Proceedings of the 2024 IEEE Region 10 Symposium (TENSYMP), New Delhi, India, 27–29 September 2024; IEEE: New York, NY, USA, 2024; pp. 1–6.
26. Agovic, A.; Shan, H.; Banerjee, A. Analyzing Aviation Safety Reports: From Topic Modeling to Scalable Multi-Label Classification. In Proceedings of the CIDU, Mountain View, CA, USA, 5–6 October 2010; Citeseer: Princeton, NJ, USA, 2010; pp. 83–97.
27. Gefen, D.; Endicott, J.E.; Fresneda, J.E.; Miller, J.; Larsen, K.R. A guide to text analysis with latent semantic analysis in R with annotated code: Studying online reviews and the Stack Exchange community. Commun. Assoc. Inf. Syst. 2017, 41, 21.
28. Grootendorst, M. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv 2022, arXiv:2203.05794.
29. Ibraimoh, R.; Debrah, K.O.; Nwambuonwo, E. Developing & Comparing Various Topic Modeling Algorithms on a Stack Overflow Dataset. IRE J. 2024, 8, 243–253.
30. Deb, S.; Chanda, A.K. Comparative analysis of contextual and context-free embeddings in disaster prediction from Twitter data. Mach. Learn. Appl. 2022, 7, 100253.
31. Hoyle, A.; Goel, P.; Hian-Cheong, A.; Peskov, D.; Boyd-Graber, J.; Resnik, P. Is automated topic model evaluation broken? The incoherence of coherence. Adv. Neural Inf. Process. Syst. 2021, 34, 2018–2033.
32. Lau, J.H.; Newman, D.; Baldwin, T. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Gothenburg, Sweden, 26–30 April 2014; pp. 530–539.
33. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
34. Röder, M.; Both, A.; Hinneburg, A. Exploring the space of topic coherence measures. In Proceedings of the 8th ACM International Conference on Web Search and Data Mining, Shanghai, China, 31 January–6 February 2015; pp. 399–408.
35. Angelov, D. Top2Vec: Distributed representations of topics. arXiv 2020, arXiv:2008.09470.
Figure 1. Methodological framework.
Figure 2. Top words for each topic chosen by the pLSA model.
Figure 3. Top words for each topic chosen by the BERTopic model.
Figure 4. BERTopic’s interactive intertopic distance map.
Figure 5. Top words for each topic chosen by BERTopic.
Figure 6. Top words for each topic chosen by PLSA.
Figure 7. Word cloud for BERTopic.
Figure 8. Word cloud for pLSA.
Figure 9. Topic word score for pLSA.
Table 1. Top words from BERTopic and pLSA for each topic, and their associated theme.

Topic | BERTopic Top 10 Words | pLSA Top 10 Words | Theme/Single Word
0 | caravan, grand, cessna, forced, simikot, near, airstrip, impacted, terrain, pilot | aircraft, flight, pilot, feet, crew, ft, runway, approach, landing, Airport | Small Aircraft and Flight Basics
1 | illegal, venezuelan, drugs, venezuela, mexican, mexico, colombian, jet, guatemala, xb | airplane, aircraft, landing, flight, pilot, engine, left, runway, crew, drugs | Drug Trafficking and Flight Landing
2 | otter, twin, servo, elevator, nancova, tourmente, dq, ononge, hinge, col | aircraft, crew, flight, feet, airplane, right, wing, damage, landing, San | Aircraft Parts and Flight Damage
3 | fire, smoke, extinguished, parked, fireball, cargo, bottles, heat, emanating, rescue | fire, aircraft, runway, flight, airplane, plane, engine, Airport, landing, right | Fire Incident and Flight
4 | caught, fire, canadair, erupted, repair, huatulco, hockey, arson, forced, providence | aircraft, landing, flight, runway, gear, Airport, pilot, left, crew, right | Fire Event and Landing Gear
5 | tornado, blown, substantially, tune, storm, hangered, nashville, damaged, tennessee, struck | runway, aircraft, flight, Airport, landing, pilot, airplane, right, left, damage | Tornado Damage and Runway Incident
6 | learjet, paso, toluca, mateo, olbia, iwakuni, mexico, cancn, vor, michelena | gear, landing, main, crew, aircraft, flight, left, runway, pilot, Airport | Jet and Airports and Gear and Emergency
7 | bird, birds, flock, strike, windshield, geese, remains, roskilde, spar, multiple | aircraft, Airport, Air, flight, runway, approach, airplane, crew, accident, crashed | Bird Strike and Accidents
8 | havana, cuba, bogot, medelln, rionegro, permission, carreo, tulcn, haiti, lamia | runway, airplane, flight, pilot, crew, left, landing, aircraft, right, approach | Latin America and Runway and Flight
9 | medan, tower, supervisor, acted, rendani, indonesia, pk, controller, ende, jalaluddin | aircraft, right, tornado, wing, parked, hangar, crew, hand, left, landing | Air Traffic Control and Tornado Damage
Table 2. Comparison of model strengths and weaknesses.

Property | BERTopic | pLSA
Interpretability | High | Moderate
Coherence | 0.531 | 0.7634
Perplexity | −4.532 | −4.6237
Granularity Control | Adjustable | Fixed
Computational Cost | Higher | Lower
Visualization | Strong (via UMAP) | Limited
