Semantic Topic Modeling of Aviation Safety Reports: A Comparative Analysis Using BERTopic and PLSA
Abstract
1. Introduction
2. Related Work
3. Materials and Methods
3.1. Data Collection
3.2. Data Processing
3.3. Topic Modeling Procedure
3.4. Evaluation Metrics
3.5. Experimental Setup
4. Results
4.1. Model Performance Metrics
4.2. Comparative Topic Analysis
4.2.1. Topic Words and Thematic Labels
4.2.2. Visualization and Interpretability
4.3. Word Clouds and Topic Distribution
4.4. Evaluation of Model Properties
4.5. Ablation Study
5. Discussion
Limitations
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
ASN | Aviation Safety Network |
ATSB | Australian Transport Safety Bureau |
BERTopic | Bidirectional Encoder Representations from Transformers Topic Modeling |
DL | Deep Learning |
HDBSCAN | Hierarchical Density-Based Spatial Clustering of Applications with Noise |
ML | Machine Learning |
NLP | Natural Language Processing |
NTSB | National Transportation Safety Board |
pLSA | Probabilistic Latent Semantic Analysis |
TF-IDF | Term Frequency-Inverse Document Frequency |
UMAP | Uniform Manifold Approximation and Projection |
Topic | BERTopic Top 10 Words | pLSA Top 10 Words | Theme/Single Word |
---|---|---|---|
0 | caravan, grand, cessna, forced, simikot, near, airstrip, impacted, terrain, pilot | aircraft, flight, pilot, feet, crew, ft, runway, approach, landing, Airport | Small Aircraft and Flight Basics |
1 | illegal, venezuelan, drugs, venezuela, mexican, mexico, colombian, jet, guatemala, xb | airplane, aircraft, landing, flight, pilot, engine, left, runway, crew, drugs | Drug Trafficking and Flight Landing |
2 | otter, twin, servo, elevator, nancova, tourmente, dq, ononge, hinge, col | aircraft, crew, flight, feet, airplane, right, wing, damage, landing, San | Aircraft Parts and Flight Damage |
3 | fire, smoke, extinguished, parked, fireball, cargo, bottles, heat, emanating, rescue | fire, aircraft, runway, flight, airplane, plane, engine, Airport, landing, right | Fire Incident and Flight |
4 | caught, fire, canadair, erupted, repair, huatulco, hockey, arson, forced, providence | aircraft, landing, flight, runway, gear, Airport, pilot, left, crew, right | Fire Event and Landing Gear |
5 | tornado, blown, substantially, tune, storm, hangered, nashville, damaged, tennessee, struck | runway, aircraft, flight, Airport, landing, pilot, airplane, right, left, damage | Tornado Damage and Runway Incident |
6 | learjet, paso, toluca, mateo, olbia, iwakuni, mexico, cancn, vor, michelena | gear, landing, main, crew, aircraft, flight, left, runway, pilot, Airport | Jet and Airports and Gear and Emergency |
7 | bird, birds, flock, strike, windshield, geese, remains, roskilde, spar, multiple | aircraft, Airport, Air, flight, runway, approach, airplane, crew, accident, crashed | Bird Strike and Accidents |
8 | havana, cuba, bogot, medelln, rionegro, permission, carreo, tulcn, haiti, lamia | runway, airplane, flight, pilot, crew, left, landing, aircraft, right, approach | Latin America and Runway and Flight |
9 | medan, tower, supervisor, acted, rendani, indonesia, pk, controller, ende, jalaluddin | aircraft, right, tornado, wing, parked, hangar, crew, hand, left, landing | Air Traffic Control and Tornado Damage |
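The contrast in the table above, where the BERTopic column surfaces specific terms (for example, bird, flock, geese) while the pLSA column is dominated by generic ones (aircraft, flight, runway), is consistent with BERTopic ranking the words of each cluster by a class-based TF-IDF weight. The following is a minimal sketch of that weighting, assuming the formulation W(t, c) = tf(t, c) · log(1 + A / f(t)) given in the BERTopic paper, with raw counts used for tf. The function name `ctfidf_top_words` and the toy narratives are hypothetical and are not drawn from the study's code or data.

```python
import math
from collections import Counter


def ctfidf_top_words(docs_by_topic, top_n=3):
    """Rank words per topic with a class-based TF-IDF weight.

    docs_by_topic: dict mapping a topic id to a list of tokenized documents.
    Returns a dict mapping each topic id to its top_n words.
    """
    # Treat every topic as one long "class document" and count its terms.
    class_counts = {
        topic: Counter(tok for doc in docs for tok in doc)
        for topic, docs in docs_by_topic.items()
    }
    # f(t): frequency of each term across all classes.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    # A: average number of tokens per class.
    avg_words = sum(total_counts.values()) / len(class_counts)

    top_words = {}
    for topic, counts in class_counts.items():
        weights = {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        # Sort by descending weight; break ties alphabetically for stability.
        ranked = sorted(weights, key=lambda t: (-weights[t], t))
        top_words[topic] = ranked[:top_n]
    return top_words


# Toy, made-up narratives (hypothetical; not from the study's data set).
toy = {
    0: [["aircraft", "bird", "strike", "windshield"],
        ["aircraft", "bird", "flock", "strike"]],
    1: [["aircraft", "fire", "smoke", "cargo"],
        ["aircraft", "fire", "extinguished", "smoke"]],
}
result = ctfidf_top_words(toy, top_n=2)
print(result)
```

In this toy example the word "aircraft" is as frequent within each topic as the topic-specific words, but because it appears in every class its weight is reduced and it drops out of the top two.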
Property | BERTopic | pLSA |
---|---|---|
Interpretability | High | Moderate |
Coherence | 0.531 | 0.7634 |
Perplexity | −4.532 | −4.6237 |
Granularity Control | Adjustable | Fixed |
Computational Cost | Higher | Lower |
Visualization | Strong (via UMAP) | Limited |
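To make the Coherence row above more concrete, the following sketch shows one common way of scoring how well a topic's top words belong together: the average normalized pointwise mutual information (NPMI) over all word pairs, estimated from document-level co-occurrence. This is an illustration only. The coherence measure actually used in the study is not stated in this section, so the values produced here are not comparable to the reported 0.531 and 0.7634; the function name `npmi_coherence` and the toy documents are hypothetical.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, documents):
    """Average NPMI over all pairs of a topic's top words.

    top_words: list of at least two words describing one topic.
    documents: list of tokenized documents used as the reference corpus.
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all of the given words.
        hits = sum(1 for d in doc_sets if all(w in d for w in words))
        return hits / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            # The pair never co-occurs: minimum NPMI by convention.
            scores.append(-1.0)
        elif p12 == 1:
            # Both words occur in every document, so NPMI is 0/0;
            # scoring it as 0.0 is an arbitrary choice of this sketch.
            scores.append(0.0)
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Toy, made-up narratives (hypothetical; not from the study's data set).
docs = [
    ["bird", "strike", "windshield"],
    ["bird", "flock", "strike"],
    ["fire", "smoke", "cargo"],
    ["fire", "smoke", "extinguished"],
]
coherent = npmi_coherence(["bird", "strike"], docs)  # words always co-occur
mixed = npmi_coherence(["bird", "fire"], docs)       # words never co-occur
print(coherent, mixed)
```

Scores lie between −1 and 1, with higher values indicating that a topic's top words tend to appear in the same reports.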
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Nanyonga, A.; Joiner, K.; Turhan, U.; Wild, G. Semantic Topic Modeling of Aviation Safety Reports: A Comparative Analysis Using BERTopic and PLSA. Aerospace 2025, 12, 551. https://doi.org/10.3390/aerospace12060551