Exploring Scientific Literature Using Topic Modeling: A Practical Framework for Discovery and Classification

Amir Alipour Yengejeh; Larry Tang; Candice M. Bridge; Chandra Kundu

doi:10.3390/informatics13020024

¹ Department of Statistics and Data Science, University of Central Florida, Orlando, FL 32816, USA

² National Center for Forensic Science, University of Central Florida, P.O. Box 162367, Orlando, FL 32816-2367, USA

³ Department of Chemistry, University of Central Florida, Orlando, FL 32816, USA

* Author to whom correspondence should be addressed.

Informatics 2026, 13(2), 24; https://doi.org/10.3390/informatics13020024

This article belongs to the Section Big Data Mining and Analytics


Abstract

The increasing volume and diversity of scientific publications pose challenges for scalable and interpretable topic discovery and automated document categorization. This study proposes an integrated framework that combines probabilistic topic modeling with supervised classification to support large-scale scientific literature analysis. Using 3689 abstracts from the Journal of Forensic Sciences (2009–2022), Latent Dirichlet Allocation (LDA) is applied to uncover latent thematic structures, assess topic diagnosticity across forensic disciplines, and analyze temporal research trends. Bayesian model selection with repeated resampling identifies a stable topic resolution, with the number of topics T lying in the range 83–88, yielding semantically coherent and discipline-aligned topics. The resulting document–topic representations are then used for supervised abstract classification. Across multiple models and resampling scenarios, the strongest and most stable performance is achieved under a Grouped Category configuration. In particular, XGBoost attains an Accuracy of 0.754 and a Macro-averaged F1 score of 0.737 at T = 88, with comparable results at neighboring topic counts, indicating robustness to topic granularity. Overall, the proposed framework provides a reproducible, interpretable, and computationally efficient pipeline for literature organization, trend analysis, and metadata enhancement in scientific domains.

Keywords: topic modeling; latent Dirichlet allocation; scientific literature analysis; text classification; supervised learning; temporal trend analysis; metadata enrichment; forensic science
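The abstract reports classification quality as Accuracy and Macro-averaged F1. As a reading aid, the following minimal sketch shows the standard definitions of these two metrics in plain Python. It is not the authors' code, and the paper's abstract does not state which implementation was used. The function names and the toy category labels ("dna", "tox", "fire") are hypothetical and chosen only for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1   # correct prediction for class t
        else:
            fp[p] += 1   # class p was predicted but is wrong
            fn[t] += 1   # class t was the truth but was missed
    labels = sorted(set(y_true) | set(y_pred))
    # Per-class F1 = 2*TP / (2*TP + FP + FN). The denominator is never zero
    # because every label here occurs at least once in y_true or y_pred.
    scores = [2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]) for c in labels]
    return sum(scores) / len(scores)


# Hypothetical toy labels, not data from the paper.
y_true = ["dna", "dna", "tox", "tox", "fire", "fire"]
y_pred = ["dna", "tox", "tox", "tox", "fire", "dna"]

print(round(accuracy(y_true, y_pred), 3))  # 0.667
print(round(macro_f1(y_true, y_pred), 3))  # 0.656
```

Because the macro average weights every class equally, it falls below accuracy when small or difficult categories are classified poorly, which is why the two numbers are usually reported together for imbalanced label sets.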
