Informatics
  • Article
  • Open Access

30 January 2026

Exploring Scientific Literature Using Topic Modeling: A Practical Framework for Discovery and Classification

1 Department of Statistics and Data Science, University of Central Florida, Orlando, FL 32816, USA
2 National Center for Forensic Science, University of Central Florida, P.O. Box 162367, Orlando, FL 32816-2367, USA
3 Department of Chemistry, University of Central Florida, Orlando, FL 32816, USA
* Author to whom correspondence should be addressed.
This article belongs to the Section Big Data Mining and Analytics

Abstract

The increasing volume and diversity of scientific publications pose challenges for scalable and interpretable topic discovery and automated document categorization. This study proposes an integrated framework that combines probabilistic topic modeling with supervised classification to support large-scale scientific literature analysis. Using 3689 abstracts from the Journal of Forensic Sciences (2009–2022), Latent Dirichlet Allocation (LDA) is applied to uncover latent thematic structures, assess topic diagnosticity across forensic disciplines, and analyze temporal research trends. Bayesian model selection with repeated resampling identifies a stable topic resolution, with the number of topics T in the range 83–88, yielding semantically coherent and discipline-aligned topics. The resulting document–topic representations are then used for supervised abstract classification. Across multiple models and resampling scenarios, the strongest and most stable performance is achieved under a Grouped Category configuration. In particular, XGBoost attains an Accuracy of 0.754 and a Macro-averaged F1 score of 0.737 at T = 88, with comparable results at neighboring topic counts, indicating robustness to topic granularity. Overall, the proposed framework provides a reproducible, interpretable, and computationally efficient pipeline for literature organization, trend analysis, and metadata enhancement in scientific domains.
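The abstract reports both Accuracy and a Macro-averaged F1 score. As an illustration of how these two metrics differ, the following minimal sketch computes each from a list of true and predicted labels. It uses the standard definitions and is not taken from the article; the function names and the toy category labels ("dna", "tox") are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores.

    Each class counts equally regardless of how many documents it has,
    so errors on small classes lower the score more than they lower accuracy.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels: four abstracts in one category, two in another.
y_true = ["dna", "dna", "dna", "dna", "tox", "tox"]
y_pred = ["dna", "dna", "dna", "dna", "dna", "tox"]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

In this toy case a single error on the smaller class leaves accuracy at 5/6 while macro F1 drops to 7/9, which is why the two figures are usually reported together when category sizes are uneven.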
