SemaTopic: A Framework for Semantic-Adaptive Probabilistic Topic Modeling

Drissi, Amani; Sassi, Salma; Chbeir, Richard; Tissaoui, Anis; Jemai, Abderrazek

doi:10.3390/computers14090400


by Amani Drissi ^1,2, Salma Sassi ^3,4, Richard Chbeir ^2,*, Anis Tissaoui ^4 and Abderrazek Jemai ^5

^1 SERCOM Laboratory, Faculty of Sciences of Tunis (FST), University of Tunis El Manar, Tunis 2092, Tunisia
^2 Computer Science Research Laboratory of the University of Pau & Adour Countries (LIUPPA), Université de Pau et des Pays de l’Adour, Avenue de l’Université, BP 576, 64012 Pau, France
^3 Department of Computer Engineering, Central School of Electronics, 6 rue Bonnefoi, 69003 Lyon, France
^4 Valorization of Natural and Cultural Heritage (VPNC) Laboratory, Faculty of Legal, Economic and Management Sciences of Jendouba (FSJEGJ), University of Jendouba, Av. de l’UMA, Jendouba 8189, Tunisia
^5 SERCOM Laboratory, National Institute of Applied Science and Technology (INSAT), University of Carthage, Av. de la République, Carthage 1054, Tunisia
^* Author to whom correspondence should be addressed.

Computers 2025, 14(9), 400; https://doi.org/10.3390/computers14090400

Submission received: 18 July 2025 / Revised: 22 August 2025 / Accepted: 16 September 2025 / Published: 19 September 2025

(This article belongs to the Special Issue Advances in Semantic Multimedia and Personalized Digital Content)


Abstract

Topic modeling is a crucial technique for Natural Language Processing (NLP) which helps to automatically uncover coherent topics from large-scale text corpora. Yet, classic methods tend to suffer from poor semantic depth and topic coherence. In this regard, we present here a new approach “SemaTopic” to improve the quality and interpretability of discovered topics. By exploiting semantic understanding and stronger clustering dynamics, our approach results in a more continuous, finer and more stable representation of the topics. Experimental results demonstrate that SemaTopic achieves a relative gain of +6.2% in semantic coherence compared to BERTopic on the 20 Newsgroups dataset (C_v = 0.5315 vs. 0.5004), while maintaining stable performance across heterogeneous and multilingual corpora. These findings highlight “SemaTopic” as a scalable and reliable solution for practical text mining and knowledge discovery.

Keywords: coherent topics; large language models; topic modeling; document classification; machine learning; LDA; semantic representation; natural language processing

1. Introduction

In the modern age of digital communication, there is an increasing amount of textual data being generated across platforms, creating a pressing need for advanced Natural Language Processing (NLP) techniques to process and extract valuable information from unstructured content efficiently. Among these, topic modeling is one of the fundamental techniques for discovering latent thematic patterns from a large-scale text corpus. One of its benefits is automatically discovering and labeling latent (hidden) topics which in turn support different applications such as document classification, intelligent information retrieval, automatic summarization, and sentiment analysis. However, the usefulness of these applications is strongly related to whether the generated topics are interpretable, which could be evaluated based on the semantic coherence of the extracted topics and their understanding by humans.

Conventional topic modeling methods, particularly Latent Dirichlet Allocation (LDA) [1], have been widely adopted due to their probabilistic nature, which models documents as mixtures of hidden topics. Its intuitive formulation and computational efficiency have made LDA a standard tool for unsupervised text analysis. However, despite its popularity, LDA suffers from significant limitations in real-world applications. At its core, it relies on word co-occurrence counts under the “bag-of-words” assumption, which ignores the relationships between words, their semantic meanings, and syntactic context [2,3,4]. Consequently, words are treated as independent units, disconnected from the grammatical or discourse-level context in which they occur. As a result, top-ranked terms within a topic are often only loosely related and lack semantic coherence, which makes human interpretation difficult. This restriction significantly reduces the model’s effectiveness in downstream tasks, such as document classification, summarization, and semantic indexing, which require clear and meaningful topic representations. In practice, users frequently encounter topics filled with statistically co-occurring but semantically fragmented terms, which obscure the underlying themes and make the model’s output less actionable. In addition, interpretability of topic models is not just a desirable property but a necessity, especially in domains such as healthcare, finance, and law, where decisions are based on these outcomes. Lack of interpretability not only damages user trust but also restricts the practical application of topic modeling systems. While LDA [1] is still computationally attractive, its underlying assumption does not allow it to model the complex and context-sensitive nature of language [4]. This highlights the need for advanced models that can incorporate more complex semantic structures to produce topics that are not only statistically sound but also meaningful and comprehensible to human users.
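The effect of the bag-of-words assumption can be illustrated with a minimal sketch. The tokenizer (lowercasing and whitespace splitting) and the two example sentences are assumptions chosen for illustration; they are not taken from the article.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase whitespace-separated tokens, discarding word order."""
    return Counter(text.lower().split())


# Two sentences with opposite meanings but identical word counts.
a = bag_of_words("the dog bites the man")
b = bag_of_words("the man bites the dog")

print(a == b)  # True: the two representations are indistinguishable
```

Because a count-based model only ever sees these counts, any information carried by word order or syntactic role is unavailable to it.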

To address the structural and semantic limitations of traditional topic modeling methods, current Natural Language Processing (NLP) has evolved towards embedding-based approaches that leverage pre-trained deep learning models to generate context-aware text representations. This evolution has significantly improved topic models, making them more interpretable and of higher quality by integrating richer semantic structures into the modeling process. Models like BERT (Bidirectional Encoder Representations from Transformers) [5], RoBERTa [6], and SBERT (Sentence-BERT) [7] offer contextualized embeddings that capture complex syntactic and semantic relationships. Unlike static embeddings [8], these models dynamically adjust word representations based on their context, which allows polysemous words to be disambiguated and discursive meaning to be encoded more accurately. As a result, embedding-based topic models improve both the semantic coherence of discovered topics and the performance of downstream tasks such as classification, clustering, and content recommendation, marking a major advancement in human-centered text analysis. The incorporation of these deeper semantic features has markedly advanced topic modeling for extracting semantically coherent and enriched topics.

Although embedding-based and contextualized topic modeling approaches (such as BERTopic [3] and Top2Vec [9]) are more powerful than traditional probabilistic models, they also have their own limitations. Despite their strength in learning contextual semantics, these models still lack a probabilistic framework that explicitly describes the relationships among documents, topics, and words. This restricts their capacity to generate interpretable outputs such as document–topic and topic–word distributions, which are essential to clarify how topics are generated and how they correlate with content. Thus, the theoretical foundation and analytical transparency of the obtained topics are still limited. Furthermore, these models are often very sensitive to the choice of hyperparameters for clustering and dimensionality reduction, which can lead to variability in topic coherence and instability between different runs. Although c-TF-IDF improves topic term extraction in BERTopic [3] by integrating cluster-level frequency distributions, it is still frequency-based and does not consider semantic relationships or dependencies between keywords. This limitation results in the presence of statistically significant but semantically incoherent words in extracted topics, and may decrease their overall semantic coherence and interpretability in semantically challenging or ambiguous domains. In this paper, we introduce SemaTopic, a novel hybrid framework that combines contextual embeddings, semantic clustering, and a coherence-optimized probabilistic model to address the limitations of traditional and embedding-based topic models. Our proposed approach improves the ability of topic models to integrate semantic structure with probabilistic interpretability, showing consistent improvements in coherence and stability over competitive baselines, while remaining robust across different domains. By combining semantic clustering with coherence-based topic modeling, our proposed framework is able to go beyond surface-level word co-occurrence patterns to create semantically meaningful, context-aware, and structurally coherent representations of topics. This richer integration at the semantic level enables SemaTopic to better capture the underlying discourse structure in the extracted topics, which are not only more distinctive and interpretable, but also resilient to domain change. SemaTopic’s layered architecture allows for accurate and reliable topic extraction, which makes it well suited for practical natural language understanding applications.

Unlike prior models, SemaTopic incorporates an automated coherence-driven hyperparameter tuning mechanism that systematically searches over (α, β, K) to maximize semantic coherence. This avoids manual trial-and-error, ensures reproducibility, and selects the highest-coherence parameter settings found for each cluster, thereby improving both topic quality and stability. In addition, we conduct a comprehensive evaluation across multiple datasets of varying domains and sizes, compare against several strong baselines (including BERTopic [3], LDA+ [10], Top2Vec [9], ETM [11] and LDA2Vec [12]), and complement automatic metrics with human evaluation to validate interpretability. An analysis of execution time per pipeline stage is also provided to assess scalability and computational trade-offs.
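A coherence-driven search over (α, β, K) can be sketched as follows. This is a minimal illustration under stated assumptions, not the SemaTopic implementation: an exhaustive grid is assumed as the search strategy, and the function `toy_score` is a hypothetical stand-in for the expensive step of training the probabilistic model on one cluster and returning its coherence.

```python
from itertools import product
from typing import Callable, Iterable, Tuple

Params = Tuple[float, float, int]  # (alpha, beta, K)


def coherence_grid_search(
    score: Callable[[float, float, int], float],
    alphas: Iterable[float],
    betas: Iterable[float],
    topic_counts: Iterable[int],
) -> Tuple[Params, float]:
    """Return the (alpha, beta, K) triple with the highest score, and that score."""
    best_params, best_score = None, float("-inf")
    for alpha, beta, k in product(alphas, betas, topic_counts):
        s = score(alpha, beta, k)
        if s > best_score:
            best_params, best_score = (alpha, beta, k), s
    if best_params is None:
        raise ValueError("the search grid is empty")
    return best_params, best_score


def toy_score(alpha: float, beta: float, k: int) -> float:
    """Hypothetical stand-in for 'train the model on one cluster, return its coherence'."""
    return -((alpha - 0.1) ** 2 + (beta - 0.01) ** 2 + (k - 10) ** 2 / 100)


best, value = coherence_grid_search(
    toy_score, alphas=[0.01, 0.1, 1.0], betas=[0.01, 0.1], topic_counts=[5, 10, 20]
)
print(best)  # (0.1, 0.01, 10)
```

Running the search once per cluster, with a fixed grid and fixed random seeds in the scoring function, is what makes the selected settings reproducible. The cost grows with the product of the three grid sizes.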

To achieve our objectives, we address the following challenges:

How can context-sensitive semantic representation be integrated into topic models to generate more coherent, interpretable, and semantically meaningful topics?
How can topic stability be ensured when the model is applied to evolving or partially modified corpora?

The rest of the paper is organized as follows. Section 2 reviews existing topic modeling methods and highlights their limitations. Section 3 details our proposed SemaTopic methodology, including document embedding generation, clustering, and topic extraction. Section 4 describes the experiments conducted to validate our approach. Finally, Section 5 concludes the paper and outlines future directions.

2. Related Work

The automated classification and topical structuring of text corpora has long been a central problem in Natural Language Processing (NLP). This has driven the development of diverse methodologies, each with specific advantages and limitations. Traditional probabilistic models, particularly Latent Dirichlet Allocation (LDA), are the most widely adopted for identifying hidden thematic structures in corpora of textual documents [1]. However, these models rely primarily on word co-occurrence statistics, and they often disregard linguistic nuances such as syntax and context. As a result, they frequently produce ambiguous, semantically weak topics, which reduces their interpretability and usefulness in real applications. In response to these limitations, recent studies increasingly integrate semantics into topic models in order to enhance the semantic coherence, interpretability, and relevance of the discovered topics. This has produced a number of approaches that go beyond simple word co-occurrence and use richer representations, such as word embeddings, context-sensitive models, graph-based methods, and neural methods, to better represent meaning in text. To provide a clearer analysis of the state-of-the-art studies, we organize the existing approaches into four main categories: (1) static embedding-based approaches, (2) contextual embedding-based approaches, (3) graph-based approaches and (4) neural-based approaches.

This categorization helps to demonstrate how different approaches surpass conventional models, simplifying comprehension of their particular objectives, advantages, and constraints in generating more cohesive and significant topics.

Static Embedding-based Approaches: Static word embeddings were introduced as an improvement over the bag-of-words (BoW) representation. Methods such as Word2Vec [8] and GloVe [13] learn low-dimensional, continuous vectors for words from large corpora by exploiting distributional statistics. The resulting fixed-length embeddings place semantically related words close to one another in the vector space, providing a compact representation of lexical meaning beyond sparse co-occurrence counts.
To enhance the semantic quality of text representations and improve topic modeling, several approaches have incorporated static embeddings into probabilistic frameworks, particularly Latent Dirichlet Allocation (LDA). A notable example is LDA2Vec [12], which integrates Word2Vec embeddings into the LDA architecture. This hybrid model retains the probabilistic structure of LDA while leveraging distributional semantics to enhance topic cohesion. By aligning topic inference with word-level semantic proximity, it produces more coherent and interpretable topic representations, characterized by reduced lexical noise and more precise grouping of related terms. Consequently, the extracted latent themes are finer-grained, bridging statistical modeling with semantic awareness.
The Embedded Topic Model (ETM), proposed in [11], provides a unified framework for jointly learning word and topic embeddings within the same semantic space. This integration reinforces the coherence and quality of extracted topics, particularly in sparse or short-text corpora. ETM strikes a balance between probabilistic modeling and semantic interpretation by constraining topic–word distributions to remain consistent with continuous word representations, thereby enhancing interpretability and semantic expressiveness.
Although static embedding-based approaches such as LDA2Vec [12] and the Embedded Topic Model (ETM) [11] offer improvements over traditional bag-of-words models, they remain unable to capture context-dependent word meanings. Each word is assigned a single fixed vector regardless of usage, which causes polysemous words to be represented identically across contexts and leads to ambiguous semantic representations. For topic modeling, where precise clustering of semantically related terms is crucial, this context insensitivity poses a serious limitation to interpretability. Consequently, topics may include words that are statistically similar in the embedding space but unrelated in specific documents. Moreover, static embeddings are not adapted to domain-specific usage or linguistic variation, making them less robust for dynamic or heterogeneous corpora. Thus, despite the advances of LDA2Vec [12] and ETM [11], the reliance on static embeddings ultimately restricts their semantic expressiveness and applicability.
The authors of [14] propose a topic extraction method in which k-means [15] is directly applied to word vectors, followed by document-weighted refinement and PCA-based dimensionality reduction [16], thereby bypassing the complexity of generative models. This approach achieves faster execution and improved scalability, yielding semantic coherence scores comparable to traditional models such as LDA [1], and is particularly well-suited for resource-limited scenarios. However, it suffers from key drawbacks: it lacks a probabilistic foundation and does not provide interpretable document–topic or topic–word distributions, which are often required in downstream applications. Moreover, its reliance on static embeddings prevents it from disambiguating polysemous words or adapting to complex linguistic variation. Thus, while computationally attractive, the method still falls short of the semantic flexibility and depth offered by contextual or neural topic modeling frameworks.
The authors of [17] propose an approach to enhance the semantic quality of classical topic models by incorporating word representations derived from static embeddings. By enriching the bag-of-words input with dense semantic vectors, the model learns more accurate word similarities and improves topic assignment. This integration enables the extraction of more coherent and interpretable topics, particularly in cases of limited lexical overlap. However, the reliance on static embeddings—which assign a single representation to each word regardless of context—limits the model’s ability to disambiguate polysemous terms or adapt to domain-specific language. Despite this constraint, the approach represents a substantial improvement over purely frequency-based methods by introducing richer lexical and syntactic information into topic models.
The authors of [18] introduce a topic modeling approach named WE-LDA, which extends LDA [1] by integrating word embeddings to enhance topic modeling in the domain of web service clustering. Experimental results obtained with the ProgrammableWeb dataset (presented in more detail in [18]) demonstrate that WE-LDA outperforms LDA and other baselines and provides more coherent and meaningful topic clusters. Nevertheless, the model still has an important drawback: it is based on static embeddings, which cause ambiguity in topic assignments, especially when the same term has different meanings in different contexts.
Although static embedding-based approaches represent an improvement in semantic cohesion compared to bag-of-words models, their context-independence remains a major limitation. Each word is mapped to a single vector regardless of usage, which prevents disambiguation of polysemous terms and adaptation to domain-specific contexts. Consequently, the resulting topics often appear semantically ambiguous and less robust when applied to heterogeneous corpora.
Contextual Embedding-based Approaches: The advent of transformer-based models has been a major turning point. Contextual embeddings generated by BERT [5], SBERT [7], and RoBERTa [6] adapt word representations based on their linguistic context, enabling more accurate and expressive semantic encoding. Inspired by these advances, recent models such as BERTopic [3] and Top2Vec [9] apply contextual embeddings to group semantically similar terms or documents, thus producing more cohesive and interpretable topics. The authors of BERTopic [3] combine transformer-based contextual embeddings with class-based TF-IDF (c-TF-IDF) weighting to generate interpretable topic representations. The proposed model captures rich contextual semantics through the use of BERT-like models, and the c-TF-IDF mechanism prioritizes domain-relevant terms for enhanced topic distinctiveness. Because of its modular design, which facilitates interactive visualizations and dynamic topic reduction, it works especially well for exploratory data analysis. Even with its advantages, BERTopic is a prime example of the wider trade-offs found in sophisticated topic modeling techniques. It improves interpretability and semantic granularity, but it also has drawbacks typical of clustering-based approaches. For instance, incorporating HDBSCAN [19] makes it possible to find semantically coherent document clusters, but at the risk of topic fragmentation due to over-clustering, especially with noisy and heterogeneous datasets. Furthermore, while context-aware embeddings offer better semantic depth, they are tied to large pre-trained models, which leads to high computation costs and low scalability in low-resource settings. Nevertheless, by combining contextual embeddings with density-based clustering, BERTopic achieves an adequate compromise between adaptability and stability. This design makes it possible to derive stable and interpretable topic structures across different application domains.
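The class-based weighting can be made concrete with a minimal sketch. It assumes the formulation W(t, c) = tf(t, c) · log(1 + A / f(t)) as given in the BERTopic paper [3], where tf(t, c) is the count of term t in cluster c, f(t) its count over all clusters, and A the average number of tokens per cluster. Normalization details in the actual library may differ, and the two toy clusters are hypothetical data.

```python
import math
from collections import Counter
from typing import Dict, List


def c_tf_idf(clusters: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Score each term per cluster as tf(t, c) * log(1 + A / f(t)).

    `clusters` maps a cluster label to the tokens of all its documents
    concatenated; A is the average token count per cluster and f(t) is the
    frequency of term t over all clusters.
    """
    per_cluster = {label: Counter(tokens) for label, tokens in clusters.items()}
    total = Counter()
    for counts in per_cluster.values():
        total.update(counts)
    avg_tokens = sum(total.values()) / len(per_cluster)
    return {
        label: {t: tf * math.log(1 + avg_tokens / total[t]) for t, tf in counts.items()}
        for label, counts in per_cluster.items()
    }


# Toy, hand-made clusters (hypothetical data, already tokenized).
scores = c_tf_idf({
    "space": "orbit launch orbit the the".split(),
    "sport": "goal match goal the the".split(),
})
top_space = max(scores["space"], key=scores["space"].get)
print(top_space)  # orbit
```

The term "the" is as frequent as "orbit" inside the first cluster but is down-weighted because it also occurs in the other cluster. The score depends only on counts, so two terms with the same counts receive the same weight regardless of how closely they are related in meaning.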
In contrast, the authors of Top2Vec [9] present a topic modeling technique that simultaneously embeds words and documents into a common semantic space. Without requiring a predetermined number of topics or intensive preprocessing such as stop-word removal, this approach allows the model to detect dense clusters of semantically linked content and infer topic vectors directly from the input. Top2Vec [9] produces meaningful and coherent topics by using the corpus’s natural semantic linkages, which improves the interpretability and applicability of the extracted themes. Nevertheless, Top2Vec [9] has a number of drawbacks in spite of its benefits. The need for high-dimensional embeddings and dense vector operations may cause high computation cost, which may not be suitable for large-scale or resource-limited scenarios. Moreover, the model does not have an explicit probabilistic interpretation, so it is challenging to derive the interpretable topic–word or document–topic distributions required by many analytical applications. Furthermore, the quality of the underlying embeddings and clustering methods has a direct impact on its performance, making it susceptible to instability and sensitive to parameter tuning across datasets. Although Top2Vec [9] successfully improves semantic coherence and minimizes the need for manual intervention, its adoption in scaled, production-ready systems is hampered by these computational and interpretability issues.
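The idea of inferring a topic vector from a dense cluster in a joint word and document space can be sketched as follows. This is a simplified illustration, assuming that the dense cluster has already been found and that the topic vector is the centroid of its document vectors; the 2-D embeddings are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def centroid(vectors: List[Vector]) -> List[float]:
    """Arithmetic mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def topic_words(cluster_docs: List[Vector], word_vecs: Dict[str, Vector], n: int) -> List[str]:
    """Return the n words whose vectors lie closest to the cluster centroid."""
    topic = centroid(cluster_docs)
    return sorted(word_vecs, key=lambda w: cosine(word_vecs[w], topic), reverse=True)[:n]


# Hypothetical 2-D embeddings; real models use hundreds of dimensions.
docs = [[0.9, 0.1], [0.8, 0.2], [1.0, 0.0]]
words = {"rocket": [1.0, 0.1], "orbit": [0.9, 0.3], "recipe": [0.0, 1.0]}
print(topic_words(docs, words, 2))  # ['rocket', 'orbit']
```

The output is a ranked word list rather than a probability distribution, which is why topic–word and document–topic distributions are not directly available from this kind of model.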
The authors of [20] demonstrate that using contextualized document embeddings significantly enhances topic coherence: by producing semantically rich and context-aware representations, these embeddings capture complex links between concepts that conventional bag-of-words or co-occurrence-based models frequently miss. The authors of [20] highlight the key role played by the latent semantic structures of pre-trained transformer models in improving the interpretability and robustness of topic modeling. They also show that incorporating advanced contextual encoders can significantly enhance the quality and semantic coherence of extracted topics, which remedies fundamental limitations of classic topic modeling methods.
The authors of [21] introduce ALBERT, a variation of the original BERT [5] model that aims to produce deep language models that are faster to train and consume less memory. It reduces the parameter count by sharing weights across layers and by decoupling the embedding size from the hidden-layer size, which makes it more efficient than a BERT model of similar capacity. ALBERT [21] also replaces BERT’s original sentence-pair prediction task with a new task called Sentence Order Prediction to learn sentence relationships more effectively. Because of this, ALBERT is more accurate and efficient on a wide range of natural language tasks. However, it still has certain drawbacks: like BERT [5], it is slow at prediction time and requires careful tuning to function well. Additionally, because it shares parameters among layers, it may be less adaptable to very difficult tasks.
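The saving obtained by decoupling the embedding size from the hidden size can be checked with simple arithmetic. The sketch below compares a single V × H embedding matrix with a V × E embedding followed by an E × H projection; the sizes are illustrative assumptions, not values reported in the article.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters of a single V x H embedding matrix (BERT-style)."""
    return vocab_size * hidden_size


def factorized_embedding_params(vocab_size: int, embed_size: int, hidden_size: int) -> int:
    """Parameters of a V x E embedding followed by an E x H projection."""
    return vocab_size * embed_size + embed_size * hidden_size


# Illustrative sizes (assumed, not taken from the article): V=30,000, H=768, E=128.
full = embedding_params(30_000, 768)
factorized = factorized_embedding_params(30_000, 128, 768)
print(full, factorized)  # 23040000 3938304
```

With these assumed sizes the embedding block shrinks by roughly a factor of six. Cross-layer weight sharing reduces the parameter count further but does not reduce the number of layers executed, which is consistent with inference remaining slow.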
The authors of [22] present a simple three-step pipeline for semantic coherence-driven topic extraction from unlabeled text corpora. First, BERT-based contextual embeddings capture rich semantic representations. Second, UMAP [23] reduces dimensionality while preserving structural links in the lower-dimensional space. Third, clustering is performed with k-means, and the topic terms are obtained through TF-IDF scoring. The proposed method is simple and efficient, achieving performance equivalent to more complicated models at a lower computational cost. It has strong topic coherence and diversity and is applicable to multilingual and resource-poor scenarios. However, like all embedding-based models, its performance depends on the quality of the pre-trained embeddings, and it may need fine-tuning for domain-specific uses.
The authors of [6] present several improvements over the original BERT architecture, such as removing the Next Sentence Prediction objective, utilizing larger batch sizes and longer training durations, dynamic masking, and pre-training with much more data. These enhancements result in more powerful and contextually sensitive language representations which achieve state-of-the-art performance on various NLP benchmarks. Nevertheless, the method has drawbacks as well: the enhanced performance is attained at the expense of significant processing power and extended training periods. For academics or practitioners with low hardware capabilities, this limits RoBERTa’s accessibility. Its dependence on massive amounts of data also raises questions of environmental effect, scalability, and the possible encoding of biases in the training corpus.
Graph-based Approaches: Graph-Based Topic Models (GBTMs) are a structured and semantically rich alternative to classical topic models. These models view documents as graphs, with words as nodes and their syntactic or semantic relationships as edges. Such a graph structure allows GBTMs to better encode both local and global semantic relationships between words, which yields more meaningful and interpretable topics.
The authors of [24] propose the Graph Topic Model (GTM), which enhances topic modeling via document relationships. It builds a graph where nodes are documents or words, and edges model word–document co-occurrence and document similarity. By applying Graph Convolutional Networks (GCNs), GTM propagates information across the graph, allowing it to learn more contextually aware and meaningful topics.
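How a graph convolution propagates information between neighboring nodes can be sketched with one propagation step. The sketch assumes the widely used symmetric-normalization rule H' = ReLU(D^-1/2 (A + I) D^-1/2 H W); the exact architecture of GTM is not given here, and the three-node graph, features, and weights are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(adj: Matrix) -> Matrix:
    """Add self-loops, then scale symmetrically: D^-1/2 (A + I) D^-1/2."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    return [[inv_sqrt_deg[i] * a_hat[i][j] * inv_sqrt_deg[j] for j in range(n)] for i in range(n)]


def gcn_layer(adj: Matrix, features: Matrix, weights: Matrix) -> Matrix:
    """One propagation step: ReLU(A_norm @ features @ weights)."""
    hidden = matmul(matmul(normalized_adjacency(adj), features), weights)
    return [[max(0.0, v) for v in row] for row in hidden]


# Hypothetical graph: node 0 is a document linked to word nodes 1 and 2.
adj = [[0.0, 1.0, 1.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # identity weights keep the example readable
out = gcn_layer(adj, features, identity)
print(out[0])  # the document node now carries a non-zero second feature
```

After one step the document node mixes its own features with those of its word neighbors, and each word node receives part of the document's features. Stacking layers widens the neighborhood that influences each node, at a cost that grows with the size of the graph.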
The authors of [25] propose an approach to improve the interpretability of Graph Neural Topic Models (GNTMs), which, despite their semantic power, tend to act as black boxes. The authors provide a local explanation framework that approximates the document-level topic predictions made by GNTMs with a simpler, interpretable surrogate model (Naïve Bayes). This method helps reveal the effect of individual words (or node attributes) on topic assignment, which offers local transparency and increases trust in the model’s output. The approach is shown to provide a high-quality approximation of the original GNTM on multiple benchmark datasets. Nevertheless, the surrogate-based structure of the explanation imposes some restrictions. Despite its high fidelity in many situations, the Naïve Bayes model might not adequately capture the complexity of the deep semantic representations encoded in GNTMs, particularly when working with more abstract or hierarchical topic relationships. Furthermore, the local scope of the explanation may not generalize across documents or themes, which would restrict its usefulness for global interpretability.
On the other hand, the authors of [26] propose the Pachinko Allocation Model (PAM), which aims to improve the interpretability of derived topics by taking into account inter-topic dependence through a hierarchical directed acyclic graph. Contrary to flat topic models like LDA [1], PAM [26] encodes intricate topic dependencies and co-occurrence patterns across multiple levels of abstraction. By capturing the hierarchical nature of topics directly, this model allows for a more structured and semantically informed interpretation of thematic content. The resulting topic hierarchies, however, may be difficult to understand in practice, because the additional complexity can compromise clarity.
The authors of [27] introduce NET-LDA which is a hybrid topic model that integrates the semantic document similarity into the original probabilistic framework of LDA. By building semantic graphs from contextual similarity between documents, the proposed NET-LDA merges lexically distant but semantically similar texts, resolving the problem of vocabulary mismatch in classical LDA [1]. The interpretability and semantic coherence of the generated themes are enhanced by this semantic pre-clustering, particularly in noisy and domain-specific corpora. One significant benefit is its adaptive process, which improves the model’s flexibility and topic relevance by dynamically weighing each document group’s contribution. Additionally, the modeling process is more intuitive and user-friendly with NET-LDA since it does not require a set number of topics. Nevertheless, the method has its limitations, such as the high computational cost as it needs both graph construction and clustering, and that it may be sensitive to the quality of the adopted semantic similarity measure. More importantly, although it has proved to be effective in a specific controlled environment, its applicability to very diversified or multilingual corpora requires empirical confirmation.
Graph-based models effectively capture global and hierarchical semantic relations, yet their high computational demands and structural complexity limit their practical applicability.
Neural-based Approaches: By using deep learning frameworks to identify intricate semantic patterns and contextual connections in textual data, neural topic modeling techniques constitute a substantial improvement. They use latent neural architectures, in contrast to standard models, to enhance topic coherence, flexibility, and scalability across sparse and heterogeneous corpora.
The authors of [28] present a neural topic modeling approach that combines neural network architectures with probabilistic topic modeling. The resulting Neural Topic Model (NTM) reinterprets Latent Dirichlet Allocation (LDA) [1] as a neural network with two distinct hidden layers: one models word-level topic distributions using word embeddings, and the other models document-level topic mixtures. Such a model can overcome the limitations of unigram-based representations while remaining probabilistically interpretable. Additionally, the authors of [28] present a supervised extension (sNTM) that adds an extra output layer for document label prediction. This makes it possible to optimize supervised learning tasks, such as regression and classification, jointly with the topic representation. Experimental tests on rating prediction, multi-class, and multi-label tasks demonstrate that NTM and sNTM perform better than conventional models like LDA [1], especially when it comes to producing themes that are coherent and rich in meaningful multi-word expressions. Notwithstanding its advantages, the model has certain drawbacks. These include a higher risk of overfitting in the supervised form, a dependence on pre-trained embeddings, and more complicated training than traditional topic modeling techniques. Nevertheless, the proposed framework makes it possible to combine the interpretability of probabilistic topic models with the representational capability of neural models.
The authors of [29] propose a neural variational inference framework as a generalization of classical topic models such as LDA [1]. The model supports end-to-end training via neural networks and backpropagation, which improves the scalability and flexibility of topic inference. It provides a nonparametric extension within a deep learning architecture by integrating a stick-breaking prior to support an unbounded number of topics. Despite achieving efficient learning on standard benchmarks and competitive topic coherence, the technique has significant drawbacks. These include diminished interpretability in comparison to classical approaches, since neural inference is a black box, as well as training instability and increased sensitivity to hyperparameter settings. Furthermore, the model’s dependence on intricate architectures and variational approximations may make it more difficult to evaluate and modify in real-world or resource-constrained contexts.
The authors of [30] provide an overview of the development, approaches and applications of NTM. The paper describes how NTMs go beyond classical co-occurrence-based or probabilistic graphical models by using deep learning models—such as variational autoencoders, graph neural networks, contextual embeddings, and reinforcement learning frameworks—to access richer, more flexible latent topic structures. The described approaches (mentioned in [30]) are classified according to network architecture and application context, such as short texts, multilingual settings and evolving corpora. Moreover, the survey reviews the ability of NTMs to deal with the challenges faced by topic modeling, such as semantic interpretability, robustness, and transferability. However, according to the paper, we still face challenges in scalability, interpretability, training stability, and computational resources. Therefore, the authors of [30] offer a useful starting point for future investigations that seek to optimize neural topic modeling systems for use in real-world, scalable, and interpretable applications.
Neural-based topic models leverage deep architectures to capture complex semantic patterns and enhance topic coherence, but their training instability, high computational cost, and reduced interpretability remain major obstacles to their widespread deployment.

In summary, among the categories proposed in the literature to improve the quality and interpretability of topic modeling (static embeddings, contextual embeddings, graph-based, and neural-based approaches), contextual embeddings-based approaches provide the most promising balance between semantic expressiveness and practical usability.

Although static embeddings represented an early stage in semantic integration, their lack of consideration for context-dependent meanings restricts their performance in capturing subtle linguistic structures. The contextual embeddings, which are produced from the transformer-based models, tune word representations based on their textual context, leading to better-formed and interpretable topic structures.

In contrast, both graph-based and neural models provide more advanced semantic modeling with more complex architectures such as graph convolution or variational inference. However, most of these models suffer from high computational complexity, low interpretability, and poor scalability in practice.

In models like BERTopic [3] and Top2Vec [9], contextual embeddings combined with clustering algorithms and dimensionality reduction methods are used to extract semantically coherent topics. These models utilize the dynamic semantic representations inherent in transformer-based embeddings which result in interpretable topics better than traditional co-occurrence-based methods. However, a general drawback of these approaches is that they depend on TF-IDF (frequency-based term-weighting) methods to extract topic words. Although TF-IDF is useful for emphasizing statistically significant phrases within a cluster, it ignores the latent semantic structure that underlies the data, thus producing topic descriptors that are frequent but not always thematically representative.

On the other hand, LDA [1] is a probabilistic model where the topical structures of documents are based on both document–topic and topic–word distributions. This structure allows us to generate more statistically motivated and interpretable topic representations in which each topic is induced via a distribution over the vocabulary, and each document is represented as a mixture over topics. While LDA [1] suffers from the context-independent, bag-of-words assumption, which limits its capability to represent fine-grained, context-dependent semantics, it still serves as a strong and efficient baseline in topic modeling. It is highly transparent, and has a solid mathematical foundation, therefore suitable for a variety of NLP applications.

Accordingly, many methodologies in recent literature still focus on re-using LDA [1] (within hybrid models) to take advantage of its interpretability and statistical guarantees, while integrating additional input features to overcome its shallowness in semantics, through embeddings-based refinement. Therefore, LDA remains one of the building blocks to build more expressive and semantically enhanced topic modeling frameworks.

Taken together, these developments highlight an unresolved challenge: the integration of context-sensitive semantic representations into probabilistic inference in a way that remains both interpretable and computationally efficient. To address this gap, we introduce SemaTopic, a unified and adaptive framework that fuses transformer-based embeddings with coherence-optimized probabilistic modeling. Our approach enables the extraction of robust, interpretable, and contextually grounded topics, while ensuring scalability and reproducibility across heterogeneous corpora.

In order to compare existing approaches and address the previously identified challenges, we define nine criteria corresponding to the two main research questions:

Challenge 1. How can context-sensitive semantic representation be integrated into topic models to generate more coherent, interpretable, and semantically meaningful topics?

Criterion 1 (C1) Document-Level Semantic Representation: Assesses whether the model incorporates semantic information from the entire document (e.g., through contextual or document embeddings). Possible values: (i) Yes or (ii) No.
Criterion 2 (C2) Word-Level Contextualization: Indicates whether word meanings vary with context (e.g., through models like BERT), allowing dynamic semantic interpretation. Possible values: (i) Yes or (ii) No.
Criterion 3 (C3) Domain Context Awareness: Determines whether the model adapts to or integrates domain-specific knowledge, vocabulary, or embeddings. Possible values: (i) Yes or (ii) No.
Criterion 4 (C4) Topic Granularity: Measures the model’s ability to identify not only broad themes but also sub-topics or hierarchical structures. Possible values: (i) Topics or (ii) Sub-Topics.
Criterion 5 (C5) Intra-topic Coherence: Evaluates the semantic coherence of words within each topic, i.e., whether topic terms are meaningfully related. Possible values: (i) High, (ii) Medium or (iii) Low.

Challenge 2. How can topic stability be ensured when the model is applied to evolving or partially modified corpora?

Criterion 6 (C6) Topic Stability Across Corpus Variations: Indicates whether the extracted topics remain consistent when the corpus undergoes small changes (e.g., addition/removal of documents). Possible values: (i) Yes or (ii) No.
Criterion 7 (C7) Term-Level Topic Stability: Checks whether the top terms within each topic are preserved across multiple runs or corpus versions. Possible values: (i) Yes or (ii) No.
Criterion 8 (C8) Hyperparameter Optimization: Reflects the degree to which model tuning (e.g., for alpha, beta, or number of topics) is automated for performance improvement. Possible values: (i) No, (ii) Partial or (iii) Full automation.
Criterion 9 (C9) Determinism: Indicates whether the model returns the same output under identical conditions (fixed parameters and corpus), highlighting reproducibility. Possible values: (i) Yes or (ii) No.

Of the eight compared methods, significant differences are observed across the nine analytical measures. The baseline probabilistic models [11,12,18] support document-level semantic representation (C1) but handle neither dynamic word-level context (C2) nor domain-specific adaptation (C3), both of which are essential for fine-grained semantic inference. Newer embeddings-based models, such as [3,9], incorporate both document-level and word-level semantic features (C1, C2), resulting in substantial gains in semantic coherence and topic stability (C5–C7). Nevertheless, their generalization capacity remains limited due to insufficient domain adaptability (C3) and the lack of automated hyperparameter tuning (C8), which hinders scalability across diverse corpora. In contrast, refs. [27,29] bring improvements in semantic grouping but fall short in consistent modeling and model reproducibility, limiting their broader applicability. Our proposed SemaTopic model distinguishes itself by satisfying all nine evaluation criteria. It captures comprehensive semantic representations at both the document and word levels (C1, C2), incorporates domain-aware embeddings (C3), and enables the extraction of both macro-topics and fine-grained subtopics (C4), while achieving high intra-topic coherence (C5). Moreover, SemaTopic demonstrates strong stability across corpus variations (C6, C7), features fully automated hyperparameter optimization (C8), and ensures deterministic outputs (C9), making it a highly interpretable and dependable solution.

The comparison synthesis reported in Table 1 underscores the robustness of the SemaTopic framework. Although the list of models under comparison is not exhaustive, our approach appears to offer the most comprehensive, scalable, and semantically consistent solution among the evaluated methods.

3. Methodology

In this section, we present our proposed framework, SemaTopic, a unified architecture specifically designed to overcome key limitations of existing topic modeling approaches. While traditional models (like LDA) rely on shallow, symmetric word co-occurrence statistics, often resulting in low semantic expressiveness and poor topic coherence, SemaTopic leverages the deep contextual understanding of Large Language Models (LLMs) to generate semantically enriched embeddings at both the document and word levels. These contextualized embeddings are then refined through density-based semantic clustering, enabling the model to detect latent structures and form semantically coherent groups of documents. This semantically clustered layout serves as a structural foundation that facilitates the optimization of the probabilistic topic extraction process, enhancing both topic coherence and interpretability. Beyond topic extraction, SemaTopic incorporates a dedicated stability analysis module, designed to assess the robustness of the generated topics under corpus variations (e.g., additions or removals of documents). This analysis supports the reproducibility and dependability of the model’s output, which is essential for real-world applications where data is constantly changing. By integrating contextual embeddings, unsupervised clustering, and probabilistic modeling into a cohesive multi-stage pipeline, SemaTopic delivers a topic modeling solution that is not only scalable and domain-adaptive, but also interpretable and robust even in the presence of heterogeneous, noisy, or evolving textual data.

The SemaTopic framework, illustrated in Figure 1, is structured into three interdependent modules. Each addresses a critical aspect of modern topic modeling, and together they form a cohesive pipeline that balances semantic expressiveness, interpretability, and robustness:

(A) Contextual Semantic Representation: Exploits the semantic richness provided by Large Language Models (LLMs) to generate deep, context-aware embeddings at both the document and word levels. This ensures that the model captures nuanced meanings and domain-specific variations that traditional models often overlook,
(B) Coherent Probabilistic Topic Extraction: Builds on this semantic foundation by first applying density-based clustering to reveal natural groupings within the corpus. These clusters guide an optimized probabilistic inference step that extracts topics that are not only statistically valid but also semantically coherent and interpretable, and
(C) Topics Stability and Robustness Analysis: Introduces a systematic evaluation mechanism to assess the robustness and reproducibility of the discovered topics in response to corpus modifications (e.g., updates, additions, or removals). This step is essential to ensure that the topics remain consistent and comprehensible over time.

Figure 1. SemaTopic framework.


Thanks to this modular and flexible architecture, SemaTopic effectively bridges the gap between semantic richness and interpretability, while also ensuring stability and scalability across diverse textual datasets and application domains. In what follows, we detail each of these components.

3.1. Contextual Semantic Representation

This module is the foundational layer of the framework and aims to preserve the underlying semantic content of raw text while encoding it into abstract representations. By preserving semantic and contextual cues, this module provides a coherent linguistic foundation for extracting meaningful topics, thereby enhancing the semantic coherence, accuracy, and interpretability of the resulting topics.

Text Preprocessing: This sub-module applies a sequence of normalization steps including tokenization, stop-word removal, lowercasing, and lemmatization that transform the texts into a more uniform representation. These steps lead to increased lexical consistency and reduced semantic ambiguity, which in turn creates a cleaner and more reliable foundation for generating contextually rich and coherent topic representations.
Named Entity Extraction: A Named Entity Recognition (NER) model is used to recognize and preserve semantically meaningful terms such as domain-specific entities and compound expressions. This also helps retain important multi-word expressions (e.g., artificial intelligence, machine learning) that might otherwise be fragmented or lost during preprocessing. By treating them as single units, the model enhances thematic consistency and contributes to more interpretable and relevant topics.
Document Representation via Mean-Pooled Contextual Embeddings: To represent documents at the semantic level, each text is first tokenized and encoded using a pre-trained BERT model, which generates contextualized embeddings for all tokens. These token-level representations are then aggregated into a single fixed-size vector by applying mean pooling, i.e., computing the average across all token embeddings in a document. The resulting document embedding constitutes a compact yet expressive semantic representation that captures both the dominant thematic content and the contextual nuances of the text, while attenuating the influence of noise and local variations. This representation not only provides robustness for long or heterogeneous documents but also serves as an effective prefiltering mechanism: when used as input to the clustering stage, it ensures that documents are grouped according to their semantic proximity. In turn, this facilitates the extraction of more coherent, interpretable, and stable topics in the subsequent modeling phase.
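
The mean-pooling step described above can be illustrated with a small, dependency-free sketch. It assumes the token embeddings have already been produced by an encoder such as BERT and are available as plain lists of floats; the function name `mean_pool` and the use of an attention mask to skip padding tokens are illustrative assumptions, not details given in the text.

```python
from typing import List, Sequence


def mean_pool(token_embeddings: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1.

    Assumption: padding tokens are marked with 0 in the mask and are
    excluded from the average, as is common practice with BERT encoders.
    """
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("one mask entry is required per token")
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m]
    if not kept:
        raise ValueError("document has no unmasked tokens")
    dim = len(kept[0])
    # Component-wise mean over the retained token vectors.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy 3-dimensional "token embeddings"; the last row stands for padding.
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]
doc_vector = mean_pool(tokens, mask)  # fixed-size document embedding
```

Whatever the document length, the result has the dimensionality of a single token vector, which is what makes it usable as input to the dimensionality reduction and clustering stages.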

3.2. Coherent Probabilistic Topic Extraction

This module is responsible for extracting semantically meaningful and interpretable topics by extending the traditional probabilistic modeling approach (LDA). It combines the statistical strengths of LDA with the semantic richness of context-aware embeddings generated by BERT. This integration enables the model to capture both latent thematic distributions and nuanced linguistic patterns. By leveraging probabilistic topic distributions alongside contextual document representations, the framework ensures that the topics extracted are not only statistically sound but also thematically coherent and human-interpretable. This dual approach bridges the gap between classical topic modeling and modern semantic understanding, leading to more robust and informative topic structures.

Dimensionality Reduction of Embeddings: The contextual document embeddings generated by language models such as BERT reside in high-dimensional spaces, which can negatively impact clustering and semantic analysis. While these embeddings are semantically rich, their high dimensionality can hinder both clustering performance and semantic pattern detection due to the curse of dimensionality, where data points become increasingly sparse and equidistant. To address this issue, we apply Uniform Manifold Approximation and Projection (UMAP) [23], a non-linear dimensionality reduction algorithm specifically designed to preserve both local and global semantic relationships during the projection process. UMAP reduces the dimensionality of the embeddings to a more computationally efficient space (typically 5–15 dimensions), enabling better discrimination between semantically distinct regions and reducing noise or redundant variance. This step enhances computational efficiency and improves cluster separability, making subsequent clustering results more interpretable and semantically coherent. It serves as a crucial bridge between deep contextual representations and probabilistic topic inference.
Semantic Document Clustering: Although document embeddings provide rich semantic information, their continuous and high-dimensional nature makes them incompatible with traditional probabilistic topic models such as LDA, which require discrete input representations. To overcome this limitation, we apply semantic clustering to group similar documents into thematically coherent subsets, thereby pre-structuring the corpus prior to probabilistic inference. Specifically, we employ HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) [19], a density-based clustering algorithm capable of automatically inferring the number of clusters, handling varying shapes and densities, and filtering out noisy data points. In contrast to centroid-based approaches such as k-means, HDBSCAN [19] is particularly well-suited for textual corpora, which are often heterogeneous and noisy. To improve cluster separability, we integrate UMAP [23] for dimensionality reduction, as it preserves both local and global semantic relationships more effectively than alternatives such as PCA [16] or t-SNE [31]. The combination of UMAP [23] and HDBSCAN [19] enables the identification of semantically coherent clusters without requiring prior knowledge of the number of clusters, ensuring robustness to noise and enhancing the interpretability, cohesion, and accuracy of the topics extracted in subsequent modeling stages.
Automatic Optimization of $SemaTopic$ Parameters: To enhance the semantic coherence, semantic relevance, and reproducibility of the extracted topics, we integrate an automated hyperparameter tuning procedure into our extended LDA-based topic modeling module. This model, adapted to each document cluster, incorporates contextual cues from the embeddings and clustering stages, thus mitigating the limitations of classical co-occurrence-based models. Instead of relying on fixed or manually selected hyperparameters, our framework explores a predefined configuration space to identify the optimal set of values for each cluster, ensuring that topic extraction is tailored to the semantic granularity and distribution of each group. The optimization process includes the following parameters:
- $α$ : Controls the sparsity of topic distribution per document. Smaller values encourage each document to focus on fewer topics (i.e., more specific themes), while larger values allow for broader topic mixtures. We sample $α$ in the range $[0.01, 1.0]$ with increments of 0.1 to explore varying levels of topical concentration.
- $β$ : Regulates the sparsity of the word distribution per topic. A smaller $β$ results in topics composed of a few highly distinctive terms, enhancing interpretability; a larger $β$ produces broader vocabularies. We similarly discretize $β$ in $[0.01, 1.0]$ with steps of 0.1.
- $K \in [2, 100]$ : Defines the number of latent topics to extract per document cluster. This allows the model to adapt to clusters of varying sizes and thematic diversity, supporting both fine-grained segmentation and broader thematic overviews.
The proposed framework integrates an automated, data-driven configuration tuning mechanism to optimize key hyperparameters such as $α$ , $β$ , and the number of topics K. For each tested configuration, semantic coherence is quantitatively measured using the $C_{v}$ metric. The configuration that achieves the highest semantic coherence score is automatically selected, ensuring that the resulting topics are both interpretable and contextually meaningful. The semantic coherence score $C_{v} (t)$ for a given topic t is computed as follows:

$C_{v} (t) = \frac{1}{| P_{t} |} \sum_{(w_{i}, w_{j}) \in P_{t}} NPMI (w_{i}, w_{j}) \times cos ({\vec{w}}_{i}, {\vec{w}}_{j})$

(1)

where:
- t: A topic extracted by the model.
- $P_{t}$ : The set of all unordered word pairs $(w_{i}, w_{j})$ within the top-N most probable words in topic t, i.e., $P_{t} \subseteq \{(w_{i}, w_{j}) \mid w_{i}, w_{j} \in W_{t}, i < j\}$ , where $W_{t}$ is the top-N word list of topic t.
- $NPMI (w_{i}, w_{j})$ : Normalized Pointwise Mutual Information between $w_{i}$ and $w_{j}$ , estimated from a reference corpus using a fixed-size sliding context window.
- ${\vec{w}}_{i}, {\vec{w}}_{j}$ : Contextual embedding vectors of $w_{i}$ and $w_{j}$ , obtained from a transformer-based model (e.g., BERT).
- $cos ({\vec{w}}_{i}, {\vec{w}}_{j})$ : Cosine similarity between the embeddings serving as a proxy for semantic closeness.
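
As a rough illustration of Equation (1), the sketch below computes the score of one topic as the average of NPMI times cosine similarity over all pairs of top words. Several points are assumptions made for the example only: the reference corpus is represented as a list of context windows (sets of words), probabilities are estimated as window frequencies, word pairs that never co-occur receive the NPMI lower bound of -1, and the function names are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Set


def npmi(w_i: str, w_j: str, windows: List[Set[str]]) -> float:
    """Normalized PMI estimated from co-occurrence in context windows."""
    n = len(windows)
    if n == 0:
        return 0.0
    p_i = sum(1 for win in windows if w_i in win) / n
    p_j = sum(1 for win in windows if w_j in win) / n
    p_ij = sum(1 for win in windows if w_i in win and w_j in win) / n
    if p_ij == 0.0:
        return -1.0  # assumption: lower bound when the pair never co-occurs
    if p_ij == 1.0:
        return 0.0   # assumption: the 0/0 case is treated as neutral
    return math.log(p_ij / (p_i * p_j)) / -math.log(p_ij)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def coherence(top_words: Sequence[str], windows: List[Set[str]],
              vectors: Dict[str, Sequence[float]]) -> float:
    """Average of NPMI * cosine over all unordered pairs of top words."""
    pairs = list(combinations(top_words, 2))
    if not pairs:
        return 0.0
    total = sum(npmi(a, b, windows) * cosine(vectors[a], vectors[b])
                for a, b in pairs)
    return total / len(pairs)


# Toy reference windows and 2-dimensional stand-ins for word embeddings.
windows = [{"cat", "dog"}, {"cat", "dog"}, {"car"}, {"car", "road"}]
vectors = {"cat": [1.0, 0.0], "dog": [2.0, 0.0], "car": [0.0, 1.0]}
score_pair = coherence(["cat", "dog"], windows, vectors)
score_mixed = coherence(["cat", "dog", "car"], windows, vectors)
```

In this toy setting the tightly related pair scores higher than the topic that mixes in an unrelated word, which is the behavior the metric is meant to reward.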
Traditional probabilistic topic models such as LDA are highly sensitive to manually set hyperparameters. In practice, exhaustive tuning is often avoided due to its substantial computational cost, and most users rely on a small number of hand-selected configurations, which are often insufficient to capture the optimal semantic granularity. This limited coverage can significantly hinder topic coherence and interpretability. To overcome this limitation, the proposed framework embeds an end-to-end semantic coherence-based optimization pipeline that automatically searches the configuration space to maximize topic coherence, as measured by $C_{v}$ . This optimization process is structured into three distinct phases, each targeting a specific aspect of parameter tuning to enhance topic coherence:
(i) Model initialization: The search begins with a range of topic numbers $K \in [2, 100]$ . For each value of K, all combinations of the Dirichlet priors $α$ and $β$ on the discretized grid over $[0.01, 1.0]$ are tested. This dense sampling of the hyperparameter space ensures an exhaustive and fine-grained optimization. It allows the framework to identify, for each cluster, the configuration that yields the highest semantic coherence. This approach marks a significant improvement over traditional manual tuning, which is often limited to a small and heuristic subset of configurations.
(ii) Semantic coherence evaluation: Each candidate configuration from the hyperparameter grid is evaluated using the semantic coherence metric $C_{v}$ , as defined previously. This score provides a proxy for human interpretability by measuring the semantic consistency among top-ranked topic terms. Its systematic use ensures a fair and reliable assessment of topic quality across different model instances.
(iii) Parameter selection: For each semantically coherent document cluster, the configuration with the highest $C_{v}$ score is automatically selected. This ensures that the resulting topic distributions are not only statistically optimal but also thematically meaningful and contextually coherent.
This built-in optimization procedure eliminates the need for manual parameter tuning, allowing the model to dynamically adapt to various corpus structures and domain-specific characteristics. As a result, the framework improves generalization and usability across a wide range of real-world applications. In particular, this automatic coherence-driven hyperparameter search avoids manual trial-and-error, ensures optimal settings per cluster, and improves both topic quality and reproducibility.
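
The three phases above amount to an exhaustive grid search per cluster. The following is a minimal sketch of that loop under stated assumptions: `prior_grid` and `select_best` are hypothetical helper names, and the scoring callable stands in for "fit the extended LDA on one cluster with these settings and return its $C_{v}$ score", which is not reproduced here.

```python
from itertools import product
from typing import Callable, Iterable, List, Tuple


def prior_grid(start: float = 0.01, stop: float = 1.0,
               step: float = 0.1) -> List[float]:
    """Discretize a Dirichlet prior range: 0.01, 0.11, ..., up to `stop`."""
    values = []
    i = 0
    while True:
        v = round(start + i * step, 2)  # rounding avoids float drift
        if v > stop:
            break
        values.append(v)
        i += 1
    return values


def select_best(score: Callable[[float, float, int], float],
                alphas: Iterable[float], betas: Iterable[float],
                ks: Iterable[int]) -> Tuple[Tuple[float, float, int], float]:
    """Return the (alpha, beta, K) configuration with the highest score."""
    best_cfg, best_score = None, float("-inf")
    for alpha, beta, k in product(alphas, betas, ks):
        s = score(alpha, beta, k)
        if s > best_score:
            best_cfg, best_score = (alpha, beta, k), s
    return best_cfg, best_score


def toy_score(alpha: float, beta: float, k: int) -> float:
    """Placeholder objective (NOT a coherence metric), peaking at one point."""
    return -abs(alpha - 0.31) - abs(beta - 0.11) - abs(k - 7) / 100


alphas = prior_grid()
betas = prior_grid()
best_cfg, best_score = select_best(toy_score, alphas, betas, range(2, 101))
```

With the stated ranges this grid contains 10 × 10 × 99 = 9900 configurations per cluster, which gives a sense of the computational cost that the exhaustive search trades for coverage.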

Topic Extraction: Once the best hyperparameter setting is found, SemaTopic extracts topics from the semantically coherent clusters using these parameters, enhancing readability and discrimination. These locally refined topics are then combined to generate a coherent, semantically rich global topic set.

Within each semantic cluster, once the optimal hyperparameters $(α, β, K)$ are identified, extended LDA is applied to infer the internal thematic distributions. If the inference yields a single coherent distribution, the cluster is considered as a global topic. Conversely, if two or more semantically coherent distributions emerge, the cluster is modeled as a topic with sub-topics. Importantly, this decision is guided by the semantic coherence of the inferred distributions, rather than by the cluster size. In this way, sub-topics arise only when the probabilistic model reveals sufficient internal separability, ensuring that granularity reflects semantic reality rather than arbitrary thresholds.

Figure 2 provides a schematic illustration of the sub-topic detection process in SemaTopic. Semantic clusters, obtained through HDBSCAN, are refined using extended LDA with optimized hyperparameters $(α, β, K)$ . A cluster may either remain as a single global topic when only one coherent distribution is inferred, or be decomposed into multiple sub-topics when several coherent and distinct distributions are detected. The figure exemplifies the latter case, where Cluster 2 is divided into three semantically coherent sub-topics, highlighting how SemaTopic captures both macro- and micro-level thematic structures.

The full generative process of the SemaTopic model is detailed in Algorithm 1. Its probabilistic formulation and component interactions are illustrated in Figure 3.

SemaTopic extends classical topic modeling by introducing an automatic and context-aware framework that enhances both semantic coherence and topic distinctiveness. The method supports interpretability and robustness by first segmenting the document space into semantic clusters, and then applying probabilistic topic modeling within each cluster. This modular architecture enables the generation of richer and more coherent topic structures, while addressing several key limitations of traditional probabilistic models, particularly their limited adaptability to diverse corpora and their reliance on manual hyperparameter tuning.

Algorithm 1 Generative Process of SemaTopic.

Require: Corpus D, pre-trained BERT model, semantic coherence metric $C$

1: Generate contextual embeddings ${\vec{e}}_{d}$ for each document $d \in D$ using BERT
2: Apply UMAP to reduce dimensionality of embeddings
3: Cluster documents using HDBSCAN to obtain semantic clusters $S = \{C_{1}, \ldots, C_{M}\}$
4: for each cluster $C_{j} \in S$ do
5:     for each combination of $α \in [0.01, 1]$ , $β \in [0.01, 1]$ (step 0.1), and $K_{j} \in [2, 100]$ do
6:         for each topic $k = 1$ to $K_{j}$ do
7:             Draw topic–word distribution $ϕ_{k} \sim Dirichlet (β)$
8:         end for
9:         for each document $d \in C_{j}$ do
10:             Draw topic proportions $θ_{d} \sim Dirichlet (α)$
11:             for each word $w_{d, i}$ in d do
12:                 Sample topic assignment $z_{d, i} \sim Multinomial (θ_{d})$
13:                 Sample word $w_{d, i} \sim Multinomial (ϕ_{z_{d, i}})$
14:             end for
15:         end for
16:         Evaluate semantic coherence score $C$ for current configuration
17:     end for
18:     Select $(α^{*}, β^{*}, K_{j}^{*})$ maximizing $C$
19:     Retain topic distributions from optimal configuration
20: end for
21: Merge topics from all clusters to form the final global topic set

Table 2 defines the main variables used in the SemaTopic model. It includes document-level parameters such as topic proportions ($θ_{d}$), word-level assignments ($z_{d, i}$), and BERT embeddings (e), as well as cluster-level components ($C_{j}$) and topic–word distributions ($ϕ_{k}$). These variables describe the generative process and structure of the model.

Table 3 presents the core variables involved in the generative process of the standard Latent Dirichlet Allocation (LDA) model. The model assigns a topic distribution $θ$ to each document, governed by the Dirichlet prior $α$ . For each word position i, a latent topic $z_{i}$ is sampled from $θ$ , and the actual word $w_{i}$ is drawn from the word distribution of the corresponding topic $φ$ , which is itself drawn from another Dirichlet prior $β$ . These probabilistic components collectively describe how documents in a corpus are generated based on latent topic structures.
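
To make the standard LDA generative story concrete, the sketch below samples one toy document using only the Python standard library. It is an illustration under assumptions: symmetric Dirichlet priors, a four-word vocabulary, two topics, and helper names (`sample_dirichlet`, `generate_document`) chosen for the example. It shows forward sampling only, not the inference procedure used to fit a model.

```python
import random
from typing import List, Sequence, Tuple


def sample_dirichlet(rng: random.Random, concentration: float,
                     size: int) -> List[float]:
    """Draw from a symmetric Dirichlet by normalizing Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(rng: random.Random, vocab: Sequence[str],
                      phi: Sequence[Sequence[float]], alpha: float,
                      length: int) -> Tuple[List[float], List[str]]:
    """Sample theta ~ Dirichlet(alpha), then a topic and a word per position."""
    theta = sample_dirichlet(rng, alpha, len(phi))
    words = []
    for _ in range(length):
        z = rng.choices(range(len(phi)), weights=theta)[0]  # topic assignment
        w = rng.choices(vocab, weights=phi[z])[0]           # word from topic z
        words.append(w)
    return theta, words


rng = random.Random(0)  # fixed seed so repeated runs give the same sample
vocab = ["gene", "cell", "market", "stock"]
num_topics = 2
# One topic-word distribution per topic, each drawn from Dirichlet(beta = 0.5).
phi = [sample_dirichlet(rng, 0.5, len(vocab)) for _ in range(num_topics)]
theta, doc = generate_document(rng, vocab, phi, alpha=0.5, length=8)
```

Smaller values of `alpha` tend to concentrate `theta` on a single topic, while smaller values of the topic-word prior concentrate each row of `phi` on a few words, which matches the role of the two priors described above.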

The plate notation diagram of SemaTopic shown in Figure 3 is a methodological refinement of traditional probabilistic topic models, allowing for semantic understanding and structural granularity. Whereas traditional LDA (whose plate notation is presented in Figure 4) processes the corpus as a uniform container of documents, SemaTopic first uses contextual embeddings to divide the corpus into coherent thematic clusters. In the second step, topic inference is performed within each cluster to further strengthen thematic distinctiveness and semantic coherence.

Incorporating embeddings from Large Language Models into our framework allows us to extract more coherent and semantically meaningful topic representations by effectively capturing the context-dependent subtleties of word usage. At the same time, automatic hyperparameter tuning makes sure that each semantic group is modeled under the best settings, without requiring ad hoc tuning by the user, thus enhancing the overall quality of the extracted topics. This structured, context-aware model improves the interpretability and semantic coherence of topics, but does not lose the probabilistic ground of traditional models like LDA.

3.3. Topics Stability and Robustness Analysis

This section assesses the stability and robustness of the SemaTopic framework under corpus perturbations. Topic stability is crucial to ensure that the thematic structure remains consistent even when the input data is partially modified, a common situation in real-world applications where corpora are often noisy, incomplete, or evolving over time.

Corpus Modification: To simulate realistic variations in input data, a perturbed version of the original corpus D is created, denoted $D^{'}$ . Two types of modifications are applied:
- Addition: New documents are added to corpus D.
- Removal: A random subset of documents is removed, which is intended to simulate data loss, sampling bias, or incomplete data.
These modifications aim to test whether the model can maintain consistent topic structure despite changes in data distribution.
Re-Extraction of Topics: The $SemaTopic$ model is then reapplied to the modified corpus $D^{'}$ , using the same semantic clustering structure and hyperparameter settings that were applied to the original corpus D. This ensures that changes in the resulting topic set $T^{'}$ are solely due to corpus perturbation, not to differences in model configuration. The new topics $T^{'}$ are then compared to the original topic set T.
Topic Similarity Measurement: To evaluate the degree of alignment between the original and perturbed topic sets, three complementary metrics are employed:
- Jaccard Similarity: Computes the word overlap between the top-n terms of matched topics in T and $T^{'}$ , indicating lexical consistency.
- Cosine Similarity: Measures the similarity between full topic–word probability distributions (treated as vectors), capturing global distributional alignment.
- Topic Match Ratio: Represents the proportion of topics in the original set T that have at least one matching topic in $T^{'}$ with a similarity score above a predefined threshold $τ$ .
Validation of Topic Consistency: Topic stability under corpus perturbations is measured using the computed similarity scores. High similarity values across the evaluation metrics suggest that the model consistently captures coherent topic structures and is robust and reliable. Conversely, significant divergence between runs may indicate sensitivity to data variance and point to specific components that need additional calibration or improvement.
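The corpus modification step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `perturb_corpus`, its parameters, and the fixed random seed are assumptions introduced here, and documents are represented as plain strings.

```python
import random

def perturb_corpus(docs, n_remove=0, new_docs=(), seed=0):
    """Return a perturbed copy D' of the corpus D (hypothetical helper).

    Removal drops `n_remove` documents chosen uniformly at random;
    addition appends the documents given in `new_docs`.
    """
    if not 0 <= n_remove <= len(docs):
        raise ValueError("n_remove must be between 0 and len(docs)")
    rng = random.Random(seed)  # fixed seed so the perturbation is repeatable
    removed = set(rng.sample(range(len(docs)), n_remove))
    kept = [doc for i, doc in enumerate(docs) if i not in removed]
    return kept + list(new_docs)

# Toy corpus: remove three documents and add one new document.
corpus = [f"doc {i}" for i in range(10)]
perturbed = perturb_corpus(corpus, n_remove=3, new_docs=["extra doc"], seed=42)
print(len(perturbed))  # 10 - 3 + 1 = 8
```

Fixing the seed keeps $D^{'}$ identical across repetitions, so that differences between T and $T^{'}$ can be attributed to the perturbation itself.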

This stability and robustness analysis reinforces the framework’s practical value by demonstrating its ability to provide trustworthy and generalizable topics in the face of incomplete or evolving datasets, a key requirement for scalable, real-world applications of topic modeling. The topic stability evaluation procedure is detailed in Algorithm 2.

This algorithm quantifies the stability and robustness of the SemaTopic framework by evaluating the consistency of topics under controlled corpus perturbations. In particular, the model is applied to both the original corpus D and a slightly modified version $D^{'}$, where perturbations simulate real-world scenarios such as the addition or removal of documents. This yields two sets of topics: T (from D) and $T^{'}$ (from $D^{'}$). To assess how closely the two topic sets align, each topic $t_{i} \in T$ is compared to the most similar topic $t_{j}^{'} \in T^{'}$, using two complementary similarity measures:

- Jaccard similarity, based on the overlap of top-n topic words,
- Cosine similarity, computed over the full topic–word distributions treated as high-dimensional vectors.

These pairwise comparisons are aggregated into two key metrics:

- Average Similarity Score $\bar{S}$, for both Jaccard and Cosine, capturing the global alignment between topic structures;
- Topic Match Ratio (TMR) R, measuring the proportion of original topics in T that have at least one closely matching counterpart in $T^{'}$, based on a predefined similarity threshold $τ$.

High values of $\bar{S}$ and R indicate that the topics extracted by SemaTopic are stable, reproducible, and resilient to data variations, validating the model’s robustness and reliability in dynamic or noisy environments.

Algorithm 2 Robustness Evaluation of SemaTopic.

Require: Original corpus D, modified corpus $D^{'}$, trained model $M$, similarity threshold $τ$
1: Run SemaTopic on D to obtain original topic set T
2: Run SemaTopic on $D^{'}$ to obtain perturbed topic set $T^{'}$
3: for each topic $t_{i} \in T$ do
4:     Identify most similar topic $t_{j}^{'} \in T^{'}$ using:
       - Jaccard similarity over top-n terms
       - Cosine similarity over full topic–word distributions
5:     Compute similarity scores $S_{ij}^{Jaccard}$ and $S_{ij}^{Cosine}$
6: end for
7: Compute average Jaccard similarity: ${\bar{S}}_{Jaccard} = \frac{1}{|T|} \sum S_{ij}^{Jaccard}$
8: Compute average Cosine similarity: ${\bar{S}}_{Cosine} = \frac{1}{|T|} \sum S_{ij}^{Cosine}$
9: Count number of topics where $S_{ij}^{Cosine} \geq τ$ or $S_{ij}^{Jaccard} \geq τ$
10: Compute Topic Match Ratio: $R = \frac{\#\,\text{Matched Topics}}{|T|}$
11: return ${\bar{S}}_{Jaccard}$, ${\bar{S}}_{Cosine}$, R
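To make steps 3 to 11 of Algorithm 2 concrete, the following is a minimal Python sketch under several assumptions that are not stated in the paper: each topic is represented as a dictionary mapping words to probabilities, the best match is chosen independently for each similarity measure, cosine similarity is computed over the union of both vocabularies, and the default values `tau=0.5` and `top_n=10` are purely illustrative. All function names are hypothetical.

```python
import math

def jaccard(words_a, words_b):
    """Jaccard overlap between two collections of top terms."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(dist_a, dist_b):
    """Cosine similarity between two word -> probability mappings."""
    vocab = set(dist_a) | set(dist_b)
    dot = sum(dist_a.get(w, 0.0) * dist_b.get(w, 0.0) for w in vocab)
    norm_a = math.sqrt(sum(v * v for v in dist_a.values()))
    norm_b = math.sqrt(sum(v * v for v in dist_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_terms(dist, n):
    """The n most probable words of a topic."""
    return [w for w, _ in sorted(dist.items(), key=lambda kv: -kv[1])[:n]]

def evaluate_robustness(topics, topics_perturbed, tau=0.5, top_n=10):
    """Return (mean best Jaccard, mean best cosine, topic match ratio R)."""
    if not topics or not topics_perturbed:
        raise ValueError("both topic sets must be non-empty")
    best_jaccard, best_cosine = [], []
    for t in topics:
        # Best-matching perturbed topic, chosen separately per measure.
        best_jaccard.append(max(
            jaccard(top_terms(t, top_n), top_terms(u, top_n))
            for u in topics_perturbed))
        best_cosine.append(max(cosine(t, u) for u in topics_perturbed))
    # A topic counts as matched if either score reaches the threshold.
    matched = sum(1 for j, c in zip(best_jaccard, best_cosine)
                  if j >= tau or c >= tau)
    n = len(topics)
    return sum(best_jaccard) / n, sum(best_cosine) / n, matched / n

# Toy example: the first topic survives the perturbation, the second does not.
T = [{"data": 0.5, "model": 0.3, "topic": 0.2}, {"music": 0.6, "sound": 0.4}]
T_prime = [{"data": 0.5, "model": 0.3, "topic": 0.2}, {"food": 0.7, "diet": 0.3}]
s_jaccard, s_cosine, r = evaluate_robustness(T, T_prime, tau=0.5, top_n=3)
print(round(s_jaccard, 3), round(s_cosine, 3), r)
```

In this toy case one of the two original topics has an exact counterpart, so both average scores and the match ratio are one half.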

4. Experiments

In order to systematically evaluate the performance and reliability of the proposed SemaTopic framework, we designed a comprehensive experimental protocol focusing on three key aspects: topic quality and semantic expressiveness, topic stability under corpus perturbation, and comparative benchmarking with state-of-the-art models. Each experimental objective is designed to highlight a distinct dimension of the model’s effectiveness, robustness, and competitiveness.

4.1. Environment

SemaTopic is implemented in Python (version 3.9.13) and integrates standard libraries for semantic topic modeling. The experimental pipeline includes the following steps:

Text Preprocessing: Conducted using NLTK (https://www.nltk.org/) for tokenization, lemmatization, subword normalization, and domain-adapted stopword filtering.
Contextual Embeddings: Generated via the SentenceTransformers (https://www.sbert.net/) library, using pre-trained BERT-based models fine-tuned for sentence-level semantic similarity.
Named Entity Recognition (NER): Applied to preserve domain-specific entities and compound concepts, thereby enhancing thematic semantic coherence.
Dimensionality Reduction: Two techniques were employed:
- UMAP [23] for semantic structure-preserving reduction prior to clustering.
- t-SNE [31] for visualizing topic separability in low-dimensional space.
Semantic clustering: Performed using HDBSCAN [19] due to its capacity to detect arbitrary-shaped clusters and filter out noise without requiring the number of clusters in advance.
Topic modeling and semantic coherence scoring: Managed through the Gensim (https://pypi.org/project/gensim/) library (19 July 2024), which enables probabilistic modeling via LDA and supports semantic coherence evaluation using the $C_{v}$ metric.

All experiments were performed on a standard laptop with an Intel Core i7-6500U CPU @ 2.59 GHz (Intel, Santa Clara, CA, USA), 8 GB RAM, and a 64 GB SSD, demonstrating that the framework is computationally accessible and reproducible on modest hardware.

4.2. Execution Time by Stage

We report average wall-clock times per stage on a standard laptop (Intel Core i7-6500U @ 2.59 GHz, 8 GB RAM, 64 GB SSD). Peak memory was not tracked in this release. Stages: (i) preprocessing and NER, (ii) embeddings, (iii) dimensionality reduction, (iv) clustering, and (v) topic inference with semantic coherence-driven hyperparameter search.

To ensure transparency and reproducibility, the complete SemaTopic pipeline is summarized in Table 4.

The framework consists of sequential stages, including preprocessing, embedding generation, dimensionality reduction, semantic clustering, probabilistic topic inference, and stability analysis. For each stage, the table reports implementation details, fixed hyperparameters, and execution time, measured on the Google Scholar corpus (https://drive.google.com/drive/folders/1F3vebIkREgKA2Bh9ScsCnRkIuR2t6lD-?usp=drive_link, accessed on 11 July 2025). Explicitly documenting these design choices highlights the methodological rigor of SemaTopic and facilitates replication in future studies.

This structured overview also provides a road map for the subsequent subsections, where each component of the pipeline is described and motivated in greater detail.

4.3. Experimental Protocol

4.3.1. Objective 1: Topic Quality and Semantic Expressiveness Evaluation

The primary goal of our experimentation is to evaluate the semantic effectiveness and semantic coherence of the topics extracted by SemaTopic. This requires assessing how effectively our model learns interpretable and contextually coherent topic structures across varied textual input data. More precisely, this phase measures the extent to which the generated topics are semantically expressive and how well they reflect coherent themes with distinct conceptual boundaries. The experimental results for this objective are reported in detail in the subsection “Evaluation Results”.

Dataset:
A domain-specific corpus of 965 scientific papers was collected from Google Scholar. The dataset covers diverse academic disciplines and is divided as follows: 169 articles in Economics, 155 in Astronomy, 130 in Management, 170 in Nutrition, 130 in Music, and 211 in Computer Science. We aimed for a thematically balanced corpus in order to provide a robust foundation for evaluating the topics extracted by SemaTopic across heterogeneous knowledge domains (the dataset is available at: https://drive.google.com/drive/folders/1rVBhcPeYYSQ_bT9HMwkWRY2T1oNsTrzN?usp=drive_link, accessed on 11 July 2025).
Evaluation metrics:
- Topic coherence: To capture domain semantics more precisely, SemaTopic relies on cluster-wise topic modeling. After learning document embeddings and applying unsupervised clustering with HDBSCAN, the corpus was partitioned into four clusters that correspond to semantically coherent sub-domains of scientific literature. This segmentation allows a local topic modeling approach in which a separate model is trained for each cluster to better accommodate the granularity of its internal topics. The best topic model for each cluster is determined by an exhaustive grid search over a three-dimensional hyperparameter space:
  - Number of topics $K \in [2, 100]$
  - Document–topic prior $α \in [0.01, 1]$
  - Topic–word prior $β \in [0.01, 1]$
- Overfitting analysis: To examine the resulting topic distribution and check for overfitting, two dimensionality reduction methods were applied: UMAP (Figure 5) and t-SNE (Figure 6). These projections are used to evaluate topic separability and internal cohesion.
- Graph-based visualization: Graph-based visualization is an efficient way to qualitatively evaluate the semantic structure and separability of the derived topics. It gives insight into how topic terms are structured and related within and across clusters.
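The per-cluster grid search over K, $α$, and $β$ can be sketched as follows. This is an illustrative outline rather than the authors' code: the helper names are hypothetical, the step of 0.1 for $α$ and $β$ follows the step size reported with the results, and `toy_score` is a stand-in for the real objective. In practice the scoring function would train an LDA model on the documents of one cluster and return its $C_{v}$ coherence (for example with Gensim).

```python
from itertools import product

def grid_values(start, stop, step):
    """Values start, start + step, ... that do not exceed stop."""
    count = int((stop - start) / step + 1e-9) + 1
    return [round(start + i * step, 2) for i in range(count)]

def grid_search(score_fn, k_values, alpha_values, beta_values):
    """Exhaustively score every (K, alpha, beta) triple and keep the best."""
    best_score, best_params = float("-inf"), None
    for k, alpha, beta in product(k_values, alpha_values, beta_values):
        score = score_fn(k, alpha, beta)
        if score > best_score:
            best_score, best_params = score, (k, alpha, beta)
    return best_params, best_score

def toy_score(k, alpha, beta):
    # Placeholder objective (assumption): peaks at K=2, alpha=0.81, beta=0.01.
    # A real run would return the C_v coherence of a trained topic model.
    return -(abs(k - 2) + abs(alpha - 0.81) + abs(beta - 0.01))

alphas = grid_values(0.01, 1.0, 0.1)   # 0.01, 0.11, ..., 0.91
betas = grid_values(0.01, 1.0, 0.1)
best_params, best_score = grid_search(toy_score, range(2, 101), alphas, betas)
print(best_params)
```

Because each cluster is searched independently, the selected triple can differ from one cluster to the next, which is the adaptive behavior the framework relies on.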

4.3.2. Objective 2: Robustness and Stability Assessment

The second goal of our experimental study is to test the robustness and stability of the proposed SemaTopic model by systematically applying perturbations to the input corpus. The experiments validating this objective are presented in detail in the “Evaluation Results” subsection.

Dataset: To evaluate the robustness of our approach, we used a domain-specific corpus composed of 946 scientific articles. A controlled perturbation was performed by randomly excluding 33 documents from the original corpus (this corpus is publicly available at https://drive.google.com/drive/folders/1F3vebIkREgKA2Bh9ScsCnRkIuR2t6lD-?usp=drive_link, accessed on 11 July 2025).
Evaluation metrics:
- Rank-Biased Overlap (RBO): This metric was used for measuring top-term lexical similarity between model runs.
- Word-probability weight differences: This metric was used to quantify the difference between the word probability distributions obtained across runs.
- Jaccard similarity index: This is computed on the top-n words of original and perturbed topics.
- Cosine similarity matrix: This is calculated between the topic vectors of perturbed and original corpora.
- Topic Matching Rate (TMR): This measures how well topics in the original corpus aligned with those from the perturbed corpus.
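As an illustration of the first metric, the sketch below computes Rank-Biased Overlap between two ranked top-term lists. The paper does not state which RBO variant or persistence parameter was used, so the extrapolated form for equal-length, duplicate-free rankings and the default `p=0.9` are assumptions made here, and the function name is hypothetical.

```python
def rbo_ext(ranking_a, ranking_b, p=0.9):
    """Extrapolated rank-biased overlap for two equal-length rankings.

    Assumes neither ranking contains duplicates. Values lie in [0, 1];
    identical rankings score 1 and disjoint rankings score 0.
    """
    k = len(ranking_a)
    if k == 0 or k != len(ranking_b):
        raise ValueError("rankings must be non-empty and of equal length")
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    seen_a, seen_b = set(), set()
    weighted_sum, overlap = 0.0, 0
    for d in range(1, k + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        overlap = len(seen_a & seen_b)         # shared items in the top d
        weighted_sum += (overlap / d) * p ** d  # agreement at depth d, damped
    return (overlap / k) * p ** k + ((1 - p) / p) * weighted_sum

run_1 = ["semantic", "topic", "model"]
run_2 = ["semantic", "model", "topic"]
print(rbo_ext(run_1, run_2, p=0.5))  # 0.875
```

Smaller values of `p` put more weight on agreement at the very top of the lists, so a swap in the lower ranks, as in this example, is penalized only mildly.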

4.3.3. Objective 3: Comparative Analysis with State-of-the-Art Methods

The goal of this step is to assess the effectiveness of our proposed method by comparing it to two state-of-the-art approaches, BERTopic [3] and LDA+ [10], on the same data and according to several criteria.

To achieve this objective, we adopted a two-pronged evaluation strategy that combines qualitative and quantitative analysis to rigorously evaluate our proposed topic modeling approach. The two complementary evaluation approaches are described as follows:

Qualitative Evaluation: Graph-based analysis was used to qualitatively assess the structural and semantic quality of the generated topic models. It relies on the visualization and interpretation of topic-term bipartite networks, which makes it possible to analyze key features such as modularity, graph density, term redundancy, and cluster distance.
Quantitative Evaluation: This evaluation uses topic coherence as its main criterion. Semantic coherence scores are numerical estimates of the relatedness of the terms ranked highly in each topic. Higher scores imply better semantic alignment and, therefore, better topic quality.
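Two of the graph features named above, density and term redundancy, can be computed directly from the top-term lists that define a topic-term bipartite graph. The sketch below is only illustrative: the paper does not give formal definitions, so density is assumed here to be the number of edges divided by the number of possible topic-term pairs, and redundancy the share of terms linked to more than one topic. The function name and the example topics are hypothetical.

```python
from collections import Counter

def bipartite_stats(topic_terms):
    """Summarize a topic-term bipartite graph.

    `topic_terms` maps a topic label to its list of top terms; each
    (topic, term) pair is one edge of the graph.
    """
    edges = {(topic, term)
             for topic, terms in topic_terms.items() for term in terms}
    term_degree = Counter(term for _, term in edges)
    n_topics, n_terms = len(topic_terms), len(term_degree)
    if n_topics == 0 or n_terms == 0:
        return {"edges": 0, "density": 0.0, "redundancy": 0.0}
    shared = sum(1 for degree in term_degree.values() if degree > 1)
    return {
        "edges": len(edges),
        "density": len(edges) / (n_topics * n_terms),  # assumed definition
        "redundancy": shared / n_terms,                 # assumed definition
    }

example = {
    "C0-T0": ["semantic", "topic", "model"],
    "C0-T1": ["retrieval", "index", "model"],
}
stats = bipartite_stats(example)
print(stats)  # {'edges': 6, 'density': 0.6, 'redundancy': 0.2}
```

Under these definitions, a graph with many shared keywords yields a high redundancy value, while topics with exclusive vocabularies drive it toward zero.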

In addition to assessing semantic effectiveness and comparative performance against state-of-the-art models, it is crucial to evaluate the scalability and multilingual adaptability of the proposed framework. This objective aims to determine whether SemaTopic can efficiently process larger document collections while preserving topic coherence, and whether it remains robust when applied to corpora containing multilingual content.

4.3.4. Objective 4: Evaluation of Scalability and Multilingual Robustness

This experiment was designed to validate two central hypotheses: (i) that SemaTopic can scale efficiently to larger corpora while maintaining computational stability, and (ii) that it can preserve topic coherence when applied to multilingual text collections. Unlike the earlier experiments, which relied on thematically heterogeneous but primarily monolingual datasets, this evaluation explicitly stresses the framework under both corpus growth and cross-lingual variation. The detailed results are reported in the subsection “Evaluation Results.”

Dataset: The evaluation was conducted on an extended corpus derived from the CEUR-WS (https://ceur-ws.org/) proceedings, comprising approximately 1700 documents from multiple workshops and academic domains. To ensure reliable semantic modeling, only documents with a textual coverage exceeding 50% of their content were retained. In addition to English texts, the dataset includes a substantial number of non-English terms (e.g., Russian), thereby providing a natural benchmark for testing multilingual adaptability. This setup enables a joint evaluation of scalability and multilingual robustness.
Evaluation metrics:
- Scalability assessment: Execution time and memory usage were measured across the main pipeline stages (embedding generation, dimensionality reduction, clustering, and topic inference). This analysis allowed us to quantify computational efficiency and identify trade-offs associated with processing larger and more heterogeneous corpora.
- Multilingual robustness: Topic coherence scores were computed and complemented with qualitative inspections of graph-based topic visualizations. This dual perspective was employed to assess whether semantically coherent topics could still be identified despite the presence of non-English terms. The persistence of meaningful clusters across languages was considered a key indicator of multilingual robustness.
- Graph-based visualization: A global topic–word graph was generated to illustrate the semantic integration of multilingual terms within clusters. These visualizations provide additional qualitative evidence of the framework’s ability to capture cross-lingual relations while preserving internal topic coherence.
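The scalability assessment above amounts to wrapping each pipeline stage in a timer and a memory probe. A minimal standard-library sketch is given below; the helper `profile_stage` and the toy tokenization stage are hypothetical and do not reproduce the authors' instrumentation. Note that `tracemalloc` only observes allocations made through Python's memory allocator, so memory used internally by native libraries may be under-reported.

```python
import time
import tracemalloc

def profile_stage(name, fn, *args, **kwargs):
    """Run one pipeline stage and report wall-clock time and peak memory."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    tracemalloc.stop()
    return result, {"stage": name, "seconds": elapsed, "peak_bytes": peak}

def tokenize(docs):
    # Stand-in for a real stage such as embedding generation or clustering.
    return [doc.lower().split() for doc in docs]

tokens, report = profile_stage("preprocessing", tokenize, ["Topic Modeling", "Semantic Graph"])
print(report["stage"], f"{report['seconds']:.6f} s", report["peak_bytes"], "bytes")
```

Collecting one such report per stage and per corpus size gives the kind of table needed to compare costs as the collection grows.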

4.4. Evaluation Results

In this section, we report the experimental results structured according to the three main goals of our study. We present evaluation metrics for each of the objectives and describe the results on each of them in detail, thus providing a complete evaluation of the performance of the SemaTopic model in terms of semantic coherence, robustness and interpretability.

4.4.1. Objective 1: Topic Quality and Semantic Expressiveness Evaluation

Topic coherence: For each cluster, 100 configurations of the topic model were evaluated by varying $α$ and $β$ in steps of 0.1. The $C_{v}$ score was used to measure the semantic coherence of each candidate; this metric assesses the semantic regularity of the top terms of a topic from word co-occurrence statistics and contextual similarity. It serves as the selection criterion, so that the most interpretable topic structure for each cluster is found automatically.
As shown in Table 5, selecting configurations with the $C_{v}$ semantic coherence measure not only promotes high semantic consistency but also improves the interpretability and robustness of the model. In all four semantic clusters, a two-topic solution was clearly preferred, indicating compact and thematically concentrated groups of documents. Furthermore, the differences between the optimal hyperparameters $α$ and $β$ across clusters demonstrate the adaptive property of our proposed approach (SemaTopic). The model does not adopt a uniform set of parameters for the entire corpus; instead, it adjusts the configuration to reflect the local distributional properties of each cluster. This adaptive tuning leads to more precise and interpretable topic extraction and reinforces the effectiveness of the cluster-wise decomposition strategy of the SemaTopic framework.

Table 5. Optimized hyperparameters for each semantic cluster.

Cluster	Number of Topics (K)	Document–Topic Prior ( $α$ )	Topic–Word Prior ( $β$ )
Cluster 0	2	0.81	0.01
Cluster 1	2	0.92	0.21
Cluster 2	2	0.87	0.41
Cluster 3	2	0.91	0.01

Table 6 summarizes the semantic clusters used in the topic modeling process in terms of the number of documents and total words assigned to each cluster. These are the clusters for which the best hyperparameters were determined (Table 5). Cluster sizes differ: Cluster 3 is the largest (128 documents) and Cluster 1 the smallest (31 documents). The word counts of Clusters 0, 2, and 3 are nevertheless quite similar (6347 to 6820 words), so the lexical representation is balanced across these clusters, whereas Cluster 1 is considerably smaller (2011 words).

Table 6. Corpus statistics per semantic cluster.

Cluster ID	Number of Documents	Words Count
Cluster 0	89	6795
Cluster 1	31	2011
Cluster 2	97	6820
Cluster 3	128	6347

This adaptive modeling approach overcomes the drawback of uniform topic modeling by adjusting the topic granularity to the semantic alignment of the cluster. As a result, in addition to building more coherent local topics, $S e m a T o p i c$ can boost the system-wide robustness of the topic modeling pipeline, demonstrating the scalability and adaptability of the framework over domain-specific semantics.
Overfitting Analysis: In the optimal case, each cluster forms a clearly separated and compact group in the reduced space, without overlap, indicating meaningful topic boundaries.
Figure 5 and Figure 6 show 2D visualizations of the topic assignments using UMAP [23] and t-SNE [31], respectively, which provide complementary views of the structure and semantic coherence of the topics in our proposed approach. Each point is a document, and similar documents lie closer to each other. As can be seen, both projections exhibit clear, well-separated topic clusters with little sign of overlap.
This spatial separation of the topic assignments is a strong indication that the topics are semantically coherent and non-redundant. In addition, the agreement between UMAP [23] and t-SNE [31] supports the robustness of our approach and its ability to generalize, i.e., the model learns useful patterns rather than overfitting to noise or superficial correlations.
Graph-based visualization:
Figure 7 shows a topic-term bipartite graph that reflects the topic structure learned by the SemaTopic model on a semantically clustered scientific dataset. In this visualization:
- Large colored nodes represent topics, where each color corresponds to a different semantic cluster. The graph comprises four clusters (Cluster 0 to Cluster 3), each containing two topics, for a total of eight topic nodes.
- Smaller blue nodes correspond to the most significant terms (i.e., top-ranked words) related to each topic, as determined from their inferred topic–word distributions.
- Edges connect each topic node with its top-ranked words to give a sense of the content and uniqueness of the topic.

Figure 7. Bipartite graph representation of topics and terms across semantic clusters.


The visualization reveals several clear patterns. First, the term sets of the various topics are clearly separated, with little overlap, which indicates high topic exclusivity and strong semantic coherence. Second, topic pairs within a cluster lie close together, while topics from different clusters are spread apart in the semantic space. This supports the rationale for decomposing the corpus by clustering before topic modeling. The separation confirms that SemaTopic not only captures fine-grained semantic topics inside a cluster but also maintains diversity between clusters. The graph therefore provides qualitative evidence of both topic separability and semantic clustering, which supports the overall interpretability and semantic coherence of the resulting topic structure.
Topic-term associations in the global view (Figure 7) may look compressed because the nodes are densely layered. Figure 8 therefore provides an enlarged view of the graph around Topic 1 of Cluster 1 and Topic 2 of Cluster 3. This local perspective makes word boundaries more clearly visible and improves the readability of terms, which is important for a qualitative analysis of topic content.

Figure 8. Zoomed bipartite visualization highlighting topic-term structures in Cluster 1—Topic 1 and Cluster 3—Topic 2.


In Figure 8, words like dementia, symptoms, and agitation are highly associated with Topic 1 of Cluster 1, indicating a cohesive theme concerning neurodegenerative disorders. In contrast, Topic 2 of Cluster 3 is characterized by words like grammar, nodes, and document, which reflect a semantic focus on structural or computational linguistics. This localized visualization provides a qualitative indication of how well the model produces semantically coherent and topically distinct themes, and further demonstrates the interpretability and discriminative capability of the SemaTopic framework across diverse semantic contexts.
Importantly, the dataset used (https://drive.google.com/drive/folders/1rVBhcPeYYSQ_bT9HMwkWRY2T1oNsTrzN?usp=drive_link, accessed on 11 July 2025) does not explicitly contain documents related to neurodegenerative diseases or references to computational linguistics. Nonetheless, the discovery of thematically coherent topics, corroborated by their associated terms, showcases the ability of the model to learn latent semantic structures. This capability demonstrates that SemaTopic can uncover sensible and interpretable topics that do not depend on predefined domain annotations and confirms its suitability for unsupervised exploration of complex and heterogeneous corpora.

4.4.2. Objective 2: Robustness and Stability Assessment

Stability evaluation:
In Figure 9, we offer a detailed analysis of the stability of SemaTopic across runs using two measures: Rank-Biased Overlap (RBO) for top-term similarity, and the difference in word-probability weights. The RBO scores indicate high lexical consistency for Topic 0, for which they range between 0.77 and 0.87, while Topic 1 fluctuates at a moderate level. At the same time, the small differences in probability weights (between 0.00 and 0.0021) show strong distributional stability across runs. Taken together, these results support the stability and reproducibility of SemaTopic under stochastic perturbations and indicate that it is a reliable method for deriving consistent topics for subsequent semantic analysis.

Figure 9. Assessing topic stability in $S e m a T o p i c$ across runs.

Robustness evaluation:
For the robustness analysis, the SemaTopic model was re-run on the modified corpus to check whether it is robust to input variability. Despite the data reduction, the model still recovered the same number of semantic clusters (four), supporting its structural stability.
Figure 10 and Figure 11 show the cosine similarity and Jaccard similarity matrices, which indicate high similarity between the original and perturbed topic sets. Several topic pairs across the runs have high cosine similarity (e.g., 0.8, 0.7, 0.6), and the Jaccard overlaps are larger than 0.5 for the key matched pairs, confirming that semantic consistency is maintained not only in terms but also in distribution.
Furthermore, the TMR analysis in the bar chart (Figure 12) shows that SemaTopic preserved a topic match rate of at least 50% for all clusters, and some clusters even achieved a perfect match rate (TMR = 1.0). Together, these results confirm the robustness of SemaTopic under realistic corpus perturbations: the model preserves the high-level topic structure and the semantic coherence of the extracted topics, which makes it suitable for dynamic or partial real-world datasets.

4.4.3. Objective 3: Comparative Analysis with State-of-the-Art Methods

Qualitative evaluation: The comparison of the topic-term network visualizations obtained by the different topic modeling techniques (BERTopic [3] (Figure 13), LDA+ [10] (Figure 14), and SemaTopic (Figure 7)) reveals the semantic and structural improvement of SemaTopic over the others. As shown in Figure 13, BERTopic, although based on strong transformer-based embeddings, suffers from high topic fragmentation, producing an overly dense and interconnected graph in which semantic boundaries are not clearly delineated and many keywords appear redundantly in multiple topics. This lack of separation reduces interpretability and limits usefulness in domain applications. The LDA+ representation in Figure 14 is sparser and more readable, but LDA+ is a global topic model that does not capture the heterogeneous subdomains of the corpus. It can therefore produce overly generic topics that lack concrete detail and ignore local semantic variations.

In contrast, the SemaTopic graph in Figure 7 shows strongly connected clusters, each consisting of two coherent and clearly distinct topics tailored to its semantic area. Low redundancy and strong topic-term cohesion result in clear visual spacing and fewer edge overlaps between terms. This design ensures that each cluster is modeled with the appropriate topic granularity, leading to semantically rich and non-overlapping topics. Furthermore, the lower graph density and higher modularity show how our approach captures meaningful structure and increases interpretability and semantic coherence. In short, this comparative study highlights the robustness, scalability, and domain adaptability of SemaTopic, which makes it an effective alternative for obtaining high-quality and human-interpretable topics from complex textual data.

Figure 13. Topic-term bipartite graph generated by BERTopic on a domain-specific scientific corpus of 946 scientific articles.

Figure 14. The extracted topics by our previous approach (LDA+) on the domain-specific scientific corpus of 946 scientific articles.

In order to empirically evaluate the performance of our proposed approach compared to current state-of-the-art topic modeling methods, we conducted an experiment on the 20 Newsgroups dataset (https://www.kaggle.com/datasets/crawford/20-newsgroups, accessed on 1 July 2025), which consists of informal and noisy newsgroup posts, using three different models: SemaTopic, BERTopic, and LDA+. This experiment served not only for performance comparison but also for studying how semantic reinforcement mechanisms can improve topic extraction. It also allowed us to observe how differences in the quality and quantity of input data may affect the properties of the extracted topics.

As shown in Figure 15, although BERTopic integrates semantic clustering using class-based TF-IDF and applies transformer-based embeddings, the generated graph is very dense and hard to interpret, with heavily overlapping topics, redundant key phrases, and little structural separation between topics. This indicates that semantic clustering alone, when unaware of the latent structure of the corpus or when combined with weak disambiguation strategies, can inflate topic granularity without adding semantic distinction.

In Figure 16, by contrast, LDA+ improves on traditional LDA through optimized priors but yields an overly sparse topic graph. The associations between topics and terms are weak or even decoupled, leading to imprecise and uninformative topics. This highlights the limitations of non-contextual models when confronted with short and informal texts, for which term frequency alone is not enough to capture latent topics.

By comparison, the graph produced by our SemaTopic model in Figure 17 presents discrete topic clusters, strong term-topic coherence and a clear structure at the global level. These gains result from three key design choices: (1) the use of contextual embeddings for expressing fine-grained semantics at the document level, (2) the use of semantic clustering, where documents that share latent semantic affinity are grouped together, and (3) the adaptive fine-tuning of our model parameters per semantic cluster to maximize topic coherence.

Figure 15. Topic-term bipartite graph generated by BERTopic using the 20 Newsgroups dataset.

Significantly, this advantage is maintained despite the noisy, low-quality nature of the dataset, which lacks neatly structured linguistic information. The results also show that data quality is a decisive factor in the performance of a topic model. In the output of our approach, some extracted topic terms seem generic, vague, or semantically weak; this is not due to poor modeling but is a direct effect of the limited lexical richness and semantic density of the available input. Nevertheless, our approach remains capable of preserving topic separability and semantic coherence at the structural level, surpassing the baseline methods (BERTopic [3], LDA+ [10]) in visual clarity and interpretability.

These findings show not only that semantic refinement strategies can be used to contain data noise but also that it is still the quality and expressiveness of input documents that plays a determinant role in topic-term quality.

To substantiate the above findings with concrete evidence, Table 7 presents a qualitative comparison of representative topics extracted by LDA+, BERTopic, and SemaTopic on the Google Scholar dataset. This analysis complements the graph-based visualizations by illustrating the actual topic content and highlighting differences in semantic clarity and interpretability.

As observed, SemaTopic produces semantically dense and thematically coherent topics, in which terms are conceptually aligned and readily interpretable (e.g., semantic, coherence, representation, topic). In contrast, LDA+ tends to generate overly broad and generic groupings that lack fine-grained semantic detail, while BERTopic, despite its use of transformer-based embeddings, frequently combines loosely related or noisy terms (e.g., airbnb, ukraine, gps, xml, lyft) within a single topic. These findings highlight the superior ability of SemaTopic to deliver topics that are both structurally well-organized and semantically faithful, thereby enhancing interpretability compared to existing baselines.

Figure 16. Topic-term bipartite graph generated by LDA+ using the 20 Newsgroups dataset.

Figure 17. Topic-term bipartite graph generated by SemaTopic using the 20 Newsgroups dataset.

Table 7. Qualitative comparison of representative topics generated by LDA+, BERTopic, and SemaTopic on the Google Scholar corpus (965 documents).

Model	Topic 1	Topic 2	Topic 3
LDA+	data, information, user, web, media	research, study, service, social, tech	health, music, news, medical, odd
BERTopic	airbnb, ukraine, gps, xml, lyft	ecosystem, schizophrenia, rdf, intake, folate	bash, student, dyslexia, nutrient, taxi
SemaTopic (ours)	semantic, coherence, representation, topic, distribution	retrieval, index, structure, similarity, model	clustering, pattern, fuzzy, detection, accuracy

Quantitative evaluation: In addition to the qualitative investigation based on topic-term bipartite graphs, we performed a quantitative validation of semantic coherence to support the comparison between the three models, using the $C_{v}$ coherence metric as the benchmark measure. Experiments were performed on the 20 Newsgroups dataset, a difficult benchmark because of its noisy, tweet-like nature and limited lexical diversity. To handle the semantic sparsity of the data, the maximum number of candidate topics per cluster was limited to five in the optimization process. As shown in Figure 18, Cluster 2 obtains the best topic coherence, 0.5315, under the configuration $K = 2$, $α = 0.01$, $β = 0.71$. Cluster 3 reaches its maximum semantic coherence of 0.4906 with $K = 2$, $α = 0.81$, $β = 0.01$, as shown in Figure 19. In contrast, Cluster 1, with 38 documents (Figure 20), shows a semantic coherence score of 1.0 in every configuration. This saturation suggests that the data are not fine-grained enough to separate multiple topics, so Cluster 1 was treated as a single, monolithic topic rather than being split further.
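The per-cluster search over $(K, α, β)$ described above can be sketched as a plain grid search. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the grid values (a prior step of 0.10), the `score_fn` callable, and the toy scorer are all hypothetical. In a real run, `score_fn` would train an LDA model on the cluster and return its $C_{v}$ coherence.

```python
from itertools import product
from typing import Callable, Iterable, Optional, Tuple

Config = Tuple[int, float, float]  # (K, alpha, beta)


def grid_search(score_fn: Callable[[int, float, float], float],
                ks: Iterable[int],
                alphas: Iterable[float],
                betas: Iterable[float]) -> Tuple[Optional[Config], float]:
    """Return the (K, alpha, beta) configuration with the highest score."""
    best_cfg, best_score = None, float("-inf")
    for k, alpha, beta in product(ks, alphas, betas):
        score = score_fn(k, alpha, beta)
        if score > best_score:
            best_cfg, best_score = (k, alpha, beta), score
    return best_cfg, best_score


# Assumed grid: K is capped at 5 as stated in the text; the prior values
# 0.01, 0.11, ..., 0.91 are a hypothetical choice for illustration.
KS = range(2, 6)
PRIORS = [round(0.01 + 0.10 * i, 2) for i in range(10)]


def toy_score(k: int, alpha: float, beta: float) -> float:
    """Stand-in for a C_v computation (NOT a real coherence measure).

    It is deliberately built to peak at the configuration reported for
    Cluster 2, so that the search logic can be checked in isolation.
    """
    return 0.5315 - 0.01 * (k - 2) - 0.1 * abs(alpha - 0.01) - 0.1 * abs(beta - 0.71)


best_cfg, best_score = grid_search(toy_score, KS, PRIORS, PRIORS)
print(best_cfg, round(best_score, 4))
```

Keeping the scorer as a parameter separates the search strategy from the coherence computation, so the same loop can be reused for each cluster with a different corpus-specific scorer.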
SemaTopic outperforms all the baselines in terms of semantic coherence, with its best cluster reaching 0.5315. BERTopic, which also relies on semantic clustering, yielded a semantic coherence score of 0.5004, but its topics are hard to interpret because of the densely entangled graph. Similarly, LDA+ reached its best semantic coherence of 0.4729 with $K = 3$, $α = 0.81$, $β = 0.81$, but it does not integrate fine-grained semantic context into clustering. By comparison, SemaTopic integrates contextualized embeddings with adaptive Dirichlet parameter learning, which yields more semantically coherent and discriminative topic structures.
To summarize, this dual evaluation, combining visualization and semantic coherence scoring, validates the robustness and semantic faithfulness of the proposed framework. Despite the noise and structural limitations of the 20 Newsgroups corpus, SemaTopic maintains topic separability and meaning, which underlines the importance of semantic-aware modeling with rich contextual embeddings in low-resource text analytics.

Figure 18. Semantic coherence optimization for Cluster 2 across topic numbers and Dirichlet priors (K, α, β).

Figure 19. Semantic coherence optimization for Cluster 3 across topic numbers and Dirichlet priors (K, α, β).

Figure 20. Semantic coherence optimization for Cluster 1 across topic numbers.

Beyond BERTopic and LDA+, a number of embedding-based and neural models have been introduced in recent years to advance topic modeling. Approaches such as LDA2Vec [12] and the Embedded Topic Model (ETM) [11] integrate static embeddings into probabilistic frameworks, thereby improving semantic cohesion. However, their reliance on context-independent representations reduces robustness, often leading to ambiguous topic assignments in heterogeneous corpora. Similarly, WE-LDA [18] extends LDA with word embeddings and shows promising results in domain-specific settings, yet it inherits the limitations of static embeddings and struggles with scalability. Top2Vec [9], by directly clustering dense document embeddings, generates coherent topics without requiring a predefined topic number; nevertheless, the absence of explicit topic–word distributions limits its interpretability. More sophisticated approaches, including graph-based and neural variational topic models, capture richer semantic dependencies, but they typically incur substantial computational costs and lack transparency, which restricts their applicability in large-scale, real-world contexts.

While coherence ($C_{v}$) remains the most widely adopted metric for assessing topic quality, it does not fully capture other essential dimensions such as stability, interpretability, and scalability. Table 8 therefore summarizes a multidimensional comparison of representative models, highlighting trade-offs across these criteria.

Notably, SemaTopic achieves a relative gain of +6.2% in semantic coherence compared to BERTopic ($C_{v} = 0.53$ vs. $0.50$), while simultaneously ensuring higher stability and scalability. These results demonstrate its ability to consistently generate semantically coherent and human-interpretable topics, even when applied to heterogeneous scientific corpora.
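The relative gain can be checked with a short calculation. The sketch below assumes, as one possible reading, that the +6.2% figure is derived from the unrounded scores quoted earlier in this section (0.5315 for the best SemaTopic cluster and 0.5004 for BERTopic); the two-decimal values give a slightly smaller figure. The helper name `relative_gain` is illustrative.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Assumption: unrounded scores quoted in the quantitative evaluation.
gain_unrounded = relative_gain(0.5315, 0.5004)
# Two-decimal values as they appear in the comparison (0.53 vs. 0.50).
gain_rounded = relative_gain(0.53, 0.50)

print(f"{gain_unrounded:.1f}% vs {gain_rounded:.1f}%")  # prints "6.2% vs 6.0%"
```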

4.5. Objective 4: Evaluation of Scalability and Multilingual Robustness

Scalability assessment: One of the essential requirements of a topic modeling framework is its ability to scale efficiently to larger corpora without compromising topic quality. To assess this property, we compared the performance of SemaTopic on two datasets of different sizes and characteristics: the Google Scholar corpus (965 documents, primarily monolingual) and the CEUR-WS proceedings corpus (1700 documents, multilingual). Execution time was measured for the entire pipeline, which includes embedding generation, dimensionality reduction, clustering, and probabilistic topic inference.

As shown in Table 9, execution time increased from approximately 40 min for 965 documents to 500 min for 1700 documents, i.e., a 12.5-fold increase in runtime for a roughly 1.76-fold increase in corpus size. This supra-linear growth is mainly explained by the higher computational cost of embedding generation and density-based clustering (HDBSCAN), which become more demanding on heterogeneous corpora. Despite this increase, the pipeline consistently produced stable and interpretable topics, which supports the scalability of SemaTopic in terms of topic quality.

Table 9. Execution time of SemaTopic across different corpora.

Corpus	Documents	Execution Time (min)	Observations
Google Scholar	965	40	Monolingual dataset; relatively fast execution.
CEUR-WS	1700	500	Larger, multilingual dataset; slower but stable processing.
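The supra-linear growth reported in Table 9 can be quantified with a rough calculation. The sketch below is illustrative only: it assumes a power-law relation between runtime and corpus size, and with just two measurements that also differ in language mix, the resulting exponent is indicative rather than a fitted scaling law. The function names are hypothetical.

```python
import math

# (documents, execution time in minutes), taken from Table 9
runs = {"Google Scholar": (965, 40.0), "CEUR-WS": (1700, 500.0)}


def seconds_per_document(docs: int, minutes: float) -> float:
    """Average wall-clock time spent per document, in seconds."""
    return 60.0 * minutes / docs


def scaling_exponent(n1: int, t1: float, n2: int, t2: float) -> float:
    """Exponent p such that t is proportional to n**p for two (n, t) points."""
    return math.log(t2 / t1) / math.log(n2 / n1)


(n1, t1), (n2, t2) = runs["Google Scholar"], runs["CEUR-WS"]
p = scaling_exponent(n1, t1, n2, t2)

for name, (docs, minutes) in runs.items():
    print(f"{name}: {seconds_per_document(docs, minutes):.1f} s/document")
print(f"implied exponent: {p:.2f}")
```

An exponent well above 1 means the per-document cost itself grows with the corpus, which is consistent with the explanation given in the text that clustering and embedding costs dominate on the heterogeneous corpus.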

These findings demonstrate that SemaTopic is capable of handling corpora beyond one thousand documents while preserving semantic interpretability and coherence. This confirms the scalability of the framework, making it applicable to real-world, large-scale text-mining scenarios.

Multilingual robustness: The results obtained on the CEUR-WS dataset highlight the ability of SemaTopic to maintain semantic consistency across multilingual and heterogeneous clusters. As shown in Table 10, Cluster 0 achieved the highest coherence score (0.60) despite containing the smallest number of documents (166), with a vocabulary size of 4824. Cluster 1, which comprised the largest portion of the dataset (1310 documents, 5970 unique terms), maintained a moderate coherence score of 0.52. Even Cluster 2, with 224 documents and the smallest vocabulary (3135 terms), preserved a coherence level of 0.47, reflecting a non-trivial degree of semantic stability. This lowest score is comparable to or higher than those of several baselines on monolingual corpora, suggesting that SemaTopic can mitigate cross-lingual noise.

Table 10. Topic coherence and hyperparameter settings for the CEUR-WS corpus.

Cluster	Documents	Vocabulary Size	Coherence	$α$	$β$	K (Topics)
Cluster 0	166	4824	0.60	0.71	0.20	3
Cluster 1	1310	5970	0.52	0.50	0.01	3
Cluster 2	224	3135	0.47	0.11	0.01	3
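To make the comparison with monolingual baselines concrete, the per-cluster coherence values of Table 10 can be set against the baseline $C_{v}$ scores of Table 8. The sketch below is only indicative, since the two tables were obtained on different corpora (CEUR-WS and Google Scholar); the helper `matched_or_beaten` is a hypothetical name.

```python
# Baseline C_v scores from Table 8 (Google Scholar corpus)
baselines = {"LDA+": 0.47, "LDA2Vec": 0.43, "ETM": 0.42,
             "Top2Vec": 0.48, "BERTopic": 0.50}

# Per-cluster coherence from Table 10 (CEUR-WS corpus)
ceur_clusters = {"Cluster 0": 0.60, "Cluster 1": 0.52, "Cluster 2": 0.47}


def matched_or_beaten(score: float, reference: dict) -> list:
    """Names of reference models whose score is matched or exceeded."""
    return sorted(name for name, ref in reference.items() if score >= ref)


summary = {cluster: matched_or_beaten(score, baselines)
           for cluster, score in ceur_clusters.items()}

for cluster, models in summary.items():
    print(f"{cluster}: matches or exceeds {len(models)}/{len(baselines)} baselines")
```

Under these figures, Clusters 0 and 1 match or exceed all five baselines, while Cluster 2 matches or exceeds three of them (ETM, LDA+, and LDA2Vec).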

Overall, these findings indicate that SemaTopic effectively balances coherence and scalability across varying corpus sizes and linguistic diversity. The number of topics is stable across clusters (K = 3), while $α$ and $β$ are adapted to each cluster; this supports the robustness of the probabilistic inference layer and indicates that the framework can adapt to multilingual input without compromising topic interpretability.
To complement these quantitative results, we further employ graph-based visualizations to qualitatively examine how multilingual terms (e.g., Russian and Ukrainian words) are integrated within clusters. This analysis provides additional evidence of the model’s ability to capture cross-lingual semantic relationships while preserving topic coherence.
Graph-based visualization: Graph-based visualization was employed to qualitatively assess the semantic organization of topics and the integration of multilingual terms (Figure 21, Figure 22 and Figure 23).
The integration of multilingual terminology observed in these graphs indicates that SemaTopic captures cross-lingual semantic relations without degrading topic coherence. Such qualitative evidence complements the quantitative results, reinforcing the framework’s scalability and robustness for large and multilingual corpora.
These results (Figure 21, Figure 22 and Figure 23) demonstrate that, unlike many embedding-based baselines that are sensitive to corpus size or language variation, SemaTopic maintains both scalability and cross-lingual robustness. This highlights its potential for real-world deployment in heterogeneous and multilingual text-mining scenarios.

5. Conclusions

This paper introduced SemaTopic, a cluster-specific and semantically enriched topic modeling framework that combines contextual embeddings with adaptive LDA-based inference. Through a comprehensive set of quantitative and qualitative evaluations, including coherence metrics, stability analyses, and graph-based visualizations, we demonstrated that SemaTopic consistently outperforms classical and neural baselines in terms of coherence, interpretability, and robustness.

Importantly, the framework proved resilient to corpus perturbations (e.g., addition or removal of documents) and scalable to heterogeneous multilingual datasets, confirming its applicability to real-world text-mining scenarios.

Future work will focus on extending SemaTopic with domain adaptation capabilities, optimizing runtime efficiency, and exploring broader corpora. These directions aim to further enhance its robustness and practical deployment in diverse NLP applications.

Author Contributions

Conceptualization, A.D. and S.S.; methodology, A.D.; software, A.D.; validation, A.D., S.S. and R.C.; formal analysis, R.C.; investigation, A.D.; resources, A.D.; data curation, A.D.; writing—original draft preparation, A.D., S.S. and R.C.; writing—review and editing, A.D.; visualization, A.D.; supervision, R.C. and S.S.; project administration, A.T. and A.J.; funding acquisition, R.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was funded by the authors.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
2. Mersha, M.A.; Yigezu, M.G.; Kalita, J. Semantic-Driven Topic Modeling Using Transformer-Based Embeddings and Clustering Algorithms. arXiv 2024, arXiv:2410.00134v1.
3. Grootendorst, M. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv 2022, arXiv:2203.05794v1.
4. Abdelrazek, A.; Eid, Y.; Gawish, E.; Medhat, W.; Hassan, A. Topic Modeling Algorithms and Applications: A Survey. Inf. Syst. 2022, 112, 102131. Available online: https://www.researchgate.net/publication/364447028 (accessed on 1 October 2022).
5. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019.
6. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019.
7. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv 2019.
8. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013, arXiv:1301.3781.
9. Angelov, D. Top2Vec: Distributed representations of topics. arXiv 2020, arXiv:2008.09470.
10. Drissi, A.; Khemiri, A.; Sassi, S.; Tissaoui, A.; Chbeir, R.; Jemai, A. LDA+: An Extended LDA Model for Topic Hierarchy and Discovery. In Recent Challenges in Intelligent Information and Database Systems, Proceedings of the 14th Asian Conference, ACIIDS 2022, Ho Chi Minh City, Vietnam, 28–30 November 2022; Springer: Berlin/Heidelberg, Germany, 2022.
11. Dieng, A.B.; Ruiz, F.J.; Blei, D.M. Topic Modeling in Embedding Spaces. 2020. Available online: https://aclanthology.org/2020.tacl-1.29/ (accessed on 1 July 2020).
12. Moody, C.E. Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec. arXiv 2016.
13. Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global Vectors for Word Representation. Researchgate. 2014. Available online: https://www.researchgate.net/publication/284576917 (accessed on 1 January 2014).
14. Sia, S.; Dalmia, A.; Mielke, S.J. Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too! Association for Computational Linguistics. 2020. Available online: https://aclanthology.org/2020.emnlp-main.135/ (accessed on 1 January 2020).
15. Likas, A.; Vlassis, N.; Verbeek, J.J. The global k-means clustering algorithm. Pattern Recognit. 2003, 36, 451–461.
16. Van der Linde, A. PCA-Based Dimensionality Reduction. J. Nonparametric Stat. 2003, 15, 77–92. Available online: https://www.tandfonline.com/doi/abs/10.1080/10485250306037 (accessed on 1 October 2021).
17. Nguyen, D.Q.; Billingsley, R.; Du, L.; Johnson, M. Improving Topic Models with Latent Feature Word Representations. arXiv 2018, arXiv:1810.06306v1.
18. Shi, M.; Liu, J.; Zhou, D.; Tang, M.; Cao, B. WE-LDA: A Word Embeddings Augmented LDA Model for Web Services Clustering. In Proceedings of the 2017 IEEE 24th International Conference on Web Services, Honolulu, HI, USA, 25–30 June 2017.
19. Campello, R.J.; Moulavi, D.; Sander, J. Density-Based Clustering Based on Hierarchical Density Estimates. In Advances in Knowledge Discovery and Data Mining, Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Gold Coast, Australia, 14–17 April 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 160–172.
20. Bianchi, F.; Terragni, S.; Hovy, D. Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence. arXiv 2021, arXiv:2004.03974.
21. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. Albert: A lite bert for self-supervised learning of language representations. arXiv 2020, arXiv:1909.11942.
22. Diego, S.U. A Process for Topic Modelling Via Word Embeddings. arXiv 2023, arXiv:2312.03705.
23. McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv 2020.
24. Zhou, D.; Hu, X.; Wang, R. Neural Topic Modeling by Incorporating Document Relationship Graph. arXiv 2020.
25. Rajendran, B.; Vidya, C.G.; Sanil, J.; Asharaf, S. A Local Explainability Technique for Graph Neural Topic Models. Hum. Centric Intell. Syst. 2024, 4, 53–76.
26. Li, W.; McCallum, A. Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of the ACM, ICML ’06: 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 577–584.
27. Ekinci, E.; Omurca, S. NET-LDA: A Novel Topic Modeling Method Based on Semantic Document Similarity. Turk. J. Electr. Eng. Comput. Sci. 2020, 28, 2244–2260. Available online: https://www.researchgate.net/publication/340732115 (accessed on 1 July 2020).
28. Cao, Z.; Li, S.; Liu, Y.; Li, W.; Ji, H. A novel neural topic model and its supervised extension. In Proceedings of the ACM, AAAI’15: Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015; pp. 2210–2216.
29. Miao, Y.; Grefenstette, E.; Blunsom, P. Discovering Discrete Latent Topics with Neural Variational Inference. arXiv 2018, arXiv:1706.00359.
30. Wu, X.; Nguyen, T.; Luu, A.T. A Survey on Neural Topic Models: Methods, Applications, and Challenges. arXiv 2024, arXiv:2401.15351v2.
31. Cieslak, M.C.; Castelfranco, A.M.; Roncalli, V.; Lenz, P.H.; Hartline, D.K. t-Distributed Stochastic Neighbor Embedding (t-SNE): A tool for eco-physiological transcriptomic analysis. Mar. Genom. 2020, 51, 100723.

Figure 2. Illustration of sub-topic detection in SemaTopic.

Figure 3. Plate notation diagram of the SemaTopic model.

Figure 4. Plate notation diagram of LDA model.

Figure 5. Evaluation of topic overlap using UMAP.

Figure 6. Evaluation of topic overlap using t-SNE.

Figure 10. Jaccard similarity matrix between top-n words of original and perturbed topics.

Figure 11. Cosine similarity matrix between topics of original and perturbed corpus.

Figure 12. Topic Matching Rate (TMR) per cluster after corpus perturbation.

Figure 21. Global topic–word graph for the CEUR-WS corpus.

Figure 22. Local topic–word graphs highlighting multilingual integration within Cluster 2.

Figure 23. Local topic-word graphs highlighting multilingual integration within Cluster 1.
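The stability measures named in the captions of Figures 10 and 12 can be illustrated with a small sketch. It rests on assumptions: the Topic Matching Rate is taken here to be the share of original topics whose best Jaccard match in the perturbed run reaches the threshold $τ = 0.7$, which may differ from the exact definition used in the methodology, and the perturbed topics below are hypothetical examples.

```python
def jaccard(a, b) -> float:
    """Jaccard similarity between two collections of top-n words."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(original, perturbed):
    """Rows: original topics; columns: topics from the perturbed corpus."""
    return [[jaccard(o, p) for p in perturbed] for o in original]


def topic_matching_rate(original, perturbed, tau: float = 0.7) -> float:
    """Assumed definition: share of original topics with a match >= tau."""
    matrix = similarity_matrix(original, perturbed)
    matched = sum(1 for row in matrix if row and max(row) >= tau)
    return matched / len(original) if original else 0.0


# Original topics: SemaTopic rows of Table 7. Perturbed topics: hypothetical.
original = [
    ["semantic", "coherence", "representation", "topic", "distribution"],
    ["retrieval", "index", "structure", "similarity", "model"],
]
perturbed = [
    ["semantic", "coherence", "representation", "topic", "embedding"],
    ["retrieval", "index", "structure", "similarity", "model"],
]

tmr = topic_matching_rate(original, perturbed, tau=0.7)
print(tmr)  # 0.5: one topic is unchanged, the other shares 4 of 6 distinct words
```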

Table 1. Comparative evaluation of existing topic modeling approaches across multiple challenges.

Approach	Challenge 1					Challenge 2
Approach	C1	C2	C3	C4	C5	C6	C7	C8	C9
lda2vec [12] (2016)	–	No	–	Topics	Medium	–	–	No	No
WE-LDA [18] (2017)	Yes	No	–	Topics	Medium	–	–	No	No
Yishu et al. [29] (2018)	Yes	No	–	Topics	Medium	–	–	–	No
Adji et al. [11] (2020)	Yes	No	–	Topics	Medium	–	–	No	No
Top2Vec [9] (2020)	Yes	Yes	–	–	Medium	Yes	–	No	Yes
NET-LDA [27] (2020)	Yes	No	Yes	Topics	Medium	–	–	No	No
BERTopic [3] (2022)	Yes	Yes	–	–	High	Yes	Yes	No	Yes
SemaTopic (Ours)	Yes	Yes	Yes	Topics + Sub-topics	High	Yes	Yes	Yes	Yes

Table 2. Variable definitions in the plate notation diagram of the SemaTopic model.

Symbol	Definition
$α$	Dirichlet prior for topic distribution per document
$θ_{d}$	Topic distribution for document d, drawn from $Dirichlet (α)$
$z_{d}$	Cluster-level topic assignment for document d
$z_{d, i}$	Topic assignment for the i-th word in document d
$ϕ_{k}$	Word distribution for topic k
$w_{i}$	Observed word at position i in document d
e	BERT-based embeddings for document d
$C_{j}$	Document cluster produced by HDBSCAN
T	Number of topics
W	Vocabulary size

Table 3. Variable definitions in the plate notation diagram of the LDA model.

Symbol	Definition
$α$	Dirichlet prior for topic distribution per document
$θ$	Topic distribution for a document, sampled from $Dirichlet (α)$
$z_{i}$	Topic assignment for the i-th word in the document
$φ$	Word distribution for topic k, drawn from $Dirichlet (β)$
$β$	Dirichlet prior for word distribution per topic
$w_{i}$	Observed word at position i
T	Number of latent topics
$|W|$	Vocabulary size
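For reference, the variables of Table 3 correspond to the standard LDA generative process [1], which can be written as follows. The document length $N$ is a symbol introduced here for illustration and does not appear in the table.

```latex
\begin{align*}
\varphi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1, \dots, T \\
\theta &\sim \mathrm{Dirichlet}(\alpha) & &\text{(drawn once per document)} \\
z_i \mid \theta &\sim \mathrm{Multinomial}(\theta), & i &= 1, \dots, N \\
w_i \mid z_i, \varphi &\sim \mathrm{Multinomial}(\varphi_{z_i}), & i &= 1, \dots, N
\end{align*}
```

Each topic $k$ thus has a distribution $\varphi_k$ over the $|W|$ vocabulary terms, and each word $w_i$ is generated by first choosing a topic $z_i$ from the document-level mixture $\theta$.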

Table 4. Pipeline stages of SemaTopic with implementation details, fixed parameters, and execution time breakdown (Google Scholar corpus).

Stage	Implementation Details	Parameters (Values)	Execution Time (min)
Preprocessing + NER	spaCy, HuggingFace Transformers	–	–
Embeddings	SentenceTransformers (SciBERT model)	–	2.5
UMAP	Dimensionality reduction	`n_neighbors=15`, `min_dist=0.1`, `dim=10`	3.2
HDBSCAN	Density-based semantic clustering	`min_cluster_size=15`, `min_samples=5`	10.1
LDA grid (per cluster)	Extended LDA with coherence optimization	`top-N=10`, `window=20`, search over $(α, β, K)$	24.2
Stability analysis	Topic robustness filtering	$τ = 0.7$	–
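The UMAP and HDBSCAN stages of Table 4 can be sketched with the `umap-learn` and `hdbscan` packages. This is a minimal sketch under assumptions, not the implementation used for the experiments: `dim=10` is mapped to UMAP's `n_components`, the `random_state` value and the function name `cluster_embeddings` are hypothetical, and the libraries are imported lazily so that the parameter definitions can be inspected without them.

```python
# Fixed parameters from Table 4 (`dim=10` is assumed to mean n_components=10)
UMAP_PARAMS = {"n_neighbors": 15, "min_dist": 0.1, "n_components": 10}
HDBSCAN_PARAMS = {"min_cluster_size": 15, "min_samples": 5}


def cluster_embeddings(embeddings, random_state: int = 42):
    """Reduce document embeddings with UMAP, then cluster them with HDBSCAN.

    `embeddings` is an array of shape (n_documents, embedding_dim).
    Returns one integer label per document; -1 marks noise points.
    """
    import umap      # provided by the `umap-learn` package
    import hdbscan   # provided by the `hdbscan` package

    reducer = umap.UMAP(random_state=random_state, **UMAP_PARAMS)
    reduced = reducer.fit_transform(embeddings)
    clusterer = hdbscan.HDBSCAN(**HDBSCAN_PARAMS)
    return clusterer.fit_predict(reduced)
```

The returned labels would then define the document clusters $C_{j}$ on which the per-cluster LDA grid search is run.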

Table 8. Multidimensional comparison of SemaTopic and representative state-of-the-art topic models on the Google Scholar dataset (965 documents), combining semantic coherence ($C_{v}$) with stability, interpretability, and scalability.

Model	Coherence ( $C_{v}$ )	Stability	Interpret.	Scalability
LDA+ [10]	0.47	Medium	High	High
LDA2Vec [12]	0.43	Low	Medium	Medium
ETM [11]	0.42	Low	Medium	Medium
Top2Vec [9]	0.48	Medium	Low	Medium
BERTopic [3]	0.50	Medium	Medium	Low
SemaTopic (ours)	0.53	High	High	High

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).



Cluster	Number of Topics (K)	Topic-Document Prior ( $α$ )	Topic–Word Prior ( $β$ )
Cluster 0	2	0.81	0.01
Cluster 1	2	0.92	0.21
Cluster 2	2	0.87	0.41
Cluster 3	2	0.91	0.01

Cluster ID	Number of Documents	Words Count
Cluster 0	89	6795
Cluster 1	31	2011
Cluster 2	97	6820
Cluster 3	128	6347
