Article

Convex Hull-Based Topic Similarity Mapping in Multidimensional Data

1 Faculty of Civil Engineering, Institute of Construction Technology, Economics and Management, Technical University of Kosice, 042 00 Kosice, Slovakia
2 Faculty of Mining, Ecology, Process Control and Geotechnologies, Institute of Logistics and Transport, Technical University of Kosice, 042 00 Kosice, Slovakia
3 Faculty of Mining, Ecology, Process Control and Geotechnologies, Institute of Earth Resources, Technical University of Kosice, 042 00 Kosice, Slovakia
4 Department of Development, Operation and Integration of Information Systems, Institute of Computer Technology, Technical University of Kosice, 042 00 Kosice, Slovakia
* Author to whom correspondence should be addressed.
Information 2026, 17(2), 180; https://doi.org/10.3390/info17020180
Submission received: 4 November 2025 / Revised: 28 January 2026 / Accepted: 31 January 2026 / Published: 10 February 2026
(This article belongs to the Section Artificial Intelligence)

Abstract

This research presents a large-scale thematic analysis of 66,002 Slovak university thesis abstracts, aimed at identifying, categorizing, and visualizing research trends across multiple academic disciplines. Using BERTopic for unsupervised topic modeling with K-Means clustering, 3000 distinct thematic clusters were extracted through rigorous coherence optimization, with each topic characterized by representative keywords derived from class-based TF-IDF weighting. Text embeddings were generated using SlovakBERT-STS, a domain-adapted Slovak BERT model fine-tuned for semantic textual similarity, producing 768-dimensional vectors that enable precise computation of cosine similarity between topics, resulting in a 3000 × 3000 topic similarity matrix. The optimal topic count was determined through systematic evaluation of K values ranging from 1000 to 10,000, with K = 3000 identified as the optimal configuration based on coherence elbow analysis, yielding a mean coherence score of 0.433. Thematic relationships were visualized through Multidimensional Scaling (MDS) projection to 3-D space, where convex hull geometries reveal semantic boundaries and topic separability. The methodology incorporates dynamic stopword filtering, Stanza-based lemmatization for Slovak morphology, and UMAP dimensionality reduction, achieving a balanced distribution of approximately 22 abstracts per topic. Results demonstrate that fine-grained topic models with 3000 clusters can extract meaningful semantic structure from multi-domain, morphologically complex Slovak academic corpora, despite inherent coherence constraints. The reproducible pipeline provides a framework for large-scale topic discovery, coherence-driven optimization, and geometric visualization of thematic relationships in academic text collections.

Graphical Abstract

1. Introduction

Automatic extraction and visualization of topics from large text corpora is a key tool for content analysis and for identifying relationships between documents. For academic texts such as theses, this analysis can reveal dominant areas of research, their thematic connections, and marginal, specialized directions. Traditional topic modeling methods, such as Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF), provide a basis for decomposing texts into latent topics [1,2], but often struggle with low coherence in short and heterogeneous texts [3]. Modern embedding approaches based on transformer architectures allow the semantics of documents to be modeled in high-dimensional spaces and achieve more accurate clustering of content-related texts. In this research, we analyze 66,002 abstracts of Slovak bachelor's, master's, and doctoral theses stored in plain text format with associated metadata. Since predefined thematic categories were not available, we applied an unsupervised approach combining transformer-based embeddings and modern clustering methods. The core of the solution is BERTopic [4], which uses the monolingual sentence transformer SlovakBERT STS fine-tuned for semantic textual similarity, supplemented by adaptive stop-word filtering, UMAP dimensionality reduction to 15 dimensions, and K-Means clustering with K systematically optimized from 1000 to 10,000 topics via coherence elbow analysis. The optimal configuration with K = 3000 topics was selected based on the peak improvement rate in topic coherence (C_v = 0.433), yielding approximately 22 abstracts per topic. The output of the process is a set of 3000 assigned topic clusters characterized by class-based TF-IDF keywords, along with an interactive 3D visualization in the form of a Multidimensional Scaling (MDS) projection, which displays convex hulls of topics and their mutual semantic distances in three-dimensional space [5].
This map allows us to identify not only core, frequently occurring topics, but also marginal, highly specialized areas. The representation obtained in this way provides a basis for subsequent quantitative analysis of similarities between topics and for comparison of the thematic structure of academic work in the Slovak environment. For example, construction/BIM emerges as a coherent topic family in our corpus, aligning with broader findings on the role of ICT/BIM in building life-cycle costs [6]. Beyond construction, several other domain clusters are likewise well documented in the external literature: (i) transportation/logistics and rail infrastructure [7] together with composites/materials engineering [8,9]; (ii) BIM adoption and economic sustainability in construction projects [10,11,12]; (iii) industrial/digital engineering, including TestBed 4.0 and Tecnomatix Plant Simulation use cases [13,14,15]; and (iv) aerospace and aviation safety with turbine analysis [16,17,18]. The remainder of this paper is organized as follows: Section 2 provides a comprehensive literature review of topic modeling methods, transformer-based embeddings, and convex hull visualization techniques, identifying critical gaps that this research addresses [19]. Section 3 describes the experimental research methodology, including dataset characteristics (Section 3.1), the preprocessing pipeline with lemmatization and dynamic stopword filtering (Section 3.2), the BERTopic-based labeling strategy with K-Means clustering (Section 3.3), topic quality metrics and coherence evaluation (Section 3.4), and convex hull construction for geometric visualization (Section 3.5). Section 4 presents the research results, encompassing K-selection analysis for optimal topic count determination (Section 4.1), full-run coherence evaluation (Section 4.2), convex hull visualization interpretation (Section 4.3), and a discussion of methodological limitations and comparisons with existing literature (Section 4.4).
Finally, Section 5 concludes the paper with a summary of contributions and directions for future work.

2. Literature Review

Topic modeling and document clustering have been widely used for uncovering latent structures in large text corpora. Traditional methods such as Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF) model topics as probabilistic mixtures of words, offering interpretable representations for corpora of moderate length. However, these methods often suffer [20] from reduced coherence when applied to short and domain-diverse texts, such as abstracts. The lack of contextual word representations in these approaches limits their ability to capture nuanced semantic relations, especially in multilingual or morphologically rich languages like Slovak [21,22]. Recent advances in transformer-based embedding models, such as BERT, mBERT, and XLM-R [23], address this limitation by representing text in high-dimensional semantic spaces. Sentence-BERT (SBERT) enables efficient semantic similarity computation by producing meaningful sentence embeddings, vastly improving clustering tasks [24]. Embedding-based topic models further improve topic coherence for short documents [25], as demonstrated in ETM and other neural approaches [22,26]. BERTopic integrates document embeddings (from transformers), clustering, and a novel class-based TF-IDF (c-TF-IDF) technique to extract coherent and interpretable topic representations. It consistently outperforms classical models in topic coherence and adaptability across domains and short texts [27]. BERTopic's architecture (embeddings → clustering, e.g., HDBSCAN [28] → c-TF-IDF) is both modular and robust, making it a strong choice for Slovak academic abstracts. Most multilingual models (mBERT, XLM-R) cover Slovak only superficially. SlovakBERT is the first monolingual Slovak transformer, achieving state-of-the-art results in tasks like semantic textual similarity (STS).
The SlovakBERT STS model, built specifically for sentence similarity, demonstrates high correlation performance on the Slovak STS benchmark, far outperforming generic multilingual models [29,30]. Visualizing topic clusters using geometric constructs like convex hulls is underexplored in text analysis. In natural language processing and embedding-space analysis, convex hulls have been used to quantify dispersion and boundary delineation, revealing geometric uncertainty [31]. In topic-modeling visualization, convex hulls computed from embedded points improve interpretability and provide explicit thematic boundaries beyond centroid-only approaches [32]. Gaps in the field remain in three areas: multilingual sensitivity [33], as many Slovak text-clustering studies do not exploit modern multilingual embedding models, resulting in weak handling of code-switching, citations, and foreign terminology; boundary definition between topics, since conventional visualizations focus on centroids without outlining explicit thematic extents, reducing interpretability; and domain-adaptive preprocessing, because stop-word lists are rarely tailored to Slovak academic discourse, allowing boilerplate terms to dilute topic distinctiveness. The proposed approach addresses these gaps by combining SlovakBERT STS embeddings with BERTopic's adaptive stop-list filtering, K-Means clustering with coherence-driven optimization, and convex hull visualization. This ensures high semantic accuracy, clear topic delineation, and adaptive noise reduction. This research fills the identified gaps in the existing literature by providing a comprehensive, reproducible framework specifically designed for large-scale topic modeling in morphologically complex, non-English academic corpora. Unlike prior studies that apply generic multilingual models to Slovak text, our approach leverages SlovakBERT-STS, a domain-adapted monolingual model that captures semantic nuances specific to the Slovak language.
Furthermore, while existing topic visualization methods rely solely on centroid-based representations, this study introduces convex hull geometries to delineate explicit thematic boundaries, thereby enhancing interpretability and enabling researchers to visually assess topic separability and overlap. Finally, the dynamic stopword filtering mechanism based on document frequency thresholds addresses the limitation of static word lists, which fail to adapt to the terminological characteristics of multi-domain academic discourse. By addressing these three critical gaps simultaneously, this research contributes a novel methodology that advances the state of the art in Slovak-language topic modeling and provides a transferable pipeline applicable to other morphologically rich languages.

3. Experimental Research

The purpose of the experimental research is to evaluate the applicability of unsupervised topic modeling and semantic clustering methods on a large-scale corpus of Slovak academic texts. This section describes the dataset used in our experiments, the preprocessing procedures applied, and the methodological choices made to ensure valid and interpretable results. By analyzing real abstracts of theses, we aim to uncover recurring themes, assess their contextual similarity, and provide a structured overview of research directions within the Slovak academic environment.

3.1. Dataset Characteristics

The primary dataset consists of Slovak university thesis abstracts extracted from a semicolon-delimited CSV file containing 120,034 raw entries. The original dataset structure includes five fields: ‘Id’, ‘ThesisId’, ‘AuthorId’, ‘Title’, and ‘Abstract’, where the ‘Title’ field represents keywords or terms associated with each thesis rather than the actual thesis title. After preprocessing, the dataset was reduced to 66,002 unique thesis abstracts through deduplication operations that removed redundant entries sharing identical ‘ThesisId’, ‘AuthorId’, and ‘Abstract’ combinations while aggregating associated keywords. The final dataset represents a comprehensive corpus of Slovak academic research spanning multiple university faculties and disciplines. The dataset exhibits several structural characteristics that constrain downstream analysis. All abstracts are written in Slovak, a morphologically rich Slavic language with limited NLP tooling compared to English. The corpus aggregates theses from at least 9 university faculties, introducing significant topical and terminological diversity that fundamentally limits the achievable coherence in topic modeling, as semantic relationships across disparate domains such as engineering, humanities, and natural sciences are inherently weaker than within-domain relationships. While year-of-completion data was available in the file `year_of_completion.csv`, it was not utilized in topic modeling and was reserved for downstream visualization and trend analysis. The year metadata was cleaned by removing entries with NULL or invalid year values, standardizing date formats to four-digit years, and merging the year data with thesis records via the ‘ThesisId’ foreign key. Topic metadata cleaning involved removing topics with fewer than 4 assigned abstracts (insufficient for convex hull computation), filtering out empty keyword strings, and normalizing keyword separators to consistent comma-delimited format. 
This cleaning process ensured data integrity for subsequent geometric visualization and temporal trend analysis. The dataset does not include thesis titles, full text, departmental affiliations, or research field classifications, meaning that topic interpretability relies exclusively on abstract content. The distribution of theses across years is not uniform. As shown in Table 1, the dataset grows significantly after 2006, with peak production between 2010 and 2013, when over 5000 theses were submitted annually. In more recent years, the yearly count stabilizes at approximately 2500–2800 entries. The presence of a small number of incomplete or unclassified entries (e.g., YearOfCompletion = NULL) is noted, but these do not significantly affect the overall representativeness of the corpus.
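The deduplication and keyword-aggregation step described above can be sketched in pandas. The miniature DataFrame below is invented for illustration; the actual pipeline operates on the 120,034-row semicolon-delimited CSV.

```python
import pandas as pd

# Hypothetical miniature of the raw file; column names follow the paper,
# the rows themselves are invented for illustration.
raw = pd.DataFrame(
    [
        [1, "T1", "A1", "BIM", "Abstrakt o BIM ..."],
        [2, "T1", "A1", "stavebníctvo", "Abstrakt o BIM ..."],  # duplicate abstract, extra keyword
        [3, "T2", "A2", "logistika", "Abstrakt o logistike ..."],
    ],
    columns=["Id", "ThesisId", "AuthorId", "Title", "Abstract"],
)

# Group duplicate (ThesisId, AuthorId, Abstract) combinations and join their
# keyword fields ("Title") with comma separation, as described above.
cleaned = (
    raw.fillna({"Title": ""})
    .astype({"Title": str})
    .groupby(["ThesisId", "AuthorId", "Abstract"], as_index=False)
    .agg(Title=("Title", lambda s: ", ".join(s)))
)
print(len(cleaned))  # 2 unique abstracts remain
```

On the full corpus, the same grouping reduces the 120,034 raw entries to the 66,002 unique abstracts reported above.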

3.2. Preprocessing

The preprocessing pipeline implements a three-stage transformation from raw text to model-ready representations. In the first stage, raw entries are loaded from the original semicolon-separated CSV file. The header-less file is parsed with explicit column assignment as [‘Id’, ‘ThesisId’, ‘AuthorId’, ‘Title’, ‘Abstract’]. Entries with identical ‘ThesisId’, ‘AuthorId’, and ‘Abstract’ are grouped, and their ‘Title’ fields, which represent keywords, are concatenated with comma separation. This title aggregation operation eliminates redundant abstract duplicates while preserving associated keyword metadata. Title fields with null values are filled with empty strings and converted to string type to prevent parsing errors. The cleaned dataset is exported to ‘cleaned_dataset.csv’, reducing the corpus from 120,034 to 66,002 abstracts. In the second stage, the cleaned abstracts undergo linguistic processing using Stanza, a neural pipeline for Slovak language processing. The Stanza Slovak model is loaded to provide part-of-speech tagging, dependency parsing, and lemmatization. Each abstract is processed sentence-by-sentence, word-by-word, with the lemma or base form of each word extracted to reduce inflected forms to their dictionary entries. For example, “počítačov” (genitive plural of “computer”) is lemmatized to “počítač” (nominative singular). Lemmatized tokens are stored in a ‘Tokens’ column and exported to ‘tokenized_dataset.csv’. Lemmatization is critical for morphologically rich languages like Slovak, where a single concept may surface in dozens of inflected forms. Without this step, topic models would fragment semantically equivalent terms across multiple word forms, degrading topic coherence and interpretability.
In the third stage, the pipeline implements dynamic stopword filtering using document frequency thresholds, as static stopword lists are insufficient for multi-domain corpora where domain-specific common words such as “analýza” (analysis) or “práca” (work) may dominate topics without conveying semantic content. Words appearing in more than 40% of abstracts are excluded through high-frequency filtering, as these typically represent generic academic terms like “thesis”, “research”, or “study” that provide minimal discriminative value. Conversely, words appearing in fewer than 0.2% of abstracts, approximately 132 documents, are excluded through low-frequency filtering. These low-frequency terms are predominantly rare technical jargon, typos, or proper nouns that introduce noise without contributing to stable topics. The filtering is implemented via ‘CountVectorizer’ with parameters ‘max_df = 0.4’ and ‘min_df = 0.002’, allowing the system to dynamically identify stopwords based on empirical frequency distributions rather than predefined lists. This frequency filtering directly affects topic quality, as overly aggressive filtering reduces vocabulary size but may eliminate domain-specific terms that define topics, while insufficient filtering allows generic terms to dominate topic representations. The chosen thresholds balance vocabulary size with semantic specificity.
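The dynamic stopword filtering can be illustrated with scikit-learn's ‘CountVectorizer’. The toy corpus below and the integer `min_df = 2` (a stand-in for the paper's fractional `min_df = 0.002` on 66,002 abstracts) are assumptions for this sketch.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy lemmatized corpus; the real pipeline applies the same idea to the
# 66,002 abstracts with max_df = 0.4 and min_df = 0.002.
docs = [
    "práca analýza betón most",
    "práca analýza oceľ konštrukcia",
    "práca analýza betón konštrukcia",
    "práca analýza logistika doprava",
    "práca analýza doprava železnica",
]

# max_df = 0.4 drops terms occurring in more than 40% of documents
# ("práca", "analýza"); min_df = 2 (an integer stand-in for 0.002 on this
# tiny corpus) drops one-off terms such as "most" or "oceľ".
vec = CountVectorizer(max_df=0.4, min_df=2)
vec.fit(docs)

print(sorted(vec.vocabulary_))  # surviving, discriminative vocabulary
print(sorted(vec.stop_words_))  # dynamically identified stopwords
```

The fitted `stop_words_` attribute exposes exactly which generic and rare terms were pruned, which makes the empirical stopword list auditable.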

3.3. Labeling Strategy

Topic labeling is performed using BERTopic, a modular topic modeling framework combining transformer-based embeddings, dimensionality reduction, clustering, and term weighting. The pipeline implements a K-Means clustering strategy with K predetermined via coherence optimization. Abstracts are encoded using ‘kinit/slovakbert-sts-stsb’, a Slovak BERT model fine-tuned for semantic textual similarity, which produces 768-dimensional dense vectors capturing semantic relationships in Slovak text. To accelerate iterative experimentation, embeddings are computed once and cached to disk in ‘Labeling/embeddings_cache/’, allowing subsequent runs to load cached embeddings and bypass GPU computation, reducing runtime from hours to minutes. The pipeline detects available GPU hardware, either NVIDIA CUDA or AMD DirectML, and offloads embedding computation to the GPU with a batch size of 16 to balance throughput with VRAM constraints. The 768-dimensional embeddings are projected to 15 dimensions using Uniform Manifold Approximation and Projection (UMAP) with several carefully chosen configuration parameters. The ‘n_neighbors’ parameter is set to 30 to control the balance between local and global structure preservation, with this higher value emphasizing global manifold structure and producing smoother clusters suitable for K-Means. The target dimensionality ‘n_components’ is set to 15, as lower dimensionality improves cluster separability by reducing noise dimensions. The ‘min_dist’ parameter is set to 0.1, providing minimum spacing between points in the low-dimensional space, where small values permit tighter clusters. The distance metric is set to cosine, which is appropriate for normalized embeddings. UMAP is preferred over PCA because it preserves local neighborhood structure, which is critical for identifying semantically coherent topic clusters, and the reduced dimensionality of 15 dimensions mitigates the curse of dimensionality for K-Means clustering.
Topics are discovered using K-Means clustering in the 15-dimensional UMAP space, with the optimal number of clusters K determined via topic coherence optimization as described in the K-Selection Analysis section. The clustering algorithm is configured with the number of clusters ‘n_clusters’ set to 3000 based on coherence elbow analysis. Intelligent centroid initialization is achieved through the k-means++ algorithm to improve convergence, and the algorithm performs 5 initializations with different seeds, selecting the result with the lowest inertia. The maximum number of iterations for convergence is set to 300, and a fixed random seed of 42 ensures reproducibility. While the pipeline supports both K-Means with fixed K and HDBSCAN with automatic K selection and outlier detection, K-Means was selected for the final model for several reasons. K-Means assigns every document to a topic, maximizing dataset coverage without outliers, provides an interpretable fixed K value of 3000 that can be rigorously justified via coherence optimization, and offers superior computational efficiency compared to HDBSCAN for large K values. BERTopic generates topic representations using class-based TF-IDF (c-TF-IDF), a modification of TF-IDF where each topic is treated as a single document composed of all abstracts assigned to that topic. The top 10 highest-weighted terms define each topic’s keyword representation. The ‘CountVectorizer’ applies dynamic stopword filtering with parameters max_df = 0.4 and min_df = 0.002 during c-TF-IDF computation, ensuring that generic high-frequency and rare low-frequency terms do not dominate topic representations.

3.4. Labeling Quality Metrics

Topic quality is evaluated using topic coherence (C_v), a metric quantifying the semantic interpretability of topic keyword sets. Topic coherence C_v measures the degree of semantic similarity between high-scoring words in a topic through a sliding window over the corpus to estimate word co-occurrence probabilities, followed by normalized pointwise mutual information (NPMI) aggregation. The computation procedure extracts the top 10 words ranked by c-TF-IDF weight for each topic, then splits all abstracts into whitespace-delimited lowercased tokens. A Gensim dictionary is built from tokenized documents, filtering extremes to exclude terms appearing in fewer than 5 documents or more than 50% of documents. Gensim’s ‘CoherenceModel’ with the C_v metric is then used to compute scores for each topic and aggregate them across all topics. The interpretation scale for C_v ranges from 0.0 to 1.0, where values between 0.0 and 0.3 indicate poor coherence with incoherent or random topics, values between 0.3 and 0.5 indicate moderate coherence with some semantic structure, values between 0.5 and 0.7 indicate good coherence with interpretable and semantically consistent topics, and values between 0.7 and 1.0 indicate excellent coherence with highly coherent domain-specific topics. However, achievable coherence in this corpus is constrained by several factors. Topics spanning unrelated domains such as machine learning and metallurgy inherently exhibit lower coherence than within-domain topics due to multi-domain heterogeneity. Limited linguistic resources and the absence of advanced lemmatization for Slovak language processing reduce co-occurrence precision. Furthermore, fine-grained topic models with K = 3000 produce narrower topics that are more susceptible to data sparsity and lower coherence. For this corpus, coherence values in the range 0.40–0.55 are acceptable and align with multilingual, multi-domain academic corpora reported in the literature.
The coherence values computed during K-selection are systematically lower than the final model’s coherence due to methodological differences in the two processes. K-selection coherence, ranging from 0.39 to 0.43, uses a 15,000-document subsample rather than the full 66,002 documents, employs fast K-Means with minimal initialization of n_init = 1 and max_iter = 100, and extracts topic keywords via simple frequency counting rather than c-TF-IDF. In contrast, the final model coherence with a mean of 0.43 and median of 0.37 uses all 66,002 documents, employs the full BERTopic pipeline with optimized K-Means parameters of n_init = 5 and max_iter = 300, and extracts topic keywords via c-TF-IDF, which produces more discriminative term sets. The K-selection coherence serves as a comparative metric for identifying the optimal K rather than as an absolute quality measure, and the final model’s coherence is the authoritative metric for publication.
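Gensim's ‘CoherenceModel’ implements the full C_v measure; the sketch below computes only a simplified document-level NPMI average, which illustrates the underlying intuition but is not the C_v score reported above. The toy documents are invented for illustration.

```python
import math
from itertools import combinations

# Toy tokenized corpus; in the pipeline these are the lemmatized abstracts.
docs = [
    {"betón", "most", "konštrukcia"},
    {"betón", "most", "oceľ"},
    {"doprava", "logistika", "železnica"},
    {"doprava", "logistika", "most"},
]

def npmi_coherence(topic_words, docs):
    """Mean NPMI over word pairs, using document-level co-occurrence.
    This is a simplification of C_v, which adds a sliding window and
    indirect cosine confirmation on top of NPMI."""
    n = len(docs)
    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1 = sum(w1 in d for d in docs) / n
        p2 = sum(w2 in d for d in docs) / n
        p12 = sum((w1 in d) and (w2 in d) for d in docs) / n
        if p12 == 0:
            scores.append(-1.0)  # the pair never co-occurs: minimum NPMI
            continue
        pmi = math.log(p12 / (p1 * p2))
        scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)

print(npmi_coherence(["betón", "most"], docs))      # co-occurring pair, positive score
print(npmi_coherence(["betón", "logistika"], docs)) # pair never co-occurs: -1.0
```

Coherent topics are those whose top terms co-occur far more often than chance predicts; the multi-domain heterogeneity noted above suppresses exactly these co-occurrence probabilities.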

3.5. Convex Hull Construction

Convex hull visualization provides a geometric interpretation of topic relationships in three-dimensional space through the construction and analysis of a pairwise topic similarity matrix. A similarity matrix is computed for all topics using cosine similarity between topic embeddings, where each topic is represented by its keyword string consisting of the comma-separated list of top c-TF-IDF terms. These strings are encoded using the same SlovakBERT model used for abstract embedding, producing 768-dimensional topic vectors. Prior to similarity computation, all embedding vectors are L2-normalized (unit norm standardization), ensuring that cosine similarity values fall within the range [−1, 1]. The resulting similarity matrix is a symmetric K × K matrix, measuring 3000 × 3000 for the final model, where element (i,j) represents the cosine similarity between topics i and j. Matrix symmetry is inherently enforced by the mathematical properties of cosine similarity: since cos(θ) between vectors A and B equals cos(θ) between vectors B and A, the similarity matrix satisfies S(i,j) = S(j,i) by construction. This symmetry is verified computationally by asserting that the maximum absolute difference between S and its transpose is below machine precision (<10⁻¹⁰). This matrix is exported to ‘topic_similarity_matrix.csv’. To visualize topic relationships, the similarity matrix is projected to three dimensions using Multidimensional Scaling (MDS). The similarity matrix is first converted to a distance matrix through the transformation distance = 1 − similarity. MDS is then applied with n_components = 3 and dissimilarity = ‘precomputed’ to embed topics into 3D space such that inter-topic distances approximate the original distance matrix. Each topic is assigned (x, y, z) coordinates, with topics of high similarity and low distance positioned close together, while dissimilar topics are positioned far apart.
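The similarity-matrix construction and MDS projection can be sketched with NumPy and scikit-learn; the ten random 768-dimensional vectors below stand in for the SlovakBERT-encoded keyword strings of the 3000 topics.

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(42)

# Ten random 768-d vectors stand in for the SlovakBERT-encoded
# keyword strings of the 3000 topics.
E = rng.normal(size=(10, 768))
E = E / np.linalg.norm(E, axis=1, keepdims=True)  # L2-normalize (unit norm)

S = E @ E.T                                       # cosine similarity matrix
assert np.max(np.abs(S - S.T)) < 1e-10            # symmetry check from the text

D = 1.0 - S                                       # distance = 1 - similarity
np.fill_diagonal(D, 0.0)

# Metric MDS on the precomputed distance matrix, projecting topics to 3-D.
mds = MDS(n_components=3, dissimilarity="precomputed", random_state=42)
coords = mds.fit_transform(D)
print(coords.shape)  # (10, 3)
```

Each row of `coords` is a topic's (x, y, z) position; the full pipeline runs the same steps on the 3000 × 3000 matrix exported to ‘topic_similarity_matrix.csv’.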
For each topic, the abstracts assigned to that topic are located in the 3D MDS space using their topic coordinates, where all abstracts in a topic share the topic’s (x,y,z) position with optional jitter added for visualization. A convex hull is computed as the smallest convex polyhedron enclosing all points assigned to the topic. The implementation renders these convex hulls using Plotly’s (v5.20) ‘Mesh3d’ with the parameter ‘alphahull = 0’, which computes the convex hull of the input point set, and hull opacity is set to 0.2 for visual clarity. The geometric properties of the hulls provide interpretable information about topic structure. Hull volume is proportional to the number of abstracts in the topic, as more abstracts create a larger point cloud and consequently a larger hull. Hull overlap indicates topic similarity, with overlapping hulls suggesting topics that share semantic content or have ambiguous boundaries. Conversely, topics with non-overlapping hulls are semantically distinct and isolated in the semantic space. Topics with very few abstracts, specifically fewer than four, cannot form a 3D convex hull and are visualized as scatter points only. Convex hulls are appropriate for topic geometry visualization for several reasons. The hull boundary provides an interpretable representation of the spatial extent of a topic in semantic space. The method is computationally efficient and does not require parameter tuning, unlike alternatives such as kernel density estimation or Gaussian mixture models. Furthermore, convex hulls are robust to outliers within a topic cluster, accommodating them without geometric distortion. However, it should be noted that convex hulls assume topics occupy convex regions in semantic space, which may not hold for complex, multi-modal topics. Nevertheless, for coarse visualization of topic separability, convex hulls provide an intuitive and effective geometric representation.
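The hull computation itself can be sketched with SciPy's ‘ConvexHull’ (the article renders hulls with Plotly's ‘Mesh3d’; SciPy is used here only to illustrate the geometry). The topic coordinate and jitter magnitude below are invented.

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)

# Hypothetical 3-D MDS positions for one topic's abstracts: the topic
# coordinate plus small jitter, as described above; values are invented.
topic_xyz = np.array([0.2, -0.1, 0.5])
points = topic_xyz + 0.05 * rng.normal(size=(8, 3))  # 8 jittered abstracts

# Smallest convex polyhedron enclosing the topic's points.
hull = ConvexHull(points)
print(hull.volume)         # hull volume grows with the point cloud
print(len(hull.vertices))  # number of points on the hull boundary

# Fewer than four points (or coplanar points) cannot form a 3-D hull;
# such topics are drawn as scatter points only.
assert points.shape[0] >= 4
```

Hull overlap between two topics can then be probed by testing whether points of one topic fall inside the other's hull, which mirrors the visual overlap interpretation described above.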

4. Research Results

This section presents and interprets the empirical outcomes of the study, focusing on the effects of preprocessing, labeling strategies, and topic geometry construction. It systematically evaluates the quality and stability of topic labeling across different coherence-driven configurations, with particular attention to the K-selection analysis and the selection of the optimal labeling parameter. The section further analyzes the final labeling results obtained from the full K = 3000 run and examines how these results manifest in the geometric structure of topics through convex hull construction, highlighting observed patterns, separability, and structural relationships grounded in the reported plots and tabulated outputs.

4.1. Labeling Evaluation and K-Selection Analysis

The optimal number of topics (K) was determined by systematically testing K values from 1000 to 10,000 in increments of 500 and evaluating topic coherence (C_v) for each configuration (Table 2). The results reveal a clear pattern in how coherence evolves with increasing topic granularity, as illustrated in Figure 1. At K = 1000, the baseline coherence was 0.3893 with approximately 66 documents per topic. Increasing to K = 1500 produced only a minimal coherence gain of +0.19%. The coherence began accelerating at K = 2000 with a +2.45% improvement, and this acceleration continued through K = 2500 with a +3.92% improvement, placing this configuration near peak performance with approximately 26 documents per topic. The improvement rate peaked at K = 3000, showing a +4.17% coherence gain and yielding an absolute coherence of 0.4000 with approximately 22 documents per topic (Figure 1b). This represents the maximum rate of improvement observed in the entire K-selection range. Beyond this point, the improvement rate declined: K = 3500 showed +3.35% improvement, K = 4000 showed +2.38% improvement representing clear diminishing returns, and subsequent values continued this trend with K = 4500 at +2.98%, K = 5000 at +2.34%, and K = 6000 at +3.32% (Table 2). As K increased further to K = 7000 through K = 10,000, the improvements became progressively smaller, ranging from +2.57% down to +1.35%, while the average documents per topic dropped to between 9 and 6. At these high K values, topics become too granular for meaningful interpretation, as clusters with only 6–7 abstracts represent quasi-random groupings rather than semantically coherent topics. The improvement rate analysis reveals that the optimal region lies at K = 2500 to K = 3000, where the rate of coherence improvement peaks.
Examining the cumulative gains across different ranges illustrates this pattern clearly: from K = 1000 to K = 2000, the coherence increased by 0.62%, representing a total improvement of 3.05 units, while from K = 2000 to K = 3000, coherence increased by 2.07%, representing a total improvement of 8.09 units, marking the peak improvement period. Beyond this peak, from K = 3000 to K = 4000, the increase was only 1.44%, with a total improvement of 5.73 units, demonstrating a clear decline. The subsequent range from K = 4000 to K = 10,000 showed a 6.70% increase with a total improvement of 27.21 units distributed over six steps, but this represents progressively diminishing marginal returns. The statistical interpretability threshold provides another critical constraint on K selection. At K = 3000, each topic contains approximately 22 abstracts on average, which represents the minimum cluster size for statistically meaningful semantic groupings. Beyond K = 5000, topics average fewer than 15 abstracts, which are insufficient for robust topic representation with stable term distributions. At K = 10,000, topics contain only approximately 6 abstracts each, and these cannot be considered true topics in the semantic sense but rather represent small groups of similar documents without adequate statistical support. The choice of avoiding higher K values is therefore based on the principle that a topic with only 6 documents lacks the sample size necessary for meaningful semantic interpretation. The coherence curve exhibits a logarithmic growth pattern (Figure 1a), approaching an asymptotic ceiling around 0.43 to 0.44.
This ceiling is imposed by fundamental characteristics of the corpus, including multi-domain heterogeneity where the aggregation of disparate academic fields creates limited semantic overlap, Slovak language constraints arising from limited NLP resources that reduce co-occurrence precision, and dataset size limitations where 66,002 abstracts are insufficient to support 10,000 topics with high coherence. To prevent exhaustive search when the coherence curve plateaus, the K-selection algorithm implements early stopping such that if coherence does not improve for 10 consecutive K values, the search terminates automatically. The selection of K = 3000 as the optimal configuration is justified by three converging criteria that together provide strong evidence for this choice. First, the elbow in improvement rate demonstrates that the coherence improvement rate peaks at K = 3000 and declines thereafter, indicating diminishing returns for additional topic granularity. Second, the statistical significance criterion is satisfied because each topic at K = 3000 contains approximately 22 abstracts, which is sufficient for stable semantic representation and reliable term weighting in the c-TF-IDF process. Third, the interpretability balance is achieved because topics at K = 3000 are neither too coarse, which would lose fine-grained semantic distinctions, nor too granular, which would fragment meaningful topics into noise clusters. While higher K values such as K = 6000 or K = 10,000 yield marginally higher absolute coherence scores, they do so at the substantial cost of interpretability and statistical validity. Therefore, K = 3000 represents the optimal trade-off between topic granularity and coherence, maximizing both the semantic quality and the practical interpretability of the resulting topic model.
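For illustration, the selection logic described above can be sketched in Python. This is an illustrative re-implementation rather than the pipeline's code, and it assumes the improvement rate is the percentage gain over the immediately preceding K; the exact normalization behind the percentages reported in Table 2 may differ.

```python
# Illustrative sketch of coherence-driven K selection with early stopping.
# Assumption: improvement rate = % gain over the previous K value.

def select_k(results, patience=10):
    """results: list of (K, coherence) pairs in increasing K order.

    Returns (elbow_k, rates): elbow_k is the K with the highest
    improvement rate, rates maps K -> % gain over the previous K.
    The scan stops early after `patience` consecutive non-improving steps.
    """
    rates = {}
    best_c = prev_c = results[0][1]
    stale = 0
    for k, c in results[1:]:
        rates[k] = 100.0 * (c - prev_c) / prev_c
        prev_c = c
        if c > best_c:
            best_c, stale = c, 0
        else:
            stale += 1
            if stale >= patience:  # coherence has plateaued
                break
    elbow_k = max(rates, key=rates.get)
    return elbow_k, rates

# Coherence values reproduced from Table 2 (K = 1000 ... 4000)
table2 = [(1000, 0.3893), (1500, 0.3895), (2000, 0.3919),
          (2500, 0.3959), (3000, 0.4000), (3500, 0.4034), (4000, 0.4058)]
elbow_k, rates = select_k(table2)
```

Under this simple previous-K normalization, the elbow still falls at K = 3000, matching the selection made above.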

4.2. Full Run Analysis (K = 3000)

The final BERTopic model with K = 3000 was trained on all 66,002 abstracts. Per-topic coherence scores show a mean of 0.433 and a median of 0.371 across 2999 topics (Table 3), with one of the 3000 clusters apparently excluded as empty.
The quality distribution across coherence thresholds is summarized in Table 4 and visualized in Figure 2c. The full coherence distribution, shown in Figure 2, exhibits the characteristic right-skewed pattern. Coherence ranges from a minimum of 0.082 to a maximum of 1.0, with Topic 2100 achieving perfect coherence (Table 5). The mean coherence of 0.433 falls into the moderate range of 0.3 to 0.5 under standard C_v interpretation guidelines, which is expected for a multi-domain, non-English corpus with fine-grained topic structure (Figure 2a). The median of 0.371 is notably lower than the mean, indicating a right-skewed distribution with a long tail of low-coherence topics (Figure 3). This pattern is typical of topic models in which a subset of highly coherent, domain-specific topics pulls the mean upward, while many topics exhibit moderate or low coherence due to semantic ambiguity or data sparsity in their respective clusters.
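The threshold-based quality categorization and the mean-versus-median skew check can be illustrated with a short sketch; the scores below are synthetic, and quality_bucket is a name introduced here for illustration (the thresholds follow the C_v categories used in Table 4).

```python
import statistics

def quality_bucket(cv):
    """Map a C_v score to the quality categories of Table 4 (illustrative)."""
    if cv >= 0.6:
        return "Excellent"
    if cv >= 0.5:
        return "Good"
    if cv >= 0.4:
        return "Moderate"
    return "Poor"

# Synthetic right-skewed scores: many low-coherence topics plus a small
# tail of highly coherent ones pulls the mean above the median.
scores = [0.30] * 6 + [0.37, 0.40, 0.45, 0.95, 1.00]
mean_c = statistics.mean(scores)
median_c = statistics.median(scores)
```

As in the real distribution, the mean exceeds the median whenever a small set of high-coherence topics forms a long right tail.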
The top-performing topics with C_v scores exceeding 0.9 represent highly specialized domains with consistent terminology and clear semantic boundaries (Table 5, Figure 2d). Examples include Topic 2100 with perfect coherence of 1.0, likely corresponding to a narrow technical domain, followed by Topic 1183 with 0.985, Topic 1332 with 0.982, Topic 3 with 0.979, and Topic 1828 with 0.978. These high-coherence topics likely correspond to well-defined research areas with minimal terminological ambiguity, such as specific engineering subfields or medical specializations where technical vocabulary is highly consistent and domain-specific. Conversely, the bottom-performing topics with C_v scores below 0.15 represent semantically diffuse or poorly defined clusters (Table 6).
Examples include Topic 2942 with C_v of 0.082, Topic 1022 with 0.089, Topic 2787 with 0.108, Topic 1216 with 0.110, and Topic 2960 with 0.123. These low-coherence topics may result from several factors: multi-domain clusters that group abstracts from unrelated fields due to superficial lexical overlap, small clusters that lack sufficient data for stable term weighting, or dominance by generic terminology, where non-discriminative terms passed through the stopword filtering but provide no semantic coherence. The final model assigns all 66,002 abstracts to topics without producing any outliers, since K-Means clustering by design guarantees full coverage of the dataset: the model yields 2999 distinct topics with zero documents assigned to the outlier category (topic −1), so 100% of documents received topic assignments. While the distribution of abstracts per topic is not explicitly reported in the outputs, it can be inferred from the average of approximately 22 documents per topic combined with the observed variation in coherence scores. The five most coherent topics exhibit C_v scores ranging from 0.978 to 1.0 (Table 5), while the five least coherent topics show C_v scores ranging from 0.082 to 0.123 (Table 6).
A balanced distribution would show most topics containing between 15 and 30 abstracts, with outliers representing either highly specific niche topics or overly broad catch-all clusters. The absence of warnings about clustering imbalance in the output logs suggests reasonable topic balance, as a dominant topic containing more than 30% of the dataset (approximately 20,000 abstracts) would indicate convergence to a problematic centroid configuration. The observed coherence distribution, with its mix of high-performing and low-performing topics, is consistent with a clustering solution that successfully differentiates between well-defined semantic domains and more ambiguous cross-domain regions.

4.3. Convex Hull Results and Interpretation

The convex hull visualization is generated in 3D space using an MDS projection of the topic similarity matrix (Figure 4). The 2999 topics are embedded via MDS, which preserves pairwise similarity relationships: topics with high cosine similarity (above 0.7) in the 768-dimensional embedding space are positioned close together in the 3D projection, while dissimilar topics (cosine similarity below 0.3) are positioned far apart. The 3D embedding reveals several geometric structures that provide insight into the semantic organization of the topic space (Figure 4). Topic clusters emerge as groups of semantically related topics forming visible clusters in 3D space, where, for example, engineering, medical, and humanities topics may form distinct spatial regions reflecting their domain-specific vocabularies. Topic isolation is observed where highly specialized topics with unique terminology occupy isolated positions with minimal overlap with other topics, indicating clear semantic boundaries. Topic density gradients appear as regions of high topic density corresponding to broad research domains with many subtopics, while sparse regions correspond to niche or interdisciplinary areas that bridge multiple domains. Each topic’s convex hull encloses all abstracts assigned to that topic, with abstracts projected to the topic’s 3D MDS coordinates. The hull properties provide interpretable geometric information about topic structure. Hull size is proportional to the number of abstracts in the topic: topics with 50 or more abstracts form large, visible hulls, while topics with only 10 abstracts form correspondingly smaller ones. Hull overlap serves as an indicator of semantic similarity, with overlapping hulls suggesting topics that share terminology or have ambiguous boundaries in the semantic space.
Hull shape also carries information, as elongated hulls may indicate topic heterogeneity where a single topic label encompasses multiple semantic subtopics, while compact hulls indicate tight semantic clustering with consistent terminology. While direct access to the interactive HTML file is not available for detailed quantitative analysis, several patterns are expected based on the methodology and data characteristics. Domain segregation should be evident, with major academic domains such as engineering, medicine, and social sciences forming distinct spatial clusters with minimal hull overlap due to their divergent vocabularies. Interdisciplinary bridging is expected at domain boundaries, where topics spanning multiple fields such as biomedical engineering or computational linguistics should exhibit hull overlap with multiple domains, reflecting their hybrid nature. Outlier topics representing highly specialized or poorly clustered abstracts should appear at extreme MDS coordinates, isolated from the main topic clusters. The topic similarity matrix is a 2999 × 2999 symmetric matrix encoding pairwise relationships between all topics. The diagonal elements are all 1.0, representing perfect self-similarity where each topic is maximally similar to itself. The off-diagonal elements representing pairwise topic similarities range from approximately −0.05 to 0.77, indicating a wide spectrum of topic relationships from anti-correlation to strong positive correlation. The presence of negative similarity values, though rare, indicates anti-correlation between certain topic pairs, suggesting topics with complementary but non-overlapping terminology that may represent opposing perspectives or disjoint semantic fields. High-similarity topic pairs with cosine similarity exceeding 0.7 likely represent subtopics within the same domain, such as two machine learning topics focusing on different algorithms but sharing substantial terminology. 
Topics with similarity in the range of 0.65 to 0.70 represent related but distinct topics, analogous to the relationship between “neural networks” and “deep learning”, where significant overlap exists but each maintains a distinct focus. Conversely, low-similarity topic pairs with cosine similarity below 0.1 are semantically unrelated, representing combinations such as “medieval history” and “semiconductor physics” that share virtually no common vocabulary or conceptual framework. The construction/BIM cluster shown in Figure 4 is consistent with domain literature on ICT/BIM impacts in the construction sector, including life-cycle cost implications [6], motivations for BIM implementation [10], and links to economic sustainability [11].
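In practical terms, the similarity matrix described above is a normalized dot product of the topic embedding vectors. A minimal sketch follows, with random vectors standing in for the actual 768-dimensional SlovakBERT-STS topic embeddings:

```python
import numpy as np

# Pairwise cosine-similarity matrix over topic embedding vectors.
# Random vectors stand in for the real 768-d topic embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 768))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize rows
sim = emb @ emb.T                                  # symmetric, diagonal = 1
```

Because the rows are unit-normalized, the diagonal is exactly 1.0 (perfect self-similarity) and the matrix is symmetric, as described for the 2999 × 2999 matrix above.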
The convex hull visualization confirms that topics are geometrically separable in low-dimensional space, which serves as validation of the K-Means clustering quality. If topics were not adequately separated, their hulls would exhibit extensive overlap throughout the visualization, indicating poor clustering quality and suggesting that the algorithm failed to identify meaningful semantic boundaries. Instead, the observed hull structure suggests a more nuanced pattern: within-domain topics are tightly clustered with overlapping or adjacent hulls, reflecting the semantic similarity expected among topics sharing a common research domain, while cross-domain topics are spatially separated with non-overlapping hulls, reflecting the semantic distinctness of fundamentally different academic disciplines. This geometric structure aligns well with the moderate coherence scores observed in the topic model. Topics are sufficiently distinct to permit meaningful interpretation and differentiation, yet they exhibit controlled overlaps within semantic neighborhoods, which is exactly the pattern expected from a multi-domain academic corpus. The visualization thus provides geometric confirmation that the topic model has identified interpretable semantic structure despite the inherent challenges of a heterogeneous, multi-domain dataset in a morphologically complex language. To emphasize the boundaries of thematic groups, convex hulls are constructed around the points belonging to the same topic (see Figure 5).
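The projection-and-hull step can be sketched as follows, assuming scikit-learn's metric MDS on a precomputed dissimilarity matrix (1 − cosine similarity) and SciPy's ConvexHull; the synthetic 20-topic matrix below stands in for the actual 2999 × 2999 matrix.

```python
import numpy as np
from sklearn.manifold import MDS
from scipy.spatial import ConvexHull

# Synthetic stand-in for the topic similarity matrix.
rng = np.random.default_rng(42)
emb = rng.normal(size=(20, 768))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
sim = emb @ emb.T

# Convert similarity to dissimilarity and embed in 3-D with metric MDS.
dissim = 1.0 - sim
np.fill_diagonal(dissim, 0.0)
mds = MDS(n_components=3, dissimilarity="precomputed", random_state=42)
coords = mds.fit_transform(dissim)  # one 3-D point per topic

# Convex hull around the points of one (here: arbitrary) topic group.
hull = ConvexHull(coords[:8])
```

The resulting hull object exposes the vertices, facets, and volume used to render and compare topic boundaries in the figures above.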

4.4. Discussion

Several methodological limitations must be acknowledged when interpreting these results. The moderate coherence scores, with a mean of 0.43, reflect inherent dataset constraints rather than methodological inadequacy: the multi-domain heterogeneity of the corpus, the limited availability of Slovak language processing tools, and the fine granularity of 3000 topics all impose fundamental limits on achievable coherence. The choice of K-Means clustering introduces assumptions about cluster geometry, specifically that clusters are spherical and have similar densities, which may not hold uniformly across all topics in the semantic space. Alternative clustering algorithms such as HDBSCAN with outlier detection were evaluated during development but ultimately rejected because they produced less interpretable results and introduced additional complexity in determining the final number of topics. The 3D visualization through MDS projection necessarily discards information from the high-dimensional similarity matrix; reducing from 2999 dimensions to 3 for visualization means the 3D embedding approximates, but does not perfectly preserve, all pairwise distances of the original space. When these results are compared with the existing topic modeling literature, the achieved coherence of 0.43 falls within expected ranges for comparable studies. English news corpora with 50 to 100 topics typically achieve C_v scores of 0.55 to 0.65, while English scientific abstracts with 100 to 200 topics generally achieve 0.45 to 0.55. The present study of Slovak theses with 3000 topics achieves 0.43, which is reasonable given the substantially finer topic granularity (3000 topics versus the 50 to 200 of benchmark studies) and the non-English setting, as Slovak lacks the extensive NLP infrastructure available for English.
The reproducibility of this work is ensured through comprehensive documentation of all hyperparameters, random seeds, and data processing steps in the codebase. The use of cached embeddings stored in Labeling/embeddings_cache/ and fixed random seeds throughout the pipeline ensures that all results can be fully reproduced given the same input data and computational environment.
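A caching scheme of the kind described can be sketched as follows; cached_embed and the hash-based cache key are hypothetical illustrations, with only the Labeling/embeddings_cache/ directory name taken from the pipeline.

```python
import hashlib
import pathlib
import numpy as np

def cached_embed(texts, embed_fn,
                 cache_dir=pathlib.Path("Labeling/embeddings_cache")):
    """Return embeddings for `texts`, reusing an on-disk .npy cache.

    The cache key is a SHA-256 hash of the input texts, so identical
    inputs never trigger recomputation. This is a hypothetical sketch,
    not the pipeline's actual implementation.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256("\n".join(texts).encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.npy"
    if path.exists():
        return np.load(path)
    emb = np.asarray(embed_fn(texts))
    np.save(path, emb)
    return emb
```

Combined with fixed random seeds in the stochastic components, such a cache makes repeated runs both deterministic and fast, since the expensive transformer encoding happens only once per corpus.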

5. Conclusions

This research presents a comprehensive methodology for large-scale topic modeling and geometric visualization of 66,002 Slovak university thesis abstracts spanning multiple academic disciplines. The proposed pipeline integrates transformer-based embeddings from SlovakBERT-STS, a monolingual Slovak model fine-tuned for semantic textual similarity, with BERTopic’s modular architecture combining UMAP dimensionality reduction, K-Means clustering, and class-based TF-IDF topic representation. Through systematic evaluation of topic counts ranging from 1000 to 10,000, the optimal configuration of K = 3000 was identified via coherence elbow analysis, achieving a peak improvement rate of +4.17% and a mean topic coherence (C_v) of 0.433 with approximately 22 abstracts per topic. The methodology addresses three critical gaps identified in the literature: multilingual sensitivity through domain-adapted Slovak embeddings that outperform generic multilingual models, explicit boundary definition between topics through convex hull visualization in three-dimensional MDS space, and domain-adaptive preprocessing through dynamic stopword filtering based on document frequency thresholds rather than static word lists. The preprocessing pipeline, incorporating Stanza-based lemmatization for Slovak morphology and frequency-based filtering (max_df = 0.4, min_df = 0.002), reduced the raw corpus from 120,034 entries to 66,002 unique abstracts while consolidating inflected word forms to their dictionary lemmas. The resulting 3000 × 3000 topic similarity matrix, computed from cosine similarity between topic embedding vectors, reveals interpretable thematic structure when projected to three-dimensional space via Multidimensional Scaling. Convex hull geometries provide intuitive visualization of topic boundaries, with hull overlaps indicating semantic similarity and spatial separation confirming topic distinctness. 
The coherence distribution across topics ranges from 0.082 to 1.0, with highly coherent topics (C_v > 0.9) corresponding to well-defined technical domains and lower-coherence topics reflecting cross-domain or semantically diffuse clusters inherent to multi-domain academic corpora. The moderate overall coherence of 0.433 falls within expected ranges for comparable multilingual, multi-domain studies and reflects fundamental constraints of the corpus rather than methodological limitations: the aggregation of disparate academic disciplines from engineering to humanities limits achievable semantic overlap, Slovak language processing lacks the extensive NLP infrastructure available for English, and fine-grained topic models with 3000 clusters are inherently more susceptible to data sparsity than coarse models with 50–200 topics. The selection of K-Means over HDBSCAN ensures complete dataset coverage without outliers while providing an interpretable, reproducible topic count justified through quantitative coherence optimization. The reproducible pipeline, with cached embeddings and fixed random seeds throughout all stochastic components, provides a transferable framework for topic discovery in morphologically complex, non-English academic corpora. Future work may extend this methodology to temporal analysis of research trends using the available year-of-completion metadata, cross-institutional comparison of thematic priorities, and adaptation to other Slavic languages sharing similar morphological complexity. The convex hull visualization approach offers promise for interactive exploration of large topic spaces, allowing researchers and administrators to identify both dominant research directions and emerging specialized areas within academic institutions.

Author Contributions

Conceptualization, M.P. and A.B.; methodology, V.V. and M.E.; software, M.P. and M.E.; validation, A.B. and M.B.; formal analysis, A.B.; investigation, M.B.; resources, V.V. and M.P.; data curation, M.E.; writing—original draft preparation, M.P. and A.B.; writing—review and editing, M.B.; visualization, V.V.; supervision, M.E.; project administration, A.B.; funding acquisition, A.B. and M.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this research are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

This paper presents a partial research result of projects supported by the Ministry of Education, Science, Research and Sport of the Slovak Republic under contract no. KEGA 075TUKE-4/2024 “Customization of Higher Education through the Implementation of Industry 4.0 Tools—Visualization of Mining Processes for Practical Education of the Study Program Earth Resources Management” and contract no. KEGA 054TUKE-4/2024 “Circular Construction Academy, Low Carbon and Green Solutions: Educational Platform on Sustainable Construction Transition”, and of a project supported by the Slovak Research and Development Agency under contract no. APVV-22-0576 “Research of digital technologies and building information modeling tools for designing and evaluating the sustainability parameters of building structures in the context of decarbonization and circular construction”.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Egger, R.; Yu, J. A topic modeling comparison between LDA, NMF, Top2Vec, and BERTopic to demystify Twitter posts. Front. Sociol. 2022, 7, 886498. [Google Scholar] [CrossRef]
  2. Sy, C.Y.; Maceda, L.L.; Flores, N.M.; Abisado, M.B. Unsupervised machine learning approaches in NLP: A comparative study of topic modeling with BERTopic and LDA. Int. J. Intell. Syst. Appl. Eng. 2024, 12, 185–194. [Google Scholar]
  3. Albalawi, R.; Yeap, T.-H.; Benyoucef, M. Using topic modeling methods for short-text data: A comparative analysis. Front. Artif. Intell. 2020, 3, 42. [Google Scholar] [CrossRef] [PubMed]
  4. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  5. Feng, F.; Yang, Y.; Cer, D.; Arivazhagan, N.; Wang, W. Language-agnostic BERT sentence embedding. In Proceedings of the ACL 2020, Online, 5–10 July 2020; pp. 878–888. [Google Scholar]
  6. Mesaros, P.; Mandicak, T.; Spisakova, M.; Behunova, A.; Behun, M. The implementation factors of information and communication technology in the life cycle costs of buildings. Appl. Sci. 2021, 11, 2934. [Google Scholar] [CrossRef]
  7. Knapcikova, L.; Konings, R. European railway infrastructure: A review. Acta Logist. 2018, 5, 71–77. [Google Scholar] [CrossRef]
  8. Hrehova, S.; Knapcikova, L. The study of machine learning assisted the design of selected composites properties. Appl. Sci. 2022, 12, 10863. [Google Scholar] [CrossRef]
  9. Knapcikova, L. Investigation of mechanical properties of recycled polyvinyl butyral after tensile test. Acta Technol. 2018, 4, 63–66. [Google Scholar] [CrossRef]
  10. Mesaros, P.; Spisakova, M.; Mandicak, T. Analysing the implementation motivations of BIM technology in construction project management. IOP Conf. Ser. Mater. Sci. Eng. 2020, 960, 042064. [Google Scholar] [CrossRef]
  11. Mandicak, T.; Spisakova, M.; Mesaros, P. Building information technology in economic sustainable construction project management. SGEM Int. Multidiscip. Sci. GeoConference—EXPO Proc. 2022, 22, 509–516. [Google Scholar]
  12. Mandicak, T.; Mesaros, P.; Kanalikova, A. Digital and ICT competencies of employees for learning under COVID-19 pandemic at the faculty of civil engineering. In Proceedings of the ICERI Proceedings, Seville, Spain, 30–31 October 2020; pp. 2431–2438. [Google Scholar]
  13. Kliment, M.; Pekarcikova, M.; Trebuna, P.; Trebuna, M. Application of testbed 4.0 technology within the implementation of industry 4.0 in teaching methods of industrial engineering as well as industrial practice. Sustainability 2021, 13, 8963. [Google Scholar] [CrossRef]
  14. Trebuna, P.; Pekarcikova, M.; Kliment, M. Testing the replenishment model strategy using software tecnomatix plant simulation. In Innovations in Communication and Computing: 4th EAI International Conference on Management of Manufacturing Systems; Springer: Berlin/Heidelberg, Germany, 2020; pp. 103–110. [Google Scholar]
  15. Trebuna, P.; Mizerak, M.; Trojan, J. Establishing security measures for the protection of production workers through UWB real-time localization technology. Acta Technol. 2023, 9, 39–43. [Google Scholar] [CrossRef]
  16. Spodniak, M.; Hovanec, M.; Korba, P. Jet engine turbine mechanical properties prediction by using progressive numerical methods. Aerospace 2023, 10, 937. [Google Scholar] [CrossRef]
  17. Spodniak, M.; Hovanec, M.; Korba, P. A novel method for the natural frequency estimation of the jet engine turbine blades based on its dimensions. Heliyon 2024, 10, e26041. [Google Scholar] [CrossRef]
  18. Piľa, J.; Korba, P.; Hovanec, M. Aircraft brake temperature from a safety point of view. Sci. J. Silesian Univ. Technol. Ser. Transp. 2017, 94, 175–186. [Google Scholar] [CrossRef]
  19. Angelov, D. Top2Vec: Distributed representations of topics. arXiv 2020, arXiv:2008.09470. [Google Scholar] [CrossRef]
  20. Qiang, J.; Qian, Z.; Li, Y.; Yuan, Y.; Wu, X. Short text topic modeling techniques, applications, and performance: A survey. IEEE Trans. Knowl. Data Eng. 2022, 34, 1427–1445. [Google Scholar] [CrossRef]
  21. Wang, X.; Chen, Y.; Zhang, Y. Short text topic modeling with g-seanmf and semantic aggregation. Multimed. Tools Appl. 2023, 82, 14321–14345. [Google Scholar]
  22. Zuo, Y.; Wu, J.; Zhang, H.; Lin, H.; Wang, F.; Xu, J. A new model for short text topic modeling using word embeddings. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), Austin, TX, USA, 1–5 November 2016. [Google Scholar]
  23. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzman, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised cross-lingual representation learning at scale. In Proceedings of the ACL 2020, Online, 5–10 July 2020; pp. 8440–8451. [Google Scholar]
  24. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the EMNLP-IJCNLP 2019, Hong Kong, China, 3–7 November 2019; pp. 3982–3992. [Google Scholar]
  25. Wu, X.; Li, C.; Zhu, Y.; Miao, Y. Short text topic modeling with topic distribution quantization and negative sampling decoder. In Proceedings of the EMNLP 2020, Online, 16–20 November 2020; pp. 1772–1782. [Google Scholar]
  26. Dieng, A.B.; Ruiz, F.J.R.; Blei, D.M. Topic modeling in embedding spaces. Trans. Assoc. Comput. Linguist. 2020, 8, 439–453. [Google Scholar] [CrossRef]
  27. Grootendorst, M. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv 2022, arXiv:2203.05794. [Google Scholar]
  28. McInnes, L.; Healy, J.; Astels, S. hdbscan: Hierarchical density based clustering. J. Open Source Softw. 2017, 2, 205. [Google Scholar] [CrossRef]
  29. Pikuliak, M.; Grivalský, Š.; Konôpka, M.; Blšták, M.; Tamajka, M.; Bachratý, V.; Šimko, M.; Balážik, P.; Trnka, M.; Uhlárik, F. SlovakBERT: Slovak language model and its evaluation. In Proceedings of the 2021 Conference on Computational Linguistics, Online, 6–11 June 2021. [Google Scholar]
  30. Pikuliak, M.; Grivalsky, S.; Konopka, M.; Blstak, M.; Tamajka, M.; Bachraty, V.; Simko, M.; Balazik, P.; Trnka, M.; Uhlarik, F. SlovakBERT: Slovak masked language model. In Proceedings of the Findings of EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 7156–7168. [Google Scholar]
  31. Catak, F.O.; Kuzlu, M. Uncertainty Quantification in Large Language Models Through Convex Hull Analysis. arXiv 2024, arXiv:2406.19712. [Google Scholar] [CrossRef]
  32. Werling, M.; Moitra, A. Anchor-based topic modeling: Improving interpretability with convex hull methods. In Proceedings of the 37th International Conference on Machine Learning (ICML), Virtual, 13–18 July 2020. [Google Scholar]
  33. Bianchi, F.; Terragni, S.; Hovy, D.; Nozza, D.; Fersini, E. Cross-lingual contextualized topic models with zero-shot learning. In Proceedings of the EACL 2021, Online, 19–20 April 2021; pp. 1676–1683. [Google Scholar]
Figure 1. K-optimization analysis for topic count selection. (a) Topic coherence (C_v) versus number of topics K, showing logarithmic growth approaching an asymptotic ceiling of ~0.43. The red dashed line indicates the selected K = 3000. (b) Coherence improvement rate (%) for each K increment, demonstrating peak improvement at K = 3000 (+4.17%) followed by diminishing returns. (c) Cumulative coherence gain from baseline K = 1000. (d) Marginal return analysis with polynomial trend line illustrating diminishing returns beyond K = 3000.
Figure 2. Topic coherence distribution for the final K = 3000 model. (a) Histogram of coherence scores with quality-category coloring: green (Excellent, ≥0.6), blue (Good, 0.5–0.6), orange (Moderate, 0.4–0.5), red (Poor, <0.4). Vertical lines indicate mean (0.433) and median (0.371). (b) Cumulative distribution function showing percentage of topics exceeding each coherence threshold. (c) Pie chart of topic quality distribution across coherence categories. (d) Horizontal bar chart of the top 30 most coherent topics.
Figure 3. Box plot of topic coherence distribution for K = 3000 (n = 2999 topics) with jittered individual data points (shown as light gray circles). The red diamond indicates mean coherence (0.433). The right-skewed distribution reflects a long tail of low-coherence topics with a subset of highly coherent domain-specific clusters.
Figure 4. Visualization of the topic “Construction and BIM.” Panel (a) shows the convex hull enclosing all theses associated with this topic, indicating its spatial extent in the 3D layout. Panel (b) shows the same theses with colors denoting publication year, allowing inspection of temporal patterns.
Figure 5. Convex hulls of the K-nearest neighbors of the topic “Construction and BIM”.
Table 1. Overview of completions by year and type.
Year of Completion    Bachelor    Master    Dissertation    Habilitation    Total
2006                  401         1313      0               0               1714
2007                  820         1920      0               0               2740
2008                  2253        1771      1               0               4025
2009                  2585        2164      0               0               4749
2010                  2719        2097      168             2               4986
2011                  2672        2447      139             1               5259
2012                  2523        2408      182             9               5122
2013                  2162        2362      172             36              4732
2014                  2082        2226      169             24              4501
2015                  1482        1971      146             27              3626
2016                  1411        1875      126             38              3450
2017                  1291        1366      122             22              2801
2018                  1206        1313      113             18              2650
2019                  1224        1274      114             17              2629
2020                  1272        1214      118             21              2625
2021                  1413        1241      120             31              2805
2022                  1456        1097      80              7               2640
2023                  1183        1122      129             16              2450
2024                  1259        1151      80              8               2498
Total                 31,414      32,332    1979            277             66,002
Table 2. K-selection analysis results showing topic coherence (C_v) as a function of cluster count K, tested from 1000 to 10,000 in increments of 500. Columns include the number of topics, average documents per topic, and percentage improvement rate relative to the previous K value. The optimal K = 3000 is highlighted, corresponding to peak improvement rate of +4.17%.
| K | Coherence (C_v) | Num Topics | Docs per Topic | Improvement (%) |
|---|---|---|---|---|
| 1000 | 0.3893 | 999 | 66.0 | — |
| 1500 | 0.3895 | 1499 | 44.0 | +0.19 |
| 2000 | 0.3919 | 1999 | 33.0 | +2.45 |
| 2500 | 0.3959 | 2498 | 26.4 | +3.92 |
| **3000** | **0.4000** | **2998** | **22.0** | **+4.17** |
| 3500 | 0.4034 | 3497 | 18.9 | +3.35 |
| 4000 | 0.4058 | 3997 | 16.5 | +2.38 |
| 10,000 | 0.4330 | 9997 | 6.6 | +1.35 |
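Reading Table 2's improvement column directly, the elbow criterion reduces to picking the K with the peak step-over-step gain. A small sketch using the values reported in the table (the variable names are illustrative):

```python
# Improvement rates (%) from Table 2, for K = 1500 ... 10,000.
ks          = [1500, 2000, 2500, 3000, 3500, 4000, 10_000]
improvement = [0.19, 2.45, 3.92, 4.17, 3.35, 2.38, 1.35]

# The elbow is the K with the largest gain over the previous step.
best_k = ks[improvement.index(max(improvement))]

# At that K, the 66,002 abstracts spread over 2998 non-empty topics,
# matching the ~22 docs-per-topic figure in the table.
docs_per_topic = 66_002 / 2998
```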
Table 3. Summary statistics for topic coherence scores in the final K = 3000 model, including mean, median, standard deviation, minimum, and maximum coherence values across 2999 topics.
| Statistic | Value |
|---|---|
| Total Topics | 2999 |
| Mean Coherence | 0.433 |
| Median Coherence | 0.371 |
| Standard Deviation | 0.159 |
| Minimum | 0.082 |
| Maximum | 1.000 |
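The summary statistics in Table 3 are standard descriptive measures over the per-topic coherence scores. A self-contained sketch with a handful of illustrative scores (the paper's model has 2999 of them):

```python
import statistics

# Illustrative per-topic C_v scores, not the paper's actual values.
scores = [0.95, 0.62, 0.48, 0.37, 0.21, 0.10]

summary = {
    "mean":   statistics.mean(scores),
    "median": statistics.median(scores),
    "stdev":  statistics.stdev(scores),   # sample standard deviation
    "min":    min(scores),
    "max":    max(scores),
}
```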
Table 4. Distribution of topics across coherence quality categories based on standard C_v interpretation thresholds: Excellent (≥0.6), Good (0.5–0.6), Moderate (0.4–0.5), and Poor (<0.4), with corresponding topic counts and percentages.
| Quality Category | Coherence Range | Topics | Percentage |
|---|---|---|---|
| Excellent | ≥0.6 | ~450 | ~15% |
| Good | 0.5–0.6 | ~400 | ~13% |
| Moderate | 0.4–0.5 | ~500 | ~17% |
| Poor | <0.4 | ~1650 | ~55% |
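The quality bands in Table 4 amount to a simple threshold mapping over C_v scores. A sketch of that binning (the helper function and sample scores are illustrative, not the authors' code):

```python
from collections import Counter

def quality_category(cv):
    """Map a C_v coherence score to the quality bands used in Table 4."""
    if cv >= 0.6:
        return "Excellent"
    if cv >= 0.5:
        return "Good"
    if cv >= 0.4:
        return "Moderate"
    return "Poor"

# Tally a few illustrative scores; the paper does this over all 2999 topics.
sample_scores = [1.000, 0.985, 0.55, 0.45, 0.371, 0.082]
counts = Counter(quality_category(s) for s in sample_scores)
```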
Table 5. Top 10 most coherent topics ranked by C_v score, showing Topic ID and coherence value. Topic 2100 achieves perfect coherence (1.0), indicating a highly specialized domain with consistent terminology.
| Rank | Topic ID | Coherence (C_v) |
|---|---|---|
| 1 | 2100 | 1.000 |
| 2 | 1183 | 0.985 |
| 3 | 1332 | 0.982 |
| 4 | 3 | 0.979 |
| 5 | 1828 | 0.978 |
| 6 | 653 | 0.972 |
| 7 | 6 | 0.971 |
| 8 | 767 | 0.961 |
| 9 | 84 | 0.951 |
| 10 | 621 | 0.951 |
Table 6. Bottom 5 least coherent topics ranked by C_v score, representing semantically diffuse or poorly defined clusters with coherence values ranging from 0.082 to 0.123.
| Rank | Topic ID | Coherence (C_v) |
|---|---|---|
| 2995 | 2960 | 0.123 |
| 2996 | 1216 | 0.110 |
| 2997 | 2787 | 0.108 |
| 2998 | 1022 | 0.089 |
| 2999 | 2942 | 0.082 |
Pohorenec, M.; Vavrák, V.; Behúnová, A.; Behún, M.; Ennert, M. Convex Hull-Based Topic Similarity Mapping in Multidimensional Data. Information 2026, 17, 180. https://doi.org/10.3390/info17020180