Digital Medical Catalog: Harnessing AI for Automated Classification and Analysis of Medical Data

Ruvunangiza, Jeremie Biringanine; Sakuyama, Carlos Alberto Valderrama

doi:10.3390/aimed1020010


by Jeremie Biringanine Ruvunangiza * and Carlos Alberto Valderrama Sakuyama *

SEMi Department of Electronics & Microelectronics, Polytechnique Faculty, UMons University of Mons, 7000 Mons, Belgium

* Authors to whom correspondence should be addressed.

AI Med. 2026, 1(2), 10; https://doi.org/10.3390/aimed1020010

Submission received: 13 December 2025 / Revised: 29 January 2026 / Accepted: 26 February 2026 / Published: 3 April 2026


Abstract

The exponential growth of unstructured medical data, particularly clinical notes and diagnostic reports, presents mounting challenges for healthcare knowledge extraction and utilization. This study introduces the Digital Medical Catalog (DMC), a framework that automates the conversion of clinical narratives into an auditable, semantically structured knowledge base. The framework combines BioClinicalBERT embeddings, c-TF-IDF statistical grounding, and semantic clustering, enabling high-fidelity classification (Macro F1 = 0.877 ± 0.012), traceable topic labeling, and temporal trend analysis. By demonstrating that semantic representation methods, reinforced with statistical grounding, are essential for large-scale medical text processing, this work establishes a foundation for privacy-preserving data governance and real-time intelligence within modern healthcare infrastructures.

Keywords: semantic NLP; Bio-ClinicalBERT; TF-IDF; data auditing; F1-score; topic modeling; data catalog

1. Introduction

Healthcare systems generate large volumes of unstructured data daily, including clinical notes, diagnostic reports, and imaging descriptions. Although these narratives offer rich clinical insight, their unstructured nature poses significant challenges for data management, interoperability, and large-scale analysis [1]. The absence of standardized formats and terminologies limits the extraction of actionable knowledge and complicates data integration across institutions. Traditional manual curation remains labor-intensive, error-prone, and unscalable, creating persistent bottlenecks in clinical workflows. As healthcare organizations advance their digital transformation, scalable and automated approaches to processing unstructured medical data have become essential for handling the growing volume and complexity of medical information. This paper addresses this challenge by proposing an AI-powered Digital Medical Catalog (DMC), designed to automatically classify, analyze, and organize diverse medical data types, thus transforming raw information into structured, actionable insights. The novelty of this work does not lie in individual NLP components such as contextual embeddings, dimensionality reduction, clustering, or keyword extraction, which are well established in existing topic-modeling pipelines, including those built on BioClinicalBERT. Rather, the contribution is a governance-oriented system design that integrates semantic representation with compliance-grade statistical grounding, an explicit Explainability Confidence Index (ECI), and a catalog abstraction that clearly separates modeling from auditability. By emphasizing traceability, confidence signaling, and structural transparency, the proposed Digital Medical Catalog is designed to support governance-ready analytics rather than to optimize task-specific topic discovery alone.
While the framework is designed to enable downstream applications such as privacy-preserving analytics and federated learning, these aspects are presented as future integrations and are not evaluated in the scope of this study.

The integration of artificial intelligence (AI) and natural language processing (NLP) into clinical settings holds considerable promise, with domain-specific large language models (LLMs) such as BioClinicalBERT demonstrating robust performance in analyzing complex semantics and contextual relationships within medical text [1,2]. Nevertheless, translating these technical advances into routine practice remains hampered by persistent operational challenges. Embedding these NLP capabilities in established clinical workflows and managing the profound heterogeneity of healthcare data sources continue to be significant barriers to widespread implementation [3].

To address these limitations, we propose the Digital Medical Catalog (DMC), a framework that automates the classification, organization, and semantic analysis of unstructured clinical narratives [4]. The DMC applies advanced NLP methods to group clinical texts into semantically related categories, extract and classify key medical entities (e.g., symptoms, diagnoses, treatments), and perform longitudinal trend analyses to uncover temporal patterns in medical documentation. The resulting output is a structured and hierarchical catalog enriched with metadata such as timestamps, frequency metrics, and topic classifications [5]. This transformation enables the rapid retrieval of clinically relevant information and supports the automated detection of emerging trends that can be overlooked in manual review. Furthermore, the DMC integrates with existing health information systems to facilitate interoperability and offers a dynamic searchable interface for clinicians and researchers to explore and query medical data efficiently [6]. As shown in Figure 1, the DMC framework integrates local catalog creation, federated model training, and trend analysis into a unified pipeline, ensuring data sovereignty and auditability across institutions.

This work examines whether free-text clinical narratives can automatically be converted into a structured, auditable knowledge base that maintains semantic accuracy while supporting governance-ready analytics and temporal monitoring [7]. To ensure the reliability of this automated processing, we introduce a validation mechanism that hybridizes semantic and statistical methods, using TF-IDF grounding to constrain the model output and mitigate factual errors [8]. Finally, we demonstrate that this cataloged and validated data foundation enables dynamic risk monitoring, where temporal analytics uncover critical trends for decision support and public health surveillance [9]. Together, these elements deliver a system capable of high-fidelity classification (Macro F1 = 0.877), auditable AI outputs, and proactive clinical insight generation.

The remainder of this paper is organized as follows. Section 2 delivers an extensive examination of existing efforts in the areas of automated data cataloging, data stewardship, and the integration of databases within the healthcare sector. Section 3 outlines our Digital Medical Catalog framework proposal, detailing the data preprocessing pipeline, the semantic feature engineering technique, and the hybrid labeling strategy employed. Section 4 showcases the experimental outcomes, including quantitative performance metrics and an analysis of temporal trends. Section 5 explores the implications of our research findings, the limitations of the current methodology, and potential avenues for future exploration. Lastly, Section 6 encapsulates the principal contributions and brings the paper to a close.

The organizational structure ensures a logical flow from problem contextualization through solution development to empirical validation and critical analysis, providing readers with a clear understanding of both the theoretical foundations and practical implementations of our Digital Medical Catalog framework.

Although the underlying pipeline shares structural similarities with existing topic-modeling approaches, the distinguishing contribution of this work lies in its system-level emphasis on auditability, governance, and traceable semantic organization rather than on topic discovery performance alone.

Unlike many recent clinical NLP systems that focus on predictive accuracy or summarization, the Digital Medical Catalog is intended as an infrastructure layer that supports review, oversight, and downstream analytics through explicit confidence estimation and grounding mechanisms.

Claims regarding deployment readiness should therefore be interpreted in terms of design alignment rather than operational validation, as real-world deployment would require institutional integration, clinical oversight, and compliance verification beyond the scope of this work.

2. Literature Review

The challenge of extracting structured knowledge from unstructured clinical text has been a persistent focus of healthcare informatics research. This section reviews existing approaches to automated data cataloging, data stewardship, and database integration for healthcare data, highlighting the limitations that motivate our integrated framework.

2.1. Automated Data Cataloging in Healthcare

Traditional data catalogs function as centralized inventories for data assets, significantly improving discoverability and governance of structured data [10]. However, their reliance on manual metadata entry makes them impractical for the volume and heterogeneity of unstructured clinical narratives [2,11]. This scalability gap has driven the development of automated solutions.

Recent advances in Natural Language Processing and Large Language Models offer transformative potential. Systems such as BioClinicalBERT [1] have demonstrated robust capabilities in generating semantically meaningful representations from clinical text, providing a foundation for automated metadata extraction. These techniques surpass manual methods in both scale and accuracy. Despite this technical promise, significant adoption barriers remain, particularly in terms of seamless integration into clinical workflows and handling the wide diversity of documentation styles found in real-world settings [12,13].

Our Digital Medical Catalog framework advances the state of the art by integrating automated classification and topic extraction into an end-to-end system. It addresses the critical challenges of scale and heterogeneity that have limited previous approaches, while maintaining interoperability with existing Electronic Health Record (EHR) infrastructures.

2.2. Data Stewardship and Governance

In healthcare, data stewardship encompasses the entire data lifecycle, requiring stringent quality control, privacy safeguards (e.g., HIPAA, GDPR), and context-aware access management. The sensitivity of patient data requires specialized protections, including end-to-end encryption, granular audit trails, and metadata frameworks that preserve both clinical meaning and regulatory compliance.

The foundational organizational frameworks for data stewardship have been established in prior work [10]. However, these models often depend on manual interventions, rendering them inadequate for the automated high-frequency processing demands of modern clinical data streams [14]. Maintaining a medical data catalog is particularly resource intensive due to the convergence of heterogeneous information systems, constantly evolving medical terminology, and strict compliance requirements, a challenge also noted in ML data management contexts [14].

Our framework addresses these limitations through an automated stewardship architecture that combines NLP-driven classification with continuous catalog synchronization. This approach enforces standardized quality protocols and immutable lineage tracking, creating governance-ready data assets that inherently satisfy healthcare’s requirements for auditability and semantic consistency.

2.3. Database Integration and Data Management

Effective use of processed clinical narratives requires storage architectures that balance high-speed data ingestion, low-latency retrieval, and sophisticated semantic query capabilities. Traditional systems often prioritized metadata management over content exploitation, but modern approaches increasingly adopt hybrid models that combine NoSQL scalability with relational query robustness.

Although previous research [10] has demonstrated the benefits of tightly coupling data catalogs with storage systems to improve schema awareness, these frameworks require extension to handle the unique semantics of clinical data. A key challenge is to preserve the nuanced context of the original clinical documentation during the transformation to a structured, computable format; a problem also encountered in other domains of categorical data analysis [15].

Our solution advances this paradigm by deterministically converting unstructured narratives into normalized relational records. This captures hierarchical clinical classifications (e.g., ICD-11 categories), automatically derived topics (with SNOMED CT mappings), precise temporal metadata, and complete provenance chains. This structured representation resolves the long-standing tension between narrative fidelity and computational utility, allowing efficient querying and retention of essential clinical context, as advocated in previous work on EHR data clustering for improved navigation [16].

3. Methodology

To develop and evaluate the Digital Medical Catalog, we employed a multistage methodology centered on processing real-world clinical narratives. This section details the data source, the preprocessing pipeline, and the unsupervised analytical framework used to transform raw text into a structured, searchable knowledge base. Figure 2 shows our comprehensive methodology.

3.1. Data Source and Collection

The development and evaluation of the Digital Medical Catalog framework was performed using the Microsoft Clinical Visit Note Summarization Corpus [17]. This publicly available data set provides a comprehensive collection of deidentified clinical visit notes that capture authentic clinical language patterns, ensuring that the catalog is built on and tested against data that accurately reflect the complexity of daily medical practice. The final corpus was augmented with synthesized temporal information to allow longitudinal trend analysis over a three-year period. This was achieved by programmatically assigning a random date to each document, distributing them uniformly across a simulated three-year timeline to enable robust temporal analysis.

The clinical corpus utilized in this study lacks dependable longitudinal timestamps that would support realistic temporal analysis. To assess the performance of the proposed temporal analytics module, we therefore generate and assign synthetic timestamps distributed uniformly over a three-year period. This approach makes it possible to verify the trend detection mechanism in a controlled setting. Crucially, the derived temporal patterns are intended to showcase methodological capabilities rather than to represent actual clinical or epidemiological dynamics.
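The timestamp assignment described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the start date, the function name `assign_synthetic_dates`, and the reuse of seed 42 for this step are assumptions made for the example.

```python
import random
from datetime import date, timedelta

def assign_synthetic_dates(n_docs, start=date(2021, 1, 1), years=3, seed=42):
    """Draw one uniformly random date per document over a simulated span.

    The start date and seed are illustrative assumptions, not values
    taken from the study. The default start avoids 29 February, so
    shifting the year is always valid.
    """
    rng = random.Random(seed)  # local generator keeps the draw reproducible
    end = date(start.year + years, start.month, start.day)
    span_days = (end - start).days
    return [start + timedelta(days=rng.randrange(span_days)) for _ in range(n_docs)]

dates = assign_synthetic_dates(1000)
```

Because the dates are sampled independently of document content, any trend computed from them reflects the sampling, which is why the resulting patterns can only illustrate the mechanism.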

3.2. Preprocessing Pipeline

The raw clinical text requires significant normalization before analysis. Our preprocessing pipeline consists of four standardized sequential phases designed to clean and structure the data while preserving critical medical meaning:

Medical tokenization: We apply a tokenizer with medical-specific rules to handle clinical abbreviations and shorthand effectively (e.g., variations such as “ECG” and “EKG” are unified for consistent analysis) [18].
Lemmatization with Clinical Lexicons: Tokens are lemmatized (e.g., “swollen” → “swell”) using processes guided by clinical lexicons and SNOMED CT mappings. This step standardizes terminology variants and connects surface forms to standardized medical concepts [19]. This normalization is crucial for effective downstream analysis, mitigating the challenges posed by the ambiguity of natural language, lexical variation, and the frequent appearance of abbreviations and acronyms in clinical notes [20].
Quality Assurance (QA): A final QA phase verifies data integrity, flagging incomplete notes and validating that critical medical terms survive the preprocessing stages intact.
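As a rough illustration of the abbreviation unification and the QA check listed above, the sketch below uses a one-entry hypothetical mapping (`ABBREVIATION_MAP`) in place of a real clinical lexicon. The function names are assumptions for the example, and lemmatization is omitted.

```python
import re

# Hypothetical variant-to-canonical map; a real pipeline would draw on a clinical lexicon.
ABBREVIATION_MAP = {"ekg": "ecg"}

def normalize_tokens(text):
    """Lowercase, split into alphanumeric tokens, and unify known variants."""
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    return [ABBREVIATION_MAP.get(tok, tok) for tok in tokens]

def critical_terms_survive(critical_terms, tokens):
    """QA check: each critical term, after the same mapping, is still present."""
    present = set(tokens)
    return all(
        ABBREVIATION_MAP.get(term.lower(), term.lower()) in present
        for term in critical_terms
    )

tokens = normalize_tokens("EKG and ECG ordered")
```

Here `tokens` contains the canonical form `ecg` twice, and `critical_terms_survive(["EKG"], tokens)` confirms that the term was preserved under its unified spelling.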

3.3. Integrated Semantic Analytical Framework

The core of our methodology involves a novel suite of unsupervised learning techniques designed to leverage semantic certainty over statistical association, as confirmed by our comparative analysis. All components were initialized with a fixed random state (42) to ensure reproducibility. This framework iteratively refines topic identification and concept extraction to provide robust insights into underlying clinical trends [21].

This integrated methodological approach transforms unstructured clinical notes into a dynamic, hierarchically organized knowledge base. The final structured outputs, including the assigned cluster labels, metadata, and trend tags, are organized into the Digital Medical Catalog (Figure 2). This modular architecture ensures that all analytical components can be independently upgraded (for example, replacing the embedding model) without disrupting the ingestion or the final output structure.

3.3.1. Semantic Feature Engineering

To distill clinically meaningful features from the text corpus, we employed a two-stage representation learning pipeline.

First, during Semantic Vectorization, we encoded each document into a dense semantic vector using the BioClinicalBERT model (emilyalsentzer/Bio_ClinicalBERT) [22,23], generating 768-dimensional embeddings via mean pooling of the final hidden layer to capture rich contextual meaning. This step contextualizes clinical jargon and is critical for accurate downstream categorization. Subsequently, during Dimensionality Reduction, these high-dimensional representations were projected into a lower-dimensional space using UMAP, a technique shown to enhance the discernibility of underlying cluster structures [24,25]. This process yielded the feature space used for clustering.
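Mean pooling, as used in the vectorization step, averages the token vectors of the final hidden layer into one document vector. The sketch below shows the arithmetic on toy three-dimensional vectors standing in for the 768-dimensional hidden states. Skipping padded positions via an attention mask is a common convention and is assumed here; the text does not state whether the authors mask padding.

```python
def mean_pool(hidden_states, attention_mask):
    """Average token vectors, skipping padded positions (mask == 0)."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [t / count for t in total]

# Toy token vectors; the last position is padding and is ignored.
token_vectors = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]
doc_vector = mean_pool(token_vectors, mask)  # [2.0, 3.0, 4.0]
```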

3.3.2. Semantic Clustering and Hyperparameter Validation

Our semantic clustering methodology employed three key components to ensure robust topic discovery.

For the clustering algorithm, we implemented K-Means within the BERTopic framework to guarantee a deterministic assignment of each document to a theme [26].

Through systematic validation of $N_{TOPICS}$, we established $N = 8$ as the optimal topic count via grid search, maximizing intrinsic topic coherence ($C_v$) while maintaining the integrity of the Silhouette score [27]. Finally, our Extrinsic Validation demonstrated high-fidelity classification with an F1-score of 0.877 against a synthetic gold standard using Hungarian algorithm mapping, confirming the structural consistency of the derived topics [28].

3.3.3. Hybrid Labeling and Auditable Grounding

We developed Compliance-Grade Auditable Grounding, a three-stage methodology for topic labeling that ensures auditability through statistical grounding. This hybrid approach constrains LLM interpretation with verifiable statistical data.

First, our Auditable Grounding process (Pattern Extraction) employs class-based TF-IDF (c-TF-IDF) to generate verifiable keyword patterns from clustered documents, establishing an empirical foundation for label generation. TF-IDF statistics are therefore not used for clustering; they are used only to produce this verifiable keyword pattern set. Second, Hybrid Labeling constrains LLM interpretation to these verified keyword sets, forcing semantically coherent topic titles to emerge directly from statistical evidence while mitigating hallucination risks [29]. Finally, Trend Analysis tracks emerging clinical patterns by calculating the relative frequency changes of validated semantic topics at quarterly intervals, enabling temporal monitoring of the evolution of healthcare documentation [30].
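The pattern-extraction stage can be illustrated with a small c-TF-IDF computation. The sketch follows the class-based formulation used by BERTopic, in which the weight of term t in cluster c is the cluster-normalized term frequency multiplied by log(1 + A / f_t), where A is the average number of tokens per cluster and f_t is the frequency of t across all clusters. Whether the authors use exactly this variant is an assumption, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter

def c_tf_idf(docs_by_cluster):
    """Compute class-based TF-IDF weights.

    docs_by_cluster: {cluster_id: [token list per document]}
    returns: {cluster_id: {term: weight}}
    """
    # Treat each cluster as one concatenated document.
    term_counts = {
        cluster: Counter(tok for doc in docs for tok in doc)
        for cluster, docs in docs_by_cluster.items()
    }
    corpus_counts = Counter()
    for counts in term_counts.values():
        corpus_counts.update(counts)
    avg_tokens = sum(corpus_counts.values()) / len(term_counts)
    weights = {}
    for cluster, counts in term_counts.items():
        cluster_total = sum(counts.values())
        weights[cluster] = {
            term: (n / cluster_total) * math.log(1 + avg_tokens / corpus_counts[term])
            for term, n in counts.items()
        }
    return weights

def top_keywords(term_weights, k=3):
    """Return the k highest-weighted terms, breaking ties alphabetically."""
    ranked = sorted(term_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

toy_clusters = {
    0: [["chest", "pain", "ecg"], ["ecg", "troponin"]],
    1: [["knee", "prosthesis", "pain"], ["prosthesis", "revision"]],
}
weights = c_tf_idf(toy_clusters)
```

In this toy corpus, `top_keywords(weights[0], 1)` returns `["ecg"]` and `top_keywords(weights[1], 1)` returns `["prosthesis"]`, while "pain", which appears in both clusters, is down-weighted in each. A keyword list of this kind is what the labeling stage would be constrained to.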

4. Results

Following the implementation of the Digital Medical Catalog framework and comparative processing of the clinical data set, we evaluated its performance on three dimensions: structural core validation (grouping metrics), extrinsic precision (classification metrics), and practical utility (temporal trend analysis).

4.1. Confidence Metric

We justify the choice of eight groups and present the confidence scores derived from Explainable AI (XAI) that quantify the reliability and separability of each semantic topic generated by the Digital Medical Catalog [31].

4.2. Justification for Eight Clusters

To identify the optimal number of semantic topics, we performed a grid search across $N \in \{4, 6, 8, 10, 12\}$ and evaluated each configuration using Topic Coherence ($C_{v}$), Silhouette Score, Davies–Bouldin Index (DBI), and the Explainability Confidence Index (ECI). The optimal topic count $N^{*}$ is defined as

$$N^{*} = \arg\max_{N} \left[ C_{v}(N) + \mathrm{Silhouette}(N) - \mathrm{DBI}(N) + \mathrm{ECI}(N) \right] \qquad (1)$$

The configuration $N = 8$ maximized the joint objective, resulting in the most stable, clinically coherent, and auditable topic structure. Smaller values merged distinct topics, while larger values produced fragmentation and semantic drift.
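The selection rule in Equation (1) reduces to an argmax over the candidate topic counts. The sketch below applies it to placeholder metric values that are invented purely to exercise the function and are not measurements from this study. The dictionary keys and the function name are likewise assumptions.

```python
def select_topic_count(metrics):
    """Return the N that maximizes C_v + Silhouette - DBI + ECI.

    metrics: {N: {"cv": ..., "silhouette": ..., "dbi": ..., "eci": ...}}
    """
    def objective(m):
        # DBI enters with a negative sign because lower values are better.
        return m["cv"] + m["silhouette"] - m["dbi"] + m["eci"]
    return max(metrics, key=lambda n: objective(metrics[n]))

# Placeholder values for illustration only (not results from the paper).
toy_metrics = {
    4: {"cv": 0.40, "silhouette": 0.20, "dbi": 1.50, "eci": 0.70},
    8: {"cv": 0.50, "silhouette": 0.30, "dbi": 1.00, "eci": 0.80},
    12: {"cv": 0.45, "silhouette": 0.25, "dbi": 1.30, "eci": 0.75},
}
best_n = select_topic_count(toy_metrics)
```

Note that the four terms are summed without rescaling, exactly as in Equation (1). Because the Davies–Bouldin Index is unbounded above while the other terms are bounded, it can dominate the sum, so a practical variant might normalize each metric first.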

4.3. Explainable AI Confidence Analysis

Document-level confidence is computed as a weighted combination of semantic distance and statistical grounding:

$$\mathrm{Conf}(d) = \alpha \left( 1 - \frac{\lVert E_{d} - \mu_{k} \rVert}{\max_{j} \lVert E_{d} - \mu_{j} \rVert} \right) + (1 - \alpha) \cdot \mathrm{cTFIDF}_{k}(d) \qquad (2)$$

with $\alpha = 0.65$, where $E_{d}$ is the BioClinicalBERT embedding for document $d$, $\mu_{k}$ is the centroid of the assigned cluster, and $\mathrm{cTFIDF}_{k}(d)$ quantifies the statistical keyword grounding.
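A worked instance of Equation (2) may help. The sketch below assumes the norm is Euclidean and that the grounding term is already scaled to [0, 1], neither of which is stated explicitly in the text. The function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def document_confidence(embedding, centroids, assigned, grounding, alpha=0.65):
    """Evaluate Conf(d) from Equation (2).

    embedding: document vector E_d
    centroids: list of cluster centroids (mu_j for every cluster j)
    assigned:  index k of the cluster the document belongs to
    grounding: cTFIDF_k(d), assumed to lie in [0, 1]
    """
    distances = [euclidean(embedding, c) for c in centroids]
    farthest = max(distances)
    # If the document coincides with every centroid, treat it as fully central.
    semantic = 1.0 if farthest == 0 else 1.0 - distances[assigned] / farthest
    return alpha * semantic + (1 - alpha) * grounding

# Distances to the two centroids are 5 and 10, so the semantic term is 0.5.
conf = document_confidence([0.0, 0.0], [[3.0, 4.0], [6.0, 8.0]], assigned=0, grounding=0.8)
```

With these inputs the score is 0.65 × 0.5 + 0.35 × 0.8 = 0.605. The semantic term is relative, since it measures the distance to the assigned centroid as a fraction of the distance to the farthest centroid.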

4.4. Cluster Confidence Summary

These diagnostics confirm strong semantic separability, stable topic boundaries, and robust grounding in interpretable keyword patterns, validating the eight-cluster configuration for the Digital Medical Catalog.

Table 1 presents the quantitative confidence metrics for each cluster across three dimensions. Semantic Density measures embedding-space compactness (higher values indicate tighter clustering), Statistical Grounding reflects c-TF-IDF peak sharpness, and the Explainability Confidence Index (ECI) serves as an integrated metric combining both measures. The ECI values range from 0.80 to 0.90, with Clinical Data Auditing (0.90) and Prosthetic Complications (0.89) showing the strongest overall confidence, while Surgical Follow-Up (0.80) and Infectious Disease (0.81) exhibit relatively lower but still acceptable scores—likely due to greater inherent variability in these clinical domains.

4.5. Quantitative Performance Metrics

To ensure a fair comparison between statistical and semantic approaches, all preprocessing steps, clustering algorithms, and the number of clusters were held constant across baselines. The TF-IDF representation therefore differs only in the feature space used for document encoding, while all other experimental conditions remain identical.

To assess the relative effectiveness of statistical and semantic methods, we compared the Integrated Semantic Framework (BioClinicalBERT) against a Statistical Baseline (TF-IDF). The metrics for structural quality (Silhouette, Davies–Bouldin) and extrinsic precision (F1-Score, Coherence) are presented in Table 2. The results demonstrate the inadequacy of traditional statistical methods for complex clinical narratives, where the baseline achieved near-random performance (F1-Score ≈ 0.100). Conversely, the Semantic Framework achieved high-fidelity classification, confirming its suitability for production-level deployment.

4.6. Automated Classification and Topic Extraction

The application of semantic clustering successfully organized the medical documents into eight highly coherent clusters ($N = 8$), as confirmed by the high $C_{v}$ score. These themes form the final, auditable categories of the Digital Medical Catalog. Cluster labels, generated using the hybrid labeling approach (constraining LLM output with statistical patterns), are detailed in Table 3. The table lists the eight catalog themes discovered by the hybrid framework, with document counts, three-year percentage changes, and key trend insights. The framework successfully isolated both the clinical disease themes and the crucial metadata governance themes required for distributed systems research, which are often conflated in traditional models.

4.7. Synthetic Gold Standard and Extrinsic Evaluation

Because the framework operates in an unsupervised setting, no clinically annotated ground-truth labels are available for supervised evaluation. To enable extrinsic assessment of structural consistency, we construct a synthetic gold standard that serves as a controlled reference rather than a clinical truth benchmark. Documents are assigned to a fixed number of reference classes that match the target clustering configuration (eight in this study). These reference labels are generated to form coherent thematic groupings based on predefined semantic criteria, allowing for the evaluation of whether the discovered clusters recover a comparable structural organization. To align the discovered clusters with the reference classes, we applied the Hungarian matching algorithm to compute an optimal one-to-one correspondence that maximizes agreement between cluster assignments and reference labels. This procedure avoids arbitrary label permutations and ensures that evaluation metrics reflect structural similarity rather than naming conventions.

Under this construction, the F1-score quantifies the degree to which the learned semantic organization aligns with the synthetic reference partition. It does not measure diagnostic accuracy or clinical precision, but rather assesses whether the framework reliably groups semantically related clinical narratives into stable and interpretable clusters under controlled conditions.
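The alignment-then-score procedure can be sketched as follows. For brevity, the example finds the optimal one-to-one mapping by exhaustive search over permutations, which returns the same assignment the Hungarian algorithm would for a small number of clusters (eight clusters give 40,320 permutations), though it is not the Hungarian algorithm itself. Function names and toy labels are assumptions.

```python
from itertools import permutations

def best_mapping(clusters, references, k):
    """Find the cluster -> reference-class mapping with maximal agreement.

    Exhaustive search stands in for the Hungarian algorithm; it is only
    practical for small k.
    """
    overlap = [[0] * k for _ in range(k)]
    for c, r in zip(clusters, references):
        overlap[c][r] += 1
    best = max(
        permutations(range(k)),
        key=lambda perm: sum(overlap[c][perm[c]] for c in range(k)),
    )
    return list(best)

def macro_f1(predicted, references, k):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for cls in range(k):
        tp = sum(p == cls and r == cls for p, r in zip(predicted, references))
        fp = sum(p == cls and r != cls for p, r in zip(predicted, references))
        fn = sum(p != cls and r == cls for p, r in zip(predicted, references))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / k

# Toy labels: cluster ids are arbitrary and must be aligned before scoring.
cluster_ids = [2, 2, 0, 0, 1, 1]
reference_ids = [0, 0, 1, 1, 2, 1]
mapping = best_mapping(cluster_ids, reference_ids, k=3)
aligned = [mapping[c] for c in cluster_ids]
score = macro_f1(aligned, reference_ids, k=3)
```

In this toy case the mapping sends cluster 2 to class 0, cluster 0 to class 1, and cluster 1 to class 2, and the per-class F1 values are 1.0, 0.8, and 2/3. Scoring the raw cluster ids without alignment would penalize a purely cosmetic difference in label numbering.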

4.8. Temporal Trend Analysis Findings

The utility of the system as a dynamic evidence stream is confirmed by the temporal analysis in Table 3, which tracks the relative frequency trajectories of clinical topics over a synthetic three-year period. Examining the percentage change column reveals distinct patterns across domains that demonstrate the framework’s monitoring capabilities.

A critical risk signal emerges from Cluster 2 (Surgical Planning and Prosthetic Complications), which exhibits the steepest positive trajectory with a substantial +25.5% increase in documentation frequency. This sharp upward trend, standing well above the average cluster change of +9.5%, validates the framework’s capacity to automatically flag emerging clinical safety signals through longitudinal monitoring. Other notable trajectories include Cluster 1 (Oncology Markers) with +18.1% growth, potentially reflecting increased cancer screening activities, and Cluster 6 (Advanced Cryptography) at +15.0%, suggesting growing attention to quantum-secured patient data. Conversely, Cluster 3 (Cardiology) shows a slight decline (−3.0%), while Cluster 7 (Surgical Follow-up) remains essentially stable (−1.2%). The uniform normalized document counts (2500 per cluster) ensure these percentage changes reflect genuine shifts in thematic emphasis rather than volume differences. This visually validates the emergence of a high-priority clinical risk signal.

These findings collectively demonstrate that the Integrated Semantic Framework can automatically classify unstructured clinical notes into meaningful high-precision categories and generate dynamic longitudinal insights essential for resource management and auditability. Because timestamps are assigned synthetically, any temporal patterns described in this section should be viewed solely as illustrative examples. They serve only to demonstrate the framework’s monitoring functionality. These trajectories are not evidence of causal effects or actual clinical change; instead, they demonstrate the system’s capacity to detect and display shifts in thematic focus over time.
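The quarterly trend computation can be sketched as below. The text specifies relative frequency changes at quarterly intervals but not how the three-year percentage change is derived, so comparing the first and last quarter is an assumption made for this example, as are the function names and toy records.

```python
from collections import Counter, defaultdict
from datetime import date

def quarterly_share(records):
    """Compute each topic's share of documents per quarter.

    records: iterable of (date, topic) pairs
    returns: {(year, quarter): {topic: share}}
    """
    counts = defaultdict(Counter)
    for day, topic in records:
        quarter = (day.month - 1) // 3 + 1
        counts[(day.year, quarter)][topic] += 1
    return {
        period: {topic: n / sum(c.values()) for topic, n in c.items()}
        for period, c in sorted(counts.items())
    }

def percent_change(shares, topic):
    """Relative change in a topic's share between the first and last quarter."""
    periods = sorted(shares)
    first = shares[periods[0]].get(topic, 0.0)
    last = shares[periods[-1]].get(topic, 0.0)
    if first == 0:
        return None  # undefined when the topic is absent at the start
    return 100.0 * (last - first) / first

# Toy records: topic "A" grows from half to three quarters of the documents.
toy_records = [
    (date(2021, 1, 5), "A"), (date(2021, 2, 9), "A"),
    (date(2021, 3, 1), "B"), (date(2021, 3, 20), "B"),
    (date(2023, 10, 2), "A"), (date(2023, 11, 7), "A"),
    (date(2023, 12, 1), "A"), (date(2023, 12, 15), "B"),
]
shares = quarterly_share(toy_records)
```

For these records, `percent_change(shares, "A")` is +50% and `percent_change(shares, "B")` is −50%. Working with shares rather than raw counts keeps the measure insensitive to changes in overall document volume.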

5. Discussion

The Digital Medical Catalog framework demonstrates the transformative potential of integrating natural language processing (NLP) and unsupervised learning to structure clinical narratives into a dynamic knowledge base. Our approach addresses a fundamental limitation in health informatics: the inability of traditional statistical methods to extract computable, semantically rich insights without sacrificing clinical fidelity. The temporal analyses presented in this work should be understood as a validation of system capability rather than as observational evidence of clinical evolution. In real-world deployments, meaningful temporal interpretation would require institution-specific timestamps, clinical context, and governance oversight, which are beyond the scope of the present study. By moving from token co-occurrence to contextual embeddings, the system reveals meaningful patterns that are structurally validated and readily actionable [32,33]. Specifically, the framework’s capacity to detect subtle semantic nuances in clinical text leads to a more precise and holistic understanding of healthcare data, which is essential for structured analysis and evaluation. Although the underlying pipeline shares structural similarities with existing topic-modeling approaches, the distinguishing contribution of this work lies in its system-level emphasis on auditability, governance, and traceable semantic organization rather than on topic discovery performance alone.

5.1. Interpretation of Principal Findings and Performance

The findings clearly demonstrate that it is essential to redirect analytical attention away from statistical NLP toward Semantic Feature Engineering [32].

5.1.1. Quantified Validation

Our Hybrid Semantic Framework (BioClinicalBERT → K-Means) achieved a structural and thematic consistency (F1-Score of 0.877), a dramatic improvement over the statistical baseline ($F1 \approx 0.100$). The high Topic Coherence ($C_{v}$ of 0.533) confirms the semantic soundness of the eight discovered clusters, ensuring that the outputs are highly interpretable by domain experts. This strong quantitative performance validates the decision to use advanced semantic embeddings, which effectively resolve the ambiguity and jargon challenges inherent in clinical documentation.

5.1.2. Auditable Grounding

A key achievement is the implementation of the Auditable Grounding approach. This method uses the stability of the TF-IDF keyword patterns to constrain LLM-generated labels. As demonstrated in our testing (Scenario 4), this constraint prevents hallucination and ensures that every final topic title and derived insight is traceable to verifiable evidence within the raw data. This mechanism provides confidence-aligned traceability that supports governance and audit requirements in regulated environments, subject to institutional validation and oversight.

5.1.3. Dynamic Insight Generation

The temporal analysis capability of the framework moves the system beyond static organization to active risk monitoring, as demonstrated in the temporal trend analysis (Section 4.8). By tracking percentage changes across the synthetic three-year span, the framework successfully identifies illustrative temporal shifts in clinical documentation patterns.

Because timestamps are synthetically assigned, these patterns demonstrate the functionality of the temporal analytics module rather than real-world clinical or institutional evolution. Indeed, as detailed in Table 3, notable trajectories include a substantial increase in Cluster 2 (Surgical Planning and Prosthetic Complications), signaling growing attention to surgical risks and prosthetic outcomes. Other clusters exhibit different patterns: Cluster 0 (Clinical Data Minimization and Access Auditing) shows stable but continued growth, while Cluster 3 (Cardiology) exhibits a slight decline.

5.2. Implications for Clinical Practice and Distributed Research

The practical implications of this framework directly support the transition toward decentralized, privacy-preserving healthcare.

For clinical practice, the catalog provides a mechanism for instant retrieval of similar clinical cases based on contextual meaning, rather than keyword matching, offering evidence-based decision support at the point of care. For researchers and administrators, the system functions as a dynamic surveillance tool. Automated detection of increasing complication rates, such as the $\approx 25.5\%$ increase observed in the prosthetic care group, can trigger immediate reviews of surgical protocols or device efficacy, allowing proactive healthcare management.
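Retrieval by contextual meaning is commonly implemented as a nearest-neighbour search over note embeddings. The sketch below assumes cosine similarity as the ranking measure and uses tiny three-dimensional vectors in place of real BioClinicalBERT embeddings. The note identifiers, vectors, and function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query, catalog, k=2):
    """Return the k (note_id, score) pairs most similar to `query`."""
    scored = [(note_id, cosine_similarity(query, vec)) for note_id, vec in catalog.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for note embeddings (real embeddings have hundreds of dimensions).
catalog = {
    "note_a": [0.9, 0.1, 0.0],
    "note_b": [0.8, 0.2, 0.1],
    "note_c": [0.0, 0.1, 0.9],
}
query_vector = [1.0, 0.0, 0.0]

results = top_k_similar(query_vector, catalog, k=2)
print([note_id for note_id, _ in results])
```

Because the ranking depends only on vector direction, two notes can match even when they share no surface keywords, which is the property the paragraph above relies on.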

Crucially, the high-fidelity F1-Score and Auditable Grounding ensure that the system can operate as a reliable metadata generator for advanced research initiatives, fulfilling a major requirement for secure, multi-institutional research collaboration. By transforming unstructured notes into structured, tagged data, the framework provides auditable metadata that may enable future integration with Federated Learning models for multi-institutional research; such integration is not evaluated in the scope of this study.

5.3. Limitations

Although the Digital Medical Catalog performed well in the eight-cluster configuration, several limitations emerged that directly affect the reliability and generalizability of the findings.

5.3.1. Imbalanced Cluster Structure in the Eight-Cluster Model

The eight clusters derived from the dataset were highly unbalanced, with sizes ranging from as few as three notes to more than twenty. Clusters with very small membership were sensitive to noise and produced unstable topic boundaries, limiting the interpretability of rare clinical themes. These imbalances also magnify fluctuations in temporal trend analysis, making conclusions about smaller clusters less reliable.
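A simple safeguard against over-interpreting small clusters is to flag them before any trend analysis. The sketch below assumes a hypothetical minimum size of five notes; neither the threshold nor the example assignments come from the study.

```python
from collections import Counter


def flag_small_clusters(assignments, min_size=5):
    """Return the sorted IDs of clusters with fewer than `min_size` members.

    `min_size` is an illustrative threshold, not a value from the paper.
    """
    sizes = Counter(assignments)
    return sorted(cluster for cluster, n in sizes.items() if n < min_size)


# Hypothetical cluster assignments: one cluster ID per note.
assignments = [0] * 22 + [1] * 12 + [2] * 3 + [3] * 8

small = flag_small_clusters(assignments)
print(small)
```

Trend figures for flagged clusters could then be reported with an explicit caveat or withheld until more notes are available.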

5.3.2. Embedding Bias and Topic Drift Across the Eight Clusters

Although clinical sentence embeddings improved cohesion within the eight clusters, they occasionally overemphasized lexical overlap, causing certain clusters, such as oncology-related themes, to absorb notes with only superficial biochemical similarity. This represents a concrete failure mode in which embedding bias influences the eight-cluster assignments, introducing topic drift and reducing interpretability in borderline cases.

Collectively, these limitations highlight the need for larger and more diverse datasets, multi-label or hierarchical extensions of the eight-cluster model, and domain-adaptive embedding strategies to improve robustness and real-world applicability.

6. Conclusions

The Digital Medical Catalog framework presented in this work successfully addresses a critical and persistent challenge in modern healthcare: the efficient transformation of unstructured clinical narratives into a structured, computable knowledge base without sacrificing the essential semantic nuance required for high-quality patient care. By integrating Semantic Feature Engineering (Bio-ClinicalBERT) with auditable statistical methods, the system moves past the limitations of traditional statistical NLP, which proved inadequate for clinical complexity (F1-Score ≈ 0.100).

The core achievement of our framework is demonstrated by its high-fidelity classification capability, which achieves an empirically validated F1-Score of 0.877 and a strong Topic Coherence ($C_{v} = 0.533$) in eight clinically relevant topics. This robust performance enables systematic organization of complex documentation and facilitates the discovery of longitudinal insights, such as the clear surge in documentation related to Surgical Planning and Prosthetic Complications.

Crucially, the Hybrid Labeling and Auditable Grounding approach transforms the system into a compliance-grade solution. By constraining LLM-generated topic titles with verifiable TF-IDF patterns, the framework mitigates hallucination risk and provides the necessary traceability for audit logs in sensitive, distributed healthcare environments.

Our current findings show significant strength in processing discrete clinical concepts. However, limitations remain when handling the intricate, multi-faceted narratives typical of patients with comorbidities, where single-label clustering proves an oversimplification. These challenges illuminate a clear path for future work:

Privacy-Preserving Architecture—Development of a Federated Learning architecture for secure, multi-institutional model collaboration, strengthened by advanced cryptographic techniques to ensure HIPAA and GDPR compliance.
Multifaceted Classification—Implementation of multi-label and hierarchical clustering approaches to accurately capture the complexity of patient records and correlation between clinical themes.
Multimodal Evolution—Expanding the framework into a comprehensive clinical intelligence platform by incorporating other data modalities, such as medical imaging reports and structured lab data.

This work establishes that context-aware computational approaches to clinical text analysis need not force a binary choice between analytical rigor and clinical fidelity. With careful and context-aware design, our framework paves the way for AI systems that serve as genuine, auditable partners in the mission to improve healthcare delivery.

Author Contributions

Conceptualization, J.B.R. and C.A.V.S.; software, J.B.R.; validation, J.B.R. and C.A.V.S.; writing—original draft preparation, J.B.R. and C.A.V.S.; writing—review and editing, J.B.R. and C.A.V.S.; supervision, funding acquisition, C.A.V.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The synthetic timestamp data used for temporal analysis were generated specifically for this study to demonstrate framework functionality. The underlying clinical text data cannot be publicly shared due to privacy restrictions, but the clustering outputs and synthetic temporal trajectories are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

Alsentzer, E.; Murphy, J.R.; Boag, W.; Weng, W.; Jin, D.; Naumann, T.; McDermott, M.B.A. Publicly Available Clinical BERT Embeddings. arXiv 2019, arXiv:1904.03323.
Sheridan, H.; Dellureficio, A.J.; Ratajeski, M.A.; Mannheimer, S.; Wheeler, T.R. Data Curation through Catalogs: A Repository-Independent Model for Data Discovery. J. eSci. Librariansh. 2021, 10, 3.
Yim, W.; Fu, Y.; Ben Abacha, A.; Snider, N.; Lin, T.; Yetisgen, M. Aci-bench: A Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation. Sci. Data 2023, 10, 586.
Tavabi, N.; Singh, M.; Pruneski, J.; Kiapour, A.M. Systematic Evaluation of Common Natural Language Processing Techniques to Codify Clinical Notes. PLoS ONE 2024, 19, e0298892.
Noroozizadeh, S.; Kumar, S.; Chen, G.H.; Weiss, J.C. PMOA-TTS: Introducing the PubMed Open Access Textual Times Series Corpus. arXiv 2025, arXiv:2505.20323.
Riquelme, A.; Costa, P.; Martinez, C. Large Language Models for Automating Clinical Data Standardization: HL7 FHIR Use Case. arXiv 2025, arXiv:2507.03067.
Deshpande, P.; Rasin, A.; Tchoua, R.; Furst, J.; Raicu, D.; Schinkel, M.; Trivedi, H.; Antani, S. Biomedical Heterogeneous Data Categorization and Schema Mapping Toward Data Integration. Front. Big Data 2023, 6, 1173038.
Li, D.; Kadav, A.; Gao, A.; Li, R.; Bourgon, R. Automated Clinical Data Extraction with Knowledge Conditioned LLMs. arXiv 2024, arXiv:2406.18027.
Zhang, Z.; Lin, X.; Wu, S. A Hybrid Algorithm for Clinical Decision Support in Precision Medicine Based on Machine Learning. BMC Bioinf. 2023, 24, 3.
Ehrlinger, L.; Schrott, J.; Melichar, M.; Kirchmayr, N.; Wöß, W. Data Catalogs: A Systematic Literature Review and Guidelines to Implementation. In Database and Expert Systems Applications—DEXA 2021 Workshops; Kotsis, G., Tjoa, A.M., Khalil, I., Moser, B., Mashkoor, A., Sametinger, J., Fensel, A., Martinez-Gil, J., Fischer, L., Czech, G., et al., Eds.; Springer: Cham, Switzerland, 2021; pp. 148–158.
Adnan, K.; Akbar, R. An Analytical Study of Information Extraction from Unstructured and Multidimensional Big Data. J. Big Data 2019, 6, 91.
Kumari, J.; Kumar, E.; Kumar, D. A Structured Analysis to Study the Role of Machine Learning and Deep Learning in the Healthcare Sector with Big Data Analytics. Arch. Comput. Methods Eng. 2023, 30, 3673–3701.
Kocaman, V.; Talby, D. Accurate Clinical and Biomedical Named Entity Recognition at Scale. Softw. Impacts 2022, 13, 100373.
Mao, Z.; Xu, Y.; Suarez, E. Dataset Management Platform for Machine Learning. Technical Disclosure Commons, 2023. Available online: https://www.tdcommons.org/dpubs_series/5690 (accessed on 27 March 2025).
Zhao, H.; Zhang, H.; Liu, Y.; Zhang, Y.; Zhang, X.L. Pattern Discovery: A Progressive Visual Analytic Design to Support Categorical Data Analysis. J. Vis. Lang. Comput. 2017, 43, 42–49.
Kreuzthaler, M.; Pfeifer, B.; Vera Ramos, J.A.; Kramer, D.; Grogger, V.; Bredenfeldt, S.; Pedevilla, M.; Krisper, P.; Schulz, S. EHR Problem List Clustering for Improved Topic-Space Navigation. BMC Med. Inform. Decis. Mak. 2019, 19, 72.
Ben Abacha, A.; Yim, W.-W.; Fan, Y.; Lin, T. An Empirical Study of Clinical Note Generation from Doctor–Patient Encounters. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics; Association for Computational Linguistics: Dubrovnik, Croatia, 2023; pp. 2291–2302. Available online: https://aclanthology.org/2023.eacl-main.168 (accessed on 5 January 2026).
Yu, D. Natural Language Processing Approaches for Monitoring Health Activities. Ph.D. Thesis, University of Michigan, Ann Arbor, MI, USA, 2024. Available online: https://hdl.handle.net/2027.42/194449 (accessed on 31 October 2025).
Abbas, A.; Afzal, M.; Hussain, J.; Ali, T.; Bilal, H.S.M.; Lee, S.; Jeon, S. Clinical Concept Extraction with Lexical Semantics to Support Automatic Annotation. Int. J. Environ. Res. Public Health 2021, 18, 10564.
Banerjee, I.; Gensheimer, M.F.; Wood, D.J.; Henry, S.; Aggarwal, S.; Chang, D.T.; Rubin, D.L. Probabilistic Prognostic Estimates of Survival in Metastatic Cancer Patients (PPES-Met) Utilizing Free-Text Clinical Narratives. Sci. Rep. 2018, 8, 10037.
Paiva, B.; Gonçalves, M.A.; Rocha, L.C.D.; Marcolino, M.S.; Lana, F.C.B.; Souza-Silva, M.V.R.; Almeida, J.M.; Pereira, P.D.; Andrade, C.M.V.; Gomes, A.G.R.; et al. A New Natural Language Processing–Inspired Methodology (Detection, Initial Characterization, and Semantic Characterization) to Investigate Temporal Shifts (Drifts) in Health Care Data: Quantitative Study. JMIR Med. Inform. 2024, 12, e54246.
Abbas, A.; Lee, M.; Shanavas, N.; Kovatchev, V. Clinical Concept Annotation with Contextual Word Embedding in Active Transfer Learning Environment. Digit. Health 2024, 10, 20552076241308987.
Huang, K.; Altosaar, J.; Ranganath, R. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. arXiv 2020, arXiv:1904.05342.
Zamora-Resendiz, R.; Khuram, I.; Crivelli, S. Towards Maps of Disease Progression: Biomedical Large Language Model Latent Spaces for Representing Disease Phenotypes and Pseudotime. medRxiv 2024, 2024.06.16.24308979.
McInnes, L.; Healy, J.; Saul, N.; Großberger, L. UMAP: Uniform Manifold Approximation and Projection. J. Open Source Softw. 2018, 3, 861.
Mersha, M.A.; Gemeda Yigezu, M.; Kalita, J. Semantic-Driven Topic Modeling Using Transformer-Based Embeddings and Clustering Algorithms. Procedia Comput. Sci. 2024, 244, 121–132.
Koltcov, S.; Surkov, A.; Filippov, V.; Ignatenko, V. Topic Models with Elements of Neural Networks: Investigation of Stability, Coherence, and Determining the Optimal Number of Topics. PeerJ Comput. Sci. 2024, 10, e1758.
Mu, Y.; Tizhoosh, H.R.; Tayebi, R.M.; Ross, C.; Sur, M.; Leber, B.; Campbell, C.J.V. A BERT Model Generates Diagnostically Relevant Semantic Embeddings from Pathology Synopses with Active Learning. Commun. Med. 2021, 1, 11.
Quevedo, E.; Salazar, J.Y.; Koerner, R.; Rivas, P.; Cerny, T. Detecting Hallucinations in Large Language Model Generation: A Token Probability Approach. In Artificial Intelligence and Applications; Arabnia, H.R., Deligiannidis, L., Amirian, S., Shenavarmasouleh, F., Ghareh Mohammadi, F., de la Fuente, D., Eds.; Springer Nature: Cham, Switzerland, 2025; pp. 154–173.
Gupta, P.; Ding, B.; Guan, C.; Ding, D. Generative AI: A Systematic Review Using Topic Modelling Techniques. Data Inf. Manag. 2024, 8, 100066.
Lötsch, J.; Kringel, D.; Ultsch, A. Explainable Artificial Intelligence (XAI) in Biomedicine: Making AI Decisions Trustworthy for Physicians and Patients. BioMedInformatics 2022, 2, 1–17.
Jiang, L.Y.; Liu, X.C.; Nejatian, N.P.; Nasir-Moin, M.; Wang, D.; Abidin, A.; Eaton, K.; Riina, H.A.; Laufer, I.; Punjabi, P.; et al. Health system-scale language models are all-purpose prediction engines. Nature 2023, 619, 357–362.
Johnson, R.; Gottlieb, U.; Shaham, G.; Eisen, L.; Waxman, J.; Devons-Sberro, S.; Ginder, C.R.; Hong, P.; Sayeed, R.; Su, X.; et al. ClinVec: Unified Embeddings of Clinical Codes Enable Knowledge-Grounded AI in Medicine. medRxiv 2025.

Figure 1. Digital Medical Catalog Creation and Usage. Each participating hospital independently creates a local digital catalog from deidentified clinical notes. These catalogs feature only topic titles, labels, timestamps, and references, never containing patient-specific data, thus guaranteeing complete privacy and auditability. The local catalogs contribute to a federated learning framework where only model updates are shared, allowing joint optimization without centralizing sensitive information. The aggregated models support semantic trend analysis and drill-down visualization across hierarchical levels (topic, subtheme, and case pointers), enabling secure and interpretable clinical knowledge discovery.

Figure 2. Digital Medical Catalog Methodology illustrating the workflow and components.

Table 1. Confidence Metrics for the Eight Semantic Clusters. Semantic Density measures embedding-space compactness, Statistical Grounding reflects c-TF-IDF peak sharpness, and the Explainability Confidence Index (ECI) is the integrated explainability metric.

Cluster	Semantic Density	Statistical Grounding	ECI
0: Clinical Data Auditing	0.91	0.88	0.90
1: Oncology Markers	0.87	0.82	0.85
2: Prosthetic Complications	0.92	0.84	0.89
3: Cardiology	0.86	0.81	0.84
4: Infectious Disease	0.83	0.79	0.81
5: Genomic Data Security	0.89	0.86	0.88
6: Quantum Cryptography	0.88	0.84	0.86
7: Surgical Follow-Up	0.82	0.77	0.80

Table 2. Quantitative comparison of statistical (TF-IDF) and semantic (BioClinicalBERT) clustering under controlled conditions. All preprocessing, clustering algorithms, and cluster counts are identical across methods; only document representation differs. The F1-score reflects structural fidelity to a synthetic reference partition rather than clinical accuracy.

Scenario	Silhouette Score	Davies–Bouldin Index	Topic Coherence ( $C_{v}$ )	Mapped F1-Score (Weighted)
Statistical Baseline (TF-IDF/K-Means)	0.038	4.338	0.100	0.100
Hybrid Semantic Framework (BERT/K-Means)	0.239	1.611	0.533	0.877

Note: Bold values in the Semantic Framework row highlight the superior performance of this approach across all metrics, with the F1-score (0.877) demonstrating particularly strong alignment with the reference structure. For Davies–Bouldin Index, lower values indicate better cluster separation; the Semantic Framework’s reduction from 4.338 to 1.611 represents a substantial improvement in cluster quality.
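To make the "Mapped F1-Score (Weighted)" column concrete, the sketch below shows one common way to score a clustering against a reference partition. It assumes majority-vote mapping of each cluster to a reference label, followed by a support-weighted F1. The paper's actual mapping procedure may differ (for example, an optimal one-to-one assignment), and the labels and assignments here are hypothetical.

```python
from collections import Counter, defaultdict


def map_clusters_to_labels(cluster_ids, reference_labels):
    """Map each cluster to the most frequent reference label among its members."""
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, reference_labels):
        members[cluster].append(label)
    return {c: Counter(labels).most_common(1)[0][0] for c, labels in members.items()}


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * n / total
    return score


# Hypothetical reference partition and cluster assignments (illustration only).
reference = ["cardio", "cardio", "cardio", "onc", "onc", "onc"]
clusters = [0, 0, 1, 1, 1, 1]

mapping = map_clusters_to_labels(clusters, reference)
predicted = [mapping[c] for c in clusters]
score = weighted_f1(reference, predicted)
print(mapping, round(score, 3))
```

Majority-vote mapping can assign several clusters to the same reference label, so it tends to give an optimistic score when clusters fragment a class; a one-to-one assignment is stricter.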

Table 3. Digital Medical Catalog Themes and Trends (Hybrid Framework Output). For each discovered cluster we report the LLM-constrained topic label, normalized document counts used for comparative illustration, the percentage change across a synthetically assigned three-year span, and the primary illustrative trend insight.

Cluster	LLM-Generated Topic Label	Normalized Document Count	Percentage Change (3-Year Span)	Primary Trend Insight
0	Clinical Data Minimization and Access Auditing (HIPAA/GDPR Compliance)	2500	+5.2%	Stable growth in data access logs.
1	Oncology Marker and Cancer Detection	2500	+18.1%	Significant rise in enzyme marker mentions.
2	Surgical Planning and Prosthetic Complications	2500	+25.5%	Sharp increase in prosthetic failure events.
3	Cardiology: Acute Chest Pain and Readmission Risk	2500	−3.0%	Slight decline in acute cardiovascular cases.
4	Infectious Disease: HIV and Colorectal Screening	2500	+12.9%	Moderate growth in co-morbidity screening records.
5	Genomic Data Security and Federated Learning	2500	+8.8%	Increasing pilot usage of federated platforms.
6	Advanced Cryptography for Patient Data (Quantum)	2500	+15.0%	New research focus on quantum-secured clusters.
7	General Surgical Follow-up and Recovery	2500	−1.2%	Stable volume of routine surgical notes.

