1. Introduction
Healthcare systems generate a large amount of unstructured data daily, including clinical notes, diagnostic reports, and imaging descriptions. Although these narratives offer rich clinical insight, their unstructured nature poses significant challenges for data management, interoperability, and large-scale analysis [
1]. The absence of standardized formats and terminologies limits the extraction of actionable knowledge and complicates data integration across institutions. Traditional manual curation remains labor intensive, error prone, and unscalable, creating persistent bottlenecks in clinical workflows. As healthcare organizations advance their digital transformation, the need for scalable and automated approaches to process unstructured medical data has become increasingly critical. These approaches are now essential to effectively handle the growing volume and complexity of medical information. This paper addresses this challenge by proposing an AI-powered Digital Medical Catalog, or DMC, designed to automatically classify, analyze, and organize diverse medical data types, thus transforming raw information into structured, actionable insights. The novelty of this work does not lie in individual NLP components such as contextual embeddings, dimensionality reduction, clustering, or keyword extraction, which are well established in existing topic-modeling frameworks including BioClinicalBERT. Rather, the contribution is a governance-oriented system design that integrates semantic representation with compliance-grade statistical grounding, an explicit Explainability Confidence Index (ECI), and a catalog abstraction that clearly separates modeling from auditability. By emphasizing traceability, confidence signaling, and structural transparency, the proposed Digital Medical Catalog is designed to support governance-ready analytics rather than to optimize task-specific topic discovery alone. While the framework is designed to enable downstream applications such as privacy-preserving analytics and federated learning, these aspects are presented as future integrations and are not evaluated in the scope of this study.
The integration of artificial intelligence (AI) and natural language processing (NLP) into clinical settings holds considerable promise, with domain-specific large language models (LLMs) such as BioClinicalBERT demonstrating robust performance in the analysis of complex medical semantics and contextual relationships within medical text [1,2]. Nevertheless, translating these technical advances into routine practice remains hampered by persistent operational challenges. Seamlessly embedding these NLP capabilities in established clinical workflows and managing the profound heterogeneity of healthcare data sources continue to be significant barriers to widespread implementation [3].
To address these limitations, we propose the Digital Medical Catalog (DMC), a framework that automates the classification, organization, and semantic analysis of unstructured clinical narratives [4]. The DMC applies advanced NLP methods to group clinical texts into semantically related categories, extract and classify key medical entities (e.g., symptoms, diagnoses, treatments), and perform longitudinal trend analyses to uncover temporal patterns in medical documentation. The resulting output is a structured and hierarchical catalog enriched with metadata such as timestamps, frequency metrics, and topic classifications [5]. This transformation enables the rapid retrieval of clinically relevant information and supports the automated detection of emerging trends that can be overlooked in manual review. Furthermore, the DMC integrates with existing health information systems to facilitate interoperability and offers a dynamic searchable interface for clinicians and researchers to explore and query medical data efficiently [6]. As shown in Figure 1, the DMC framework integrates local catalog creation, federated model training, and trend analysis into a unified pipeline, ensuring data sovereignty and auditability across institutions.
This work examines whether free-text clinical narratives can automatically be converted into a structured, auditable knowledge base that maintains semantic accuracy while supporting governance-ready analytics and temporal monitoring [
7]. To ensure the reliability of this automated processing, we introduce a validation mechanism that hybridizes semantic and statistical methods, using TF-IDF grounding to constrain the model output and mitigate factual errors [
8]. Finally, we demonstrate that this cataloged and validated data foundation enables dynamic risk monitoring, where temporal analytics uncover critical trends for decision support and public health surveillance [
9]. Together, these elements deliver a system capable of high-fidelity classification (Macro F1 = 0.877), auditable AI outputs, and proactive clinical insight generation.
The remainder of this paper is organized as follows. Section 2 examines existing efforts in automated data cataloging, data stewardship, and database integration within the healthcare sector. Section 3 outlines the proposed Digital Medical Catalog framework, detailing the data preprocessing pipeline, the semantic feature engineering technique, and the hybrid labeling strategy. Section 4 presents the experimental outcomes, including quantitative performance metrics and an analysis of temporal trends. Section 5 discusses the implications of the findings, the limitations of the current methodology, and potential avenues for future work. Lastly, Section 6 summarizes the principal contributions and concludes the paper.
The organizational structure ensures a logical flow from problem contextualization through solution development to empirical validation and critical analysis, providing readers with a clear understanding of both the theoretical foundations and practical implementations of our Digital Medical Catalog framework.
Although the underlying pipeline shares structural similarities with existing topic-modeling approaches, the distinguishing contribution of this work lies in its system-level emphasis on auditability, governance, and traceable semantic organization rather than on topic discovery performance alone.
Unlike many recent clinical NLP systems that focus on predictive accuracy or summarization, the Digital Medical Catalog is intended as an infrastructure layer that supports review, oversight, and downstream analytics through explicit confidence estimation and grounding mechanisms.
Claims regarding deployment readiness should therefore be interpreted in terms of design alignment rather than operational validation, as real-world deployment would require institutional integration, clinical oversight, and compliance verification beyond the scope of this work.
2. Literature Review
The challenge of extracting structured knowledge from unstructured clinical text has been a persistent focus of healthcare informatics research. This section reviews existing approaches to automated data cataloging, data stewardship, and database integration for healthcare data, highlighting the limitations that motivate our integrated framework.
2.1. Automated Data Cataloging in Healthcare
Traditional data catalogs function as centralized inventories for data assets, significantly improving discoverability and governance of structured data [
10]. However, their reliance on manual metadata entry makes them impractical for the volume and heterogeneity of unstructured clinical narratives [
2,
11]. This scalability gap has driven the development of automated solutions.
Recent advances in Natural Language Processing and Large Language Models offer transformative potential. Systems such as BioClinicalBERT [
1] have demonstrated robust capabilities in generating semantically meaningful representations from clinical text, providing a foundation for automated metadata extraction. These techniques surpass manual methods in both scale and accuracy. Despite this technical promise, significant adoption barriers remain, particularly in terms of seamless integration into clinical workflows and handling the wide diversity of documentation styles found in real-world settings [
12,
13].
Our Digital Medical Catalog framework advances the state of the art by integrating automated classification and topic extraction into an end-to-end system. It addresses the critical challenges of scale and heterogeneity that have limited previous approaches, while maintaining interoperability with existing Electronic Health Record (EHR) infrastructures.
2.2. Data Stewardship and Governance
In healthcare, data stewardship encompasses the entire data lifecycle, requiring stringent quality control, privacy safeguards (e.g., HIPAA, GDPR), and context-aware access management. The sensitivity of patient data requires specialized protections, including end-to-end encryption, granular audit trails, and metadata frameworks that preserve both clinical meaning and regulatory compliance.
The foundational organizational frameworks for data stewardship have been established in prior work [
10]. However, these models often depend on manual interventions, rendering them inadequate for the automated high-frequency processing demands of modern clinical data streams [
14]. Maintaining a medical data catalog is particularly resource intensive due to the convergence of heterogeneous information systems, constantly evolving medical terminology, and strict compliance requirements, a challenge also noted in ML data management contexts [
14].
Our framework addresses these limitations through an automated stewardship architecture that combines NLP-driven classification with continuous catalog synchronization. This approach enforces standardized quality protocols and immutable lineage tracking, creating governance-ready data assets that inherently satisfy healthcare’s requirements for auditability and semantic consistency.
2.3. Database Integration and Data Management
Effective use of processed clinical narratives requires storage architectures that balance high-speed data ingestion, low-latency retrieval, and sophisticated semantic query capabilities. Traditional systems often prioritized metadata management over content exploitation, but modern approaches increasingly adopt hybrid models that combine NoSQL scalability with relational query robustness.
Although previous research [
10] has demonstrated the benefits of tightly coupling data catalogs with storage systems to improve schema awareness, these frameworks require extension to handle the unique semantics of clinical data. A key challenge is to preserve the nuanced context of the original clinical documentation during the transformation to a structured, computable format; a problem also encountered in other domains of categorical data analysis [
15].
Our solution advances this paradigm by deterministically converting unstructured narratives into normalized relational records. This captures hierarchical clinical classifications (e.g., ICD-11 categories), automatically derived topics (with SNOMED CT mappings), precise temporal metadata, and complete provenance chains. This structured representation resolves the long-standing tension between narrative fidelity and computational utility, allowing efficient querying and retention of essential clinical context, as advocated in previous work on EHR data clustering for improved navigation [
16].
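The structured representation described above can be illustrated with a small relational sketch. This is a minimal example under stated assumptions, not the schema used in this work: the table name `catalog_entry`, its column names, and the sample rows are all hypothetical, and the classification codes are placeholders rather than real ICD-11 or SNOMED CT identifiers.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions,
# not the catalog's actual layout.
SCHEMA = """
CREATE TABLE catalog_entry (
    entry_id    INTEGER PRIMARY KEY,
    source_doc  TEXT NOT NULL,  -- provenance: identifier of the original note
    recorded_at TEXT NOT NULL,  -- ISO 8601 timestamp
    category    TEXT NOT NULL,  -- hierarchical classification code (placeholder)
    topic       TEXT NOT NULL   -- automatically derived topic label
);
"""


def build_catalog(rows):
    """Create an in-memory catalog and load (source_doc, recorded_at, category, topic) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO catalog_entry (source_doc, recorded_at, category, topic) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn


def topic_counts(conn):
    """Return the number of catalog entries per topic."""
    cur = conn.execute(
        "SELECT topic, COUNT(*) FROM catalog_entry GROUP BY topic ORDER BY topic"
    )
    return dict(cur.fetchall())


# Toy rows with placeholder identifiers and codes.
rows = [
    ("note-001", "2023-01-15T09:30:00", "category-A", "Cardiology"),
    ("note-002", "2023-02-03T14:10:00", "category-A", "Cardiology"),
    ("note-003", "2023-02-20T11:45:00", "category-B", "Oncology Markers"),
]
conn = build_catalog(rows)
counts = topic_counts(conn)
```

Keeping the source document identifier in every row is what allows a query result to be traced back to the original narrative.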
4. Results
Following the implementation of the Digital Medical Catalog framework and comparative processing of the clinical data set, we evaluated its performance on three dimensions: structural core validation (grouping metrics), extrinsic precision (classification metrics), and practical utility (temporal trend analysis).
4.1. Confidence Metric
We justify the choice of eight groups and present the confidence scores derived from Explainable AI (XAI) that quantify the reliability and separability of each semantic topic generated by the Digital Medical Catalog [
31].
4.2. Justification for Eight Clusters
To identify the optimal number of semantic topics, we performed a grid search across candidate cluster counts $k$ and evaluated each configuration using Topic Coherence, Silhouette Score, Davies–Bouldin Index, and the Explainability Confidence Index (ECI), which integrates semantic density and statistical grounding. The configuration $k = 8$ maximized the joint objective, resulting in the most stable, clinically coherent, and auditable topic structure. Smaller values of $k$ merged distinct topics, while larger values produced fragmentation and semantic drift.
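One ingredient of such a model-selection step, the Silhouette Score, can be sketched in a few lines. The example below is a simplified illustration that assumes Euclidean distance and compares precomputed candidate partitions on toy two-dimensional points. It does not reproduce the joint objective used in this study, which also involves coherence, the Davies–Bouldin Index, and the ECI. The function names are hypothetical.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient; points in singleton clusters contribute 0."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette requires at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # s(i) = 0 by convention for singletons
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


def best_partition(points, candidates):
    """candidates maps a cluster count k to a label list; return the best k and all scores."""
    scores = {k: silhouette_score(points, labels) for k, labels in candidates.items()}
    return max(scores, key=scores.get), scores


# Toy data: three well-separated pairs of points.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (20, 0), (20, 1)]
candidates = {
    2: [0, 0, 0, 0, 1, 1],  # merges two natural groups
    3: [0, 0, 1, 1, 2, 2],  # matches the natural grouping
}
best_k, scores = best_partition(points, candidates)
```

On this toy data the three-cluster partition scores higher than the two-cluster one, mirroring the observation that too few clusters merge distinct groups.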
4.3. Explainable AI Confidence Analysis
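Document-level confidence is computed as a weighted combination of semantic distance and statistical grounding:
$$\mathrm{Conf}(d) = \alpha \, S(\mathbf{e}_d, \mathbf{c}_{k}) + \beta \, G(d),$$
with weights $\alpha$ and $\beta$, where $\mathbf{e}_d$ is the BioClinicalBERT embedding for document $d$, $\mathbf{c}_{k}$ is the centroid of the assigned cluster, $S(\cdot,\cdot)$ is the semantic term relating the two, and $G(d)$ quantifies the statistical keyword grounding.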
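A weighted combination of this kind can be sketched as follows. The example rests on several assumptions that are not taken from this study: cosine similarity stands in for the semantic term, the two weights sum to one, the grounding score lies in [0, 1], and the default weight of 0.5 is an arbitrary placeholder. The function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def document_confidence(embedding, centroid, grounding, w_semantic=0.5):
    """Weighted mix of a semantic term and a grounding score.

    Assumptions (not from the paper): cosine similarity clipped at 0 is the
    semantic term, the grounding weight is 1 - w_semantic, and the 0.5
    default is a placeholder.
    """
    if not 0.0 <= w_semantic <= 1.0:
        raise ValueError("w_semantic must lie in [0, 1]")
    if not 0.0 <= grounding <= 1.0:
        raise ValueError("grounding must lie in [0, 1]")
    semantic = max(0.0, cosine_similarity(embedding, centroid))
    return w_semantic * semantic + (1.0 - w_semantic) * grounding


# A document aligned with its centroid and fully grounded scores 1.0;
# an orthogonal document with weak grounding scores much lower.
high = document_confidence([1.0, 0.0], [1.0, 0.0], grounding=1.0)
low = document_confidence([1.0, 0.0], [0.0, 1.0], grounding=0.4)
```

Clipping the similarity at zero keeps the resulting score in [0, 1], which makes it easier to compare across clusters.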
4.4. Cluster Confidence Summary
These diagnostics confirm strong semantic separability, stable topic boundaries, and robust grounding in interpretable keyword patterns, validating the eight-cluster configuration for the Digital Medical Catalog.
Table 1 presents the quantitative confidence metrics for each cluster across three dimensions. Semantic Density measures embedding-space compactness (higher values indicate tighter clustering), Statistical Grounding reflects c-TF-IDF peak sharpness, and the Explainability Confidence Index (ECI) serves as an integrated metric combining both measures. The ECI values range from 0.80 to 0.90, with Clinical Data Auditing (0.90) and Prosthetic Complications (0.89) showing the strongest overall confidence, while Surgical Follow-Up (0.80) and Infectious Disease (0.81) exhibit relatively lower but still acceptable scores—likely due to greater inherent variability in these clinical domains.
4.5. Quantitative Performance Metrics
To ensure a fair comparison between statistical and semantic approaches, all preprocessing steps, clustering algorithms, and the number of clusters were held constant across baselines. The TF-IDF representation therefore differs only in the feature space used for document encoding, while all other experimental conditions remain identical.
To assess the relative effectiveness of statistical and semantic methods, we compared the Integrated Semantic Framework (Bio-ClinicalBERT) against a Statistical Baseline (TF-IDF). The metrics for structural quality (Silhouette, Davies–Bouldin) and extrinsic precision (F1-Score, Coherence) are presented in Table 2. The results demonstrate the inadequacy of traditional statistical methods for complex clinical narratives, where the baseline achieved near-random performance (F1-Score ≈ 0.100). Conversely, the Semantic Framework achieved high-fidelity classification (F1-Score = 0.877), confirming its suitability for production-level deployment.
4.6. Automated Classification and Topic Extraction
The application of semantic clustering successfully organized the medical documents into eight highly coherent clusters ($k = 8$), as confirmed by the high topic coherence score. These themes form the final, auditable categories of the Digital Medical Catalog. Cluster labels, generated using the hybrid labeling approach (constraining LLM output with statistical patterns), are detailed in Table 3. The table lists the eight catalog themes discovered by the hybrid framework, with document counts, three-year percentage changes, and key trend insights. The framework successfully isolated both the clinical disease themes and the crucial metadata governance themes required for distributed systems research, which are often conflated in traditional models.
4.7. Synthetic Gold Standard and Extrinsic Evaluation
Because the framework operates in an unsupervised setting, no clinically annotated ground-truth labels are available for supervised evaluation. To enable extrinsic assessment of structural consistency, we construct a synthetic gold standard that serves as a controlled reference rather than a clinical truth benchmark. Documents are assigned to a fixed number of reference classes that match the target clustering configuration (eight in this study). These reference labels are generated to form coherent thematic groupings based on predefined semantic criteria, allowing for the evaluation of whether the discovered clusters recover a comparable structural organization. To align the discovered clusters with the reference classes, we applied the Hungarian matching algorithm to compute an optimal one-to-one correspondence that maximizes agreement between cluster assignments and reference labels. This procedure avoids arbitrary label permutations and ensures that evaluation metrics reflect structural similarity rather than naming conventions.
Under this construction, the F1-score quantifies the degree to which the learned semantic organization aligns with the synthetic reference partition. It does not measure diagnostic accuracy or clinical precision, but rather assesses whether the framework reliably groups semantically related clinical narratives into stable and interpretable clusters under controlled conditions.
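The alignment and scoring procedure can be sketched on toy labels. For brevity, the example below replaces the Hungarian algorithm with an exhaustive search over label permutations. This yields the same optimal one-to-one mapping but is practical only for a small number of clusters, and a dedicated assignment solver such as SciPy's `linear_sum_assignment` would normally be preferred. The function names and toy labels are hypothetical.

```python
from itertools import permutations


def align_clusters(pred, ref, k):
    """Find the one-to-one cluster -> class mapping that maximizes agreement.

    Exhaustive search over permutations (a stand-in for Hungarian matching).
    """
    # contingency[c][r] = number of documents in cluster c with reference class r
    contingency = [[0] * k for _ in range(k)]
    for p, r in zip(pred, ref):
        contingency[p][r] += 1
    best = max(
        permutations(range(k)),
        key=lambda perm: sum(contingency[c][perm[c]] for c in range(k)),
    )
    return dict(enumerate(best))


def macro_f1(pred, ref, k):
    """Macro-averaged F1 after aligning cluster ids to reference classes."""
    mapping = align_clusters(pred, ref, k)
    mapped = [mapping[p] for p in pred]
    f1_scores = []
    for cls in range(k):
        tp = sum(1 for m, r in zip(mapped, ref) if m == cls and r == cls)
        fp = sum(1 for m, r in zip(mapped, ref) if m == cls and r != cls)
        fn = sum(1 for m, r in zip(mapped, ref) if m != cls and r == cls)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / k


# Toy example: cluster ids are arbitrary, and one document is misplaced.
ref = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred = [2, 2, 2, 0, 0, 1, 1, 1, 1]
mapping = align_clusters(pred, ref, k=3)
score = macro_f1(pred, ref, k=3)
```

Because the mapping is chosen before scoring, the result depends only on how documents are grouped, not on which integer each cluster happens to receive.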
4.8. Temporal Trend Analysis Findings
The utility of the system as a dynamic evidence stream is confirmed by the temporal analysis in Table 3, which tracks the relative frequency trajectories of clinical topics over a synthetic three-year period. Examining the percentage change column reveals distinct patterns across domains that demonstrate the framework’s monitoring capabilities.
A critical risk signal emerges from Cluster 2 (Surgical Planning and Prosthetic Complications), which exhibits the steepest positive trajectory with a substantial increase in documentation frequency. This sharp upward trend, standing well above the average cluster change, illustrates the framework’s capacity to automatically flag emerging clinical safety signals through longitudinal monitoring. Other notable trajectories include Cluster 1 (Oncology Markers), whose growth potentially reflects increased cancer screening activities, and Cluster 6 (Advanced Cryptography), whose increase suggests growing attention to quantum-secured patient data. Conversely, Cluster 3 (Cardiology) shows a slight decline, while Cluster 7 (Surgical Follow-up) remains essentially stable; the corresponding percentage changes are listed in Table 3. The uniform normalized document counts (2500 per cluster) ensure these percentage changes reflect shifts in thematic emphasis rather than volume differences. Together, these trajectories point to the emergence of a high-priority clinical risk signal.
These findings collectively demonstrate that the Integrated Semantic Framework can automatically classify unstructured clinical notes into meaningful high-precision categories and generate dynamic longitudinal insights essential for resource management and auditability. Because timestamps are assigned synthetically, any temporal patterns described in this section should be viewed solely as illustrative examples. They serve only to demonstrate the framework’s monitoring functionality. These trajectories are not evidence of causal effects or actual clinical change; instead, they demonstrate the system’s capacity to detect and display shifts in thematic focus over time.
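The percentage-change computation behind such a trend column can be sketched as follows. This is a minimal illustration assuming that each document carries a year and a topic label and that change is measured between the first and last year on relative frequencies. The function names, topic labels, and toy records are hypothetical and do not reproduce the values in Table 3.

```python
from collections import Counter


def relative_frequencies(records):
    """records: iterable of (year, topic). Returns {year: {topic: share of that year's documents}}."""
    per_year = {}
    for year, topic in records:
        per_year.setdefault(year, Counter())[topic] += 1
    return {
        year: {topic: n / sum(counts.values()) for topic, n in counts.items()}
        for year, counts in per_year.items()
    }


def percentage_change(records, topic):
    """Percent change in a topic's relative frequency from the first to the last year."""
    freqs = relative_frequencies(records)
    first, last = min(freqs), max(freqs)
    start = freqs[first].get(topic, 0.0)
    end = freqs[last].get(topic, 0.0)
    if start == 0.0:
        raise ValueError("topic absent in the first year; change is undefined")
    return 100.0 * (end - start) / start


# Toy records with synthetic years: topic_a grows from 50% to 75% of documents.
records = (
    [(1, "topic_a")] * 2 + [(1, "topic_b")] * 2
    + [(3, "topic_a")] * 3 + [(3, "topic_b")] * 1
)
change_a = percentage_change(records, "topic_a")
change_b = percentage_change(records, "topic_b")
```

Working with shares rather than raw counts means a topic is flagged only when its emphasis shifts relative to the rest of the corpus.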
5. Discussion
The Digital Medical Catalog framework demonstrates the transformative potential of integrating natural language processing (NLP) and unsupervised learning to structure clinical narratives into a dynamic knowledge base. Our approach addresses a fundamental limitation in health informatics: the inability of traditional statistical methods to extract computable, semantically rich insights without sacrificing clinical fidelity. The temporal analyses presented in this work should be understood as a validation of system capability rather than as observational evidence of clinical evolution. In real-world deployments, meaningful temporal interpretation would require institution-specific timestamps, clinical context, and governance oversight, which are beyond the scope of the present study. By moving from token co-occurrence to contextual embeddings, the system reveals meaningful patterns that are structurally validated and actionable [32,33]. Specifically, the framework’s capacity to detect subtle semantic nuances in clinical text leads to a more precise and holistic understanding of healthcare data, which is essential for structured analysis and evaluation of such data.
5.1. Interpretation of Principal Findings and Performance
The findings clearly demonstrate that it is essential to redirect analytical attention away from statistical NLP toward Semantic Feature Engineering [
32].
5.1.1. Quantified Validation
Our Hybrid Semantic Framework (BioClinicalBERT → K-Means) achieved strong structural and thematic consistency (F1-Score of 0.877), a dramatic improvement over the statistical baseline (F1-Score ≈ 0.100). The high Topic Coherence confirms the semantic soundness of the eight discovered clusters, supporting the interpretability of the outputs by domain experts. This strong quantitative performance validates the decision to use advanced semantic embeddings, which help resolve the ambiguity and jargon challenges inherent in clinical documentation.
5.1.2. Auditable Grounding
A key achievement is the implementation of the Auditable Grounding approach. This method uses the stability of the TF-IDF keyword patterns to constrain the output of LLM-generated labels. As demonstrated in our testing (Scenario 4), this constraint mitigates hallucination and ensures that every final topic title and derived insight is traceable to verifiable evidence within the raw data, a prerequisite for deploying the system in regulated environments. This mechanism provides confidence-aligned traceability that supports governance and audit requirements, subject to institutional validation and oversight.
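The idea of checking a generated label against statistical keyword evidence can be sketched as a post-hoc filter. The example below is an illustration under stated assumptions, not the constraint mechanism used in this work: it uses plain TF-IDF rather than c-TF-IDF, a simple alphabetic tokenizer, and a hypothetical acceptance rule that at least half of the label's tokens must appear among the cluster's top keywords. It also assumes every cluster document is contained in the full corpus.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphabetic tokens only (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def top_keywords(cluster_docs, all_docs, n=4):
    """Rank a cluster's terms by TF-IDF; assumes cluster_docs is a subset of all_docs."""
    doc_token_sets = [set(tokenize(doc)) for doc in all_docs]
    tf = Counter(tok for doc in cluster_docs for tok in tokenize(doc))
    total = sum(tf.values())
    scores = {}
    for term, count in tf.items():
        df = sum(1 for toks in doc_token_sets if term in toks)
        scores[term] = (count / total) * math.log(len(all_docs) / df)
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:n]


def is_grounded(label, keywords, min_overlap=0.5):
    """Accept a label only if enough of its tokens appear among the keywords.

    The 0.5 threshold is a hypothetical choice for illustration.
    """
    label_tokens = set(tokenize(label))
    if not label_tokens:
        return False
    hits = label_tokens & set(keywords)
    return len(hits) / len(label_tokens) >= min_overlap


# Toy corpus: one surgical cluster and three unrelated notes.
cluster = ["hip prosthesis loosening surgery", "prosthesis infection revision surgery"]
others = ["chest pain elevated troponin", "troponin rise chest pain", "tumor marker screening"]
keywords = top_keywords(cluster, cluster + others)
accepted = is_grounded("Prosthesis Surgery Complications", keywords)
rejected = is_grounded("Cardiac Arrhythmia", keywords)
```

A label that fails the check could be regenerated or routed to manual review, and the keyword list that justified an accepted label can be stored alongside it as audit evidence.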
5.1.3. Dynamic Insight Generation
The temporal analysis capability of the framework moves the system beyond static organization to active risk monitoring, as demonstrated in the temporal trend analysis (Section 4.8). By tracking percentage changes across the synthetic three-year span, the framework successfully identifies illustrative temporal shifts in clinical documentation patterns.
Because timestamps are synthetically assigned, these patterns demonstrate the functionality of the temporal analytics module rather than real-world clinical or institutional evolution. Indeed, as detailed in Table 3, notable trajectories include a substantial increase in Cluster 2 (Surgical Planning and Prosthetic Complications), signaling growing attention to surgical risks and prosthetic outcomes. Other clusters exhibit different patterns: Cluster 0 (Clinical Data Minimization and Access Auditing) shows stable but continued growth, while Cluster 3 (Cardiology) exhibits a slight decline.
5.2. Implications for Clinical Practice and Distributed Research
The practical implications of this framework directly support the transition toward decentralized, privacy-preserving healthcare.
For clinical practice, the catalog provides a mechanism for rapid retrieval of similar clinical cases based on contextual meaning rather than keyword matching, offering evidence-based decision support at the point of care. For researchers and administrators, the system functions as a dynamic surveillance tool. Automated detection of increasing complication rates, such as the increase observed in the prosthetic care group, can trigger immediate reviews of surgical protocols or device efficacy, allowing proactive healthcare management.
Crucially, the high-fidelity F1-Score and Auditable Grounding allow the system to operate as a reliable metadata generator for advanced research initiatives, fulfilling a major requirement for secure, multi-institutional research collaboration. By transforming unstructured notes into structured, tagged data, the framework provides auditable metadata that may enable future integration with Federated Learning models for multi-institutional research, which is not evaluated in the scope of this study.
5.3. Limitations
Although the Digital Medical Catalog performed well in the eight-cluster configuration, several limitations emerged that directly affect the reliability and generalizability of the findings.
5.3.1. Imbalanced Cluster Structure in the Eight-Cluster Model
The eight clusters derived from the dataset were highly unbalanced, with sizes ranging from as few as three notes to more than twenty. Clusters with very small membership were sensitive to noise and produced unstable topic boundaries, limiting the interpretability of rare clinical themes. These imbalances also magnify fluctuations in temporal trend analysis, making conclusions about smaller clusters less reliable.
5.3.2. Embedding Bias and Topic Drift Across the Eight Clusters
Although clinical sentence embeddings improved cohesion within the eight clusters, they occasionally overemphasized lexical overlap, causing certain clusters, such as oncology-related themes, to absorb notes with only superficial biochemical similarity. This represents a concrete failure mode in which embedding bias influences the eight-cluster assignments, introducing topic drift and reducing interpretability in borderline cases.
Collectively, these limitations highlight the need for larger and more diverse datasets, multi-label or hierarchical extensions of the eight-cluster model, and domain-adaptive embedding strategies to improve robustness and real-world applicability.
6. Conclusions
The Digital Medical Catalog framework presented in this work successfully addresses a critical and persistent challenge in modern healthcare: the efficient transformation of unstructured clinical narratives into a structured, computable knowledge base without sacrificing the essential semantic nuance required for high-quality patient care. By integrating Semantic Feature Engineering (Bio-ClinicalBERT) with auditable statistical methods, the system moves past the limitations of traditional statistical NLP, which proved inadequate for clinical complexity (F1-Score ≈ 0.100).
The core achievement of our framework is demonstrated by its high-fidelity classification capability, which achieves an empirically validated F1-Score of 0.877 and a strong Topic Coherence across eight clinically relevant topics. This robust performance enables systematic organization of complex documentation and facilitates the discovery of longitudinal insights, such as the clear surge in documentation related to Surgical Planning and Prosthetic Complications.
Crucially, the Hybrid Labeling and Auditable Grounding approach transforms the system into a compliance-grade solution. By constraining LLM-generated topic titles with verifiable TF-IDF patterns, the framework mitigates hallucination risk and provides the necessary traceability for audit logs in sensitive, distributed healthcare environments.
Our current findings show significant strength in processing discrete clinical concepts. However, limitations remain when handling the intricate, multi-faceted narratives typical of patients with comorbidities, where single-label clustering proves an oversimplification. These challenges illuminate a clear path for future work:
Privacy-Preserving Architecture—Development of a Federated Learning architecture for secure, multi-institutional model collaboration, strengthened by advanced cryptographic techniques to ensure HIPAA and GDPR compliance.
Multifaceted Classification—Implementation of multi-label and hierarchical clustering approaches to accurately capture the complexity of patient records and correlation between clinical themes.
Multimodal Evolution—Expanding the framework into a comprehensive clinical intelligence platform by incorporating other data modalities, such as medical imaging reports and structured lab data.
This work establishes that context-aware computational approaches to clinical text analysis need not force a binary choice between analytical rigor and clinical fidelity. With careful and context-aware design, our framework paves the way for AI systems that serve as genuine, auditable partners in the mission to improve healthcare delivery.