Article

KOSMOS: Ontology-Based Knowledge Graph Scaffolding for Medical Documentation Generation

Department of Computer Science, The University of Alabama, Tuscaloosa, AL 35487, USA
*
Author to whom correspondence should be addressed.
Information 2026, 17(4), 355; https://doi.org/10.3390/info17040355
Submission received: 5 March 2026 / Revised: 28 March 2026 / Accepted: 1 April 2026 / Published: 8 April 2026

Abstract

We investigate whether an ontology-typed knowledge graph (KG) can improve SOAP note generation from clinician–patient encounter transcripts by serving as a structured intermediate representation that organizes clinically salient content while preserving provenance. We introduce Knowledge graph Ontology Supported Medical Output System (KOSMOS), which extracts typed clinical entities with attributes and relationships, grounds entities to UMLS concepts and a schema, and retains links to supporting transcript turns. The resulting graph is provided as context for large language model (LLM)-based SOAP generation either alone (KG-only) or combined with the original transcript (Transcript + Nodes, Transcript + KG). We evaluate these conditions against DocLens and Ambient Clinical Intelligence Benchmark (ACI-BENCH) baselines using benchmark, claim-level, and citation analyses. Across all three test sets, transcript-inclusive KOSMOS variants achieve the highest raw scores, numerically exceeding the transcript-only baselines. Claim-level evaluation shows modest, non-significant recall gains for Transcript + Nodes and low hallucination under transcript-conditioned GPT-5.2, while citation analysis shows about a 3% accuracy gain for KOSMOS (Transcript + KG) over DocLens GPT-5.2. Overall, ontology-guided KG structure appears most beneficial as a complementary scaffold paired with transcript access, while relationships provide limited additional benefit under current extraction quality.

1. Introduction

1.1. Motivation and Background

Generative artificial intelligence (AI) is increasingly used to automate document creation and transformation in both general-purpose and workflow-specific settings. Controlled studies show substantial time savings with maintained or improved quality on writing tasks [1]. Many fields employ specialized workers who spend substantial time on documentation rather than applying their core expertise. Across settings, users expect outputs to be accurate and useful. However, in high-stakes domains such as medicine and law, the consequences of errors are severe enough to demand stricter standards for correctness, traceability, and auditability. Satisfying these standards depends on appropriate constraints during generation. Without such constraints, modern models can produce text that is coherent yet not supported by the source material, and evaluations of leading systems in high-stakes applications have reported nontrivial rates of hallucinations and omissions [2]. Reliability mechanisms are therefore central to practical deployment in these domains. Prior work highlights complementary strategies, including evidence-grounded generation with explicit citations [3], generation conditioned on intermediate representations [4], and automated self-checking and verification [5]. In settings where a single unsupported statement can cause harm, these techniques are not optional refinements but core design constraints.
Healthcare presents an especially consequential case because clinical documentation simultaneously drives administrative burden and clinical decision-making. Physicians report high rates of burnout, and large surveys consistently identify administrative work associated with electronic health records (EHRs), including after-hours documentation, as a major contributor [6,7,8]. Time-motion studies and EHR log analyses indicate that clinicians devote a substantial share of their workday to documentation, order entry, and inbox management, often rivaling or exceeding time spent in direct patient care [9,10]. This burden erodes time available for patient interaction, contributes to fatigue, and can degrade cognitive performance in ways that increase the likelihood of error. Meanwhile, clinical documentation such as Subjective, Objective, Assessment, and Plan (SOAP) notes coordinates care across teams, supports billing, and records the rationale for decisions. Accuracy is essential because omissions or misstatements can propagate to downstream care, follow-up actions, and medication management [11,12]. These conditions make transcript-to-SOAP generation a promising target for AI assistance, but only if the system can remain faithful to encounter evidence and make that faithfulness easy for clinicians to verify.
Despite the growing use of generative AI for document work, producing high-quality clinical documentation from encounter transcripts remains difficult and current systems still exhibit both omission and hallucination errors [13,14,15]. Clinical conversations are not structured narratives. They include interruptions, false starts, corrections, and informal phrasing [13,15]. Key facts are often distributed across many turns rather than stated once in a clean summary form, which increases the risk that important details are missed or inconsistently carried forward [14,16]. Speakers also rely heavily on context through pronouns, shorthand references, and implied causality, which makes it easy for a model to misattribute who experienced a symptom, who recommended an action, or when a medication change occurred. These properties make transcript-to-SOAP generation a challenging setting for faithful summarization because success requires high recall and correct attribution while avoiding unsupported additions, not just fluent text [14,15].
A second challenge is verification [3,17,18]. Even when a generated note is fluent and clinically plausible, clinicians must be able to quickly determine which statements are supported by the encounter and which clinically important details were not carried forward [3,17,18]. This review problem is distinct from the inherent difficulty of processing dialogue. It stems from the mismatch between a long, messy evidence source and a compact artifact that must be trusted like a SOAP note. The challenge is amplified by SOAP structure because verification requires not only factual correctness but also correct provenance and placement, for example, ensuring that patient-reported symptoms remain in Subjective while clinician interpretations and instructions appear in Assessment and Plan, and that concrete items such as medication regimens, test values, and follow-up instructions are traceable [14,19]. To improve auditability, some systems require evidence-grounded outputs in which each sentence is linked to supporting transcript turns, with DocLens providing a representative citation-based grounding and evaluation framework [3].
Despite rapid progress in dialogue-to-clinical-note generation, transcript-to-SOAP systems still fall short of the reliability required for widespread clinical use [12]. Existing transcript-only approaches continue to exhibit omissions, unsupported additions, and traceability challenges, and improving auditability through turn-level provenance can help clinicians review outputs but does not by itself resolve these underlying quality failures [3,14,15,20]. However, prior work has not clearly established whether an encounter-specific, ontology-grounded knowledge graph derived from the dialogue itself can improve transcript-to-SOAP generation when used as intermediate context. Such structure could plausibly help by consolidating details scattered across long conversations, standardizing vocabulary, and surfacing clinically salient facts in a form that is easier for the model to access and organize consistently. This makes structured, ontology-guided intermediate context a plausible direction to investigate for improving factual quality under both benchmark- and claim-level evaluation.

1.2. Related Work

Our work builds on three key areas: knowledge graphs (KGs) and ontology grounding for structuring clinical facts, context-aware KG construction from unstructured dialogue, and medical document generation from clinical conversations. Much of the related literature either generates notes directly from transcripts or uses curated knowledge resources and KGs as external references for retrieval, normalization, or evaluation. In contrast, our setting examines whether an encounter-specific, ontology-aligned representation can replace or augment the transcript as intermediate context for SOAP generation.

1.2.1. Ontology-Grounded Clinical Knowledge Graphs for Generation and Factuality

Knowledge graphs represent information as entities linked by typed relationships and can also encode structured attributes such as time, measurement values, or modifiers [21]. In clinical settings, KGs are commonly paired with schemas and ontologies to normalize terminology and constrain semantically plausible entity types and relations [21]. Aligning mentions to standardized terminologies such as Systematized Nomenclature of Medicine—Clinical Terms (SNOMED CT), RxNorm, and Logical Observation Identifiers Names and Codes (LOINC) supports interoperability and reduces variation in how the same concept is expressed across speakers and notes [22,23,24,25]. In practice, these resources are often accessed through the Unified Medical Language System (UMLS), which integrates biomedical vocabularies and supports mapping from surface text to normalized concept identifiers, improving downstream consistency by linking colloquial phrasing such as “can’t catch my breath” to stable clinical concepts like dyspnea [26,27].
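The surface-form-to-concept normalization described above can be illustrated with a toy lookup; a real system would query the UMLS Metathesaurus rather than a hard-coded table, and the concept unique identifiers (CUIs) shown are illustrative and should be verified against an actual UMLS release.

```python
# Toy illustration of normalizing colloquial phrasing to stable
# clinical concepts. A production system would resolve mentions
# against UMLS; this table and its CUIs are for illustration only.
SURFACE_TO_CONCEPT = {
    "can't catch my breath": ("C0013404", "Dyspnea"),
    "short of breath":       ("C0013404", "Dyspnea"),
    "sugar is high":         ("C0020456", "Hyperglycemia"),
}

def normalize(mention):
    """Return (CUI, preferred name) or None when no confident mapping
    exists -- conservative behavior that avoids unsupported specificity."""
    key = mention.lower().strip()
    return SURFACE_TO_CONCEPT.get(key)
```

Returning `None` on a miss mirrors the conservative normalization criteria discussed above: an unmapped mention is safer than a confidently wrong one.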
Ontology grounding also introduces clinically meaningful failure modes that matter for downstream generation. Conversational language is frequently underspecified, and mapping a vague mention to a fine-grained concept can add unsupported specificity. Large-scale mapping work highlights the difficulty of choosing the correct level of specificity among semantically close SNOMED CT options when evidence is incomplete [28]. Reliability-focused evaluations therefore treat unsupported added detail as a distinct error type and motivate conservative normalization criteria when grounding dialogue into standardized concepts [18]. These tradeoffs motivate careful use of ontology-aligned structure, as normalization can improve consistency and organization, but overly aggressive grounding can surface as confident yet incorrect statements in the final note.
Beyond normalization, structured medical knowledge has also been used as a control signal for generation, either by retrieving context from curated global KGs or by constructing encounter-specific representations. A common design pattern retrieves biomedical concepts and relations from global resources grounded in UMLS or SNOMED CT and uses them as additional context for generation or explanation [26,29]. For example, Knowledge Graph-based Retrieval Augmented Generation (KG-RAG) retrieves prompt-aware biomedical context from a curated KG to encourage grounded outputs with traceable support [29], while related work conditions lab interpretation generation on structured context linking conditions, lab results, demographics, and reference ranges [30]. These approaches are most useful when the output requires knowledge beyond the encounter, but they can also introduce plausible background information that is not patient-specific.
Other work uses KGs to evaluate factual accuracy rather than generate text. For example, in radiology report generation, researchers extract entity-relation graphs from both generated and reference reports, then measure factual accuracy by comparing how well the findings, anatomy, and modifiers align [31,32,33]. Larger resources such as RadGraph-XL extend this paradigm across modalities and body regions, providing broader supervision and more standardized factual comparisons [34]. Together, this literature motivates our focus on encounter-specific, ontology-aligned intermediate context, since the transcript already contains patient evidence but can be difficult to exploit reliably without explicit consolidation, typing, and conservative grounding.

1.2.2. Context-Aware KG Construction from Text and Dialogue

Constructing a KG from unstructured text typically involves mention extraction, entity linking or disambiguation, and relation extraction [35]. Dialogue increases difficulty because evidence is distributed across turns and expressed using pronouns, shorthand, and ellipsis, motivating decontextualization that rewrites utterances into more self-contained statements before extraction [36,37]. Pipelines also consolidate repeated mentions into a stable set of entities and relations for downstream reuse.
KG construction methods increasingly move beyond sentence-local tuples toward context-aware pipelines that consolidate entities and relations across longer spans of text. Early schema-free extraction approaches such as OpenIE produce local triples that can fragment into redundant nodes and miss cross-sentence relations [38]. More recent systems use larger context windows and LLM-based extraction, then apply clustering and consolidation to improve coherence and reduce redundancy [35,39]. These approaches also differ in their design goals: some prioritize predicate-typed semantics for explicit reasoning, while others build graphs primarily as retrieval scaffolds, where edges encode association strength or free-text relation descriptions rather than a closed set of typed predicates [40].
Dialogue-focused work further shows that cross-turn reasoning is essential because many relations are expressed indirectly and require resolving anaphora and references to earlier turns. Systems that explicitly link utterances and entities across turns can improve relation decisions, but performance degrades under noisy transcription, imperfect diarization, or implicit clinical reasoning that is not stated verbatim [41]. Ontology-guided constraints provide a complementary mechanism by enforcing valid types and concept inventories, yet relation inference remains challenging when support is dispersed or depends on clinician interpretation rather than explicit wording [42]. Overall, the literature suggests that usable graphs require cross-turn context and consolidation, but robustness, provenance fidelity, and relation accuracy remain open challenges.

1.2.3. Transcript-to-SOAP Generation, Ambient Scribes, and Evaluation

Clinical note generation from clinician–patient dialogue has been studied using supervised summarization models and prompting-based approaches, including section-wise generation that aligns outputs with the Subjective, Objective, Assessment, and Plan structure [14,15]. While sectioned generation can improve organization, transcript-to-SOAP systems still exhibit common failure modes including omissions, incorrect attribution, and plausible but unsupported additions, reflecting the interrupt-driven and distributed nature of real encounters [16]. These shortcomings motivate intermediate-structure approaches that make patient-specific entities and relations explicit, rather than relying on the model to maintain structure implicitly across long dialogue context.
Ambient scribe deployments reinforce that factual coverage and correctness remain practical bottlenecks. Modern systems generate fluent drafts and often provide evidence links to support review, yet independent evaluations show mixed results on efficiency and persistent inaccuracies that require clinician editing [43,44]. Cross-system simulated evaluations also report substantial variation in accuracy and safety, including omitted or incorrectly recorded clinical elements with nontrivial harm potential [45]. Together, these findings indicate that transcript-to-SOAP systems need better preservation of visit-specific facts and stronger control of unsupported additions to minimize the need for downstream review.

1.3. Our Approach and Contributions

We introduce Knowledge graph Ontology Supported Medical Output System (KOSMOS), an approach that uses an ontology-typed KG as an intermediate representation between the encounter transcript and the SOAP note. To the best of our knowledge, this is the first attempt to use encounter-specific KGs in this way for clinical documentation. The transcript provides the most faithful record of what was said, but it is an inefficient format for organizing entities and clinical details across a long note. The KG functions as a structured control layer that organizes encounter content into a machine-actionable form for guiding generation.
The graph represents the encounter as a set of typed nodes with attributes, together with relationships between them. Nodes correspond to clinically meaningful entities such as problems and symptoms, medications, laboratory tests, procedures, activities, and people involved in the encounter. Each node carries structured attributes that capture details often scattered across turns in conversation, including medication dosage and frequency, laboratory values and units, and coarse temporal status such as prior history, current encounter, or planned future events. Relationships connect nodes when the encounter implies or directly states a clinically meaningful linkage, such as a medication being prescribed by the clinician, a symptom being reported by the patient, a laboratory value being associated with a specific test, or a condition being assessed as part of the clinician’s plan.
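As a concrete illustration, this node-and-relationship structure can be sketched with simple Python dataclasses. The field names, node classes, and example values below are hypothetical stand-ins, not the authors' actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    node_class: str                       # e.g. "Medication", "LabTest", "Problem"
    name: str                             # canonical, ontology-normalized name
    attributes: dict = field(default_factory=dict)   # dose, frequency, value, units, ...
    temporal_status: str = "current_encounter"       # or "prior_history" / "planned"
    supporting_turns: list = field(default_factory=list)  # provenance: transcript turn IDs

@dataclass
class Relationship:
    source: str                           # node_id of the source node
    target: str                           # node_id of the target node
    rel_type: str                         # e.g. "prescribed", "reported", "has_value"
    supporting_turns: list = field(default_factory=list)

# Hypothetical example: a prescribed medication with scattered details
# (dose, frequency) consolidated onto one node, plus a typed edge.
med = Node("n3", "Medication", "lisinopril",
           attributes={"dose": "10 mg", "frequency": "daily"},
           supporting_turns=[12, 14])
rel = Relationship("n1", "n3", "prescribed", supporting_turns=[14])
```

Keeping `supporting_turns` on every node and edge is what makes the provenance links described below possible.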
We ground graph nodes in established clinical ontologies by normalizing entity names to standardized terms from UMLS [26]. When the LLM generates an entity from transcript mentions, we map that entity to its corresponding UMLS concept to ensure consistent, recognizable clinical naming. This ontology grounding anchors entities to concepts that are widely recognized in clinical databases and documentation systems. The graph structure itself is governed by a schema that defines which node classes, attributes, and relationship types are permitted, providing structural constraints that complement the ontological grounding of entity names. Together, these mechanisms support downstream processing by improving interoperability and reducing ambiguity in entity references across the pipeline.
Graph construction preserves provenance by retaining explicit links from each node, attribute, and relationship to the transcript turns that support it. This provenance enables evidence-grounded generation in which note statements can be traced back to the encounter record, and it supports evaluation procedures that assess whether generated content is supported by the underlying conversation. Figure 1 summarizes the KOSMOS pipeline.
Within this landscape, we formulate transcript-to-SOAP generation as a conditional generation problem with an explicit intermediate structure. Let T denote a clinician–patient transcript and let g(T) = G be an encounter-derived knowledge graph G = (V, E) whose nodes are assigned ontology types and whose edges encode typed clinical relations. Given a context representation C, a generator f_θ produces a SOAP note Y = f_θ(C). In this formulation, the DocLens prompting strategy corresponds to the transcript-only baseline C = T, where the model must infer and maintain clinical structure implicitly from raw dialogue [3]. Our contribution is to test how replacing or augmenting this baseline context with an ontology-typed encounter graph changes the factual behavior of f_θ and to quantify the tradeoff between information loss, structural bias, and robustness to graph construction errors.
We compare three KG context conditions against the DocLens transcript-only baseline (C = T): KG only (C = G), transcript plus KG nodes (C = T ∪ V), and transcript plus full KG (C = T ∪ G). We evaluate the baseline using both an older model matching the original DocLens configuration and a modern model, while all KG conditions use only the modern model. We assess performance using the DocLens metrics and the established ACI Benchmark evaluation framework [3,20].
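Under these definitions, assembling the context C for each experimental condition can be sketched as a simple dispatch. The `serialize()` and `serialize_nodes()` helpers are hypothetical stand-ins for whatever graph serialization the pipeline actually uses.

```python
def build_context(condition, transcript, kg):
    """Assemble the generator context C for one experimental condition.

    kg is assumed to expose serialize() (full graph: nodes + edges)
    and serialize_nodes() (node summaries only) -- illustrative names.
    """
    if condition == "T":        # DocLens transcript-only baseline
        return transcript
    if condition == "G":        # KG only
        return kg.serialize()
    if condition == "T+V":      # transcript plus KG node summaries
        return transcript + "\n\n" + kg.serialize_nodes()
    if condition == "T+G":      # transcript plus full KG
        return transcript + "\n\n" + kg.serialize()
    raise ValueError(f"unknown condition: {condition}")
```

The generator f_θ then receives `build_context(...)` as its sole conditioning input, which keeps the four conditions directly comparable.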

1.3.1. Research Question

To what extent does using an ontology-typed encounter knowledge graph as intermediate context improve alignment between generated SOAP notes and expert-written notes in terms of surface-form similarity, semantic content fidelity, and claim-level factual support with traceable provenance?

1.3.2. Purpose

This paper investigates whether an encounter-specific, ontology-guided knowledge graph can serve as a useful intermediate context for transcript-to-SOAP generation. We evaluate whether KG-based conditioning is associated with better alignment with expert-written SOAP notes, improved evidence traceability, and lower unsupported content relative to transcript-only prompting, using ACI-BENCH automatic metrics and DocLens-style claim- and citation-level evaluation.

1.3.3. Hypotheses

Let A(C) denote the ACI-BENCH Average score under context condition C, R(C) the DocLens claim recall, P(C) the DocLens claim precision, H(C) the DocLens hallucination rate, and Q(C) the citation accuracy. We use T for transcript-only context, G for full KG-only context, V for KG nodes only, T ∪ V for transcript plus nodes, and T ∪ G for transcript plus the full KG. We test the following hypotheses.
Hypothesis 1a (H1a).
Relative to transcript-only prompting, KG-only context will reduce unsupported content:
H(G) < H(T), P(G) > P(T).
Hypothesis 1b (H1b).
Relative to transcript-only prompting, KG-only context will reduce completeness and benchmark alignment:
R(G) < R(T), A(G) < A(T).
Hypothesis 2a (H2a).
Transcript plus full KG will yield the strongest benchmark-level performance:
A(T ∪ G) > A(T), A(T ∪ G) > A(G).
Hypothesis 2b (H2b).
Transcript plus full KG will yield the strongest claim-level factual tradeoff, improving coverage and provenance without reducing precision:
R(T ∪ G) > R(T), Q(T ∪ G) > Q(T),
and
P(T ∪ G) ≥ P(T), H(T ∪ G) ≤ H(T).
Hypothesis 3a (H3a).
Transcript plus KG nodes will yield intermediate benchmark-level performance between transcript-only and transcript plus full KG:
A(T) ≤ A(T ∪ V) ≤ A(T ∪ G).
Hypothesis 3b (H3b).
Transcript plus KG nodes will yield intermediate claim-level and provenance performance between transcript-only and transcript plus full KG:
R(T) ≤ R(T ∪ V) ≤ R(T ∪ G),
P(T) ≤ P(T ∪ V) ≤ P(T ∪ G),
Q(T) ≤ Q(T ∪ V) ≤ Q(T ∪ G),
and
H(T) ≥ H(T ∪ V) ≥ H(T ∪ G).
    The remainder of this paper is organized as follows. Section 2 describes the KOSMOS pipeline, including ontology-typed KG construction, provenance tracking, and the SOAP generation setups and baselines used in our evaluation. Section 3 reports quantitative and qualitative findings across the transcript-only and knowledge-graph context conditions. Section 4 interprets the results, highlights tradeoffs and failure modes, and situates the findings within prior work on medical documentation generation. Section 5 summarizes the main contributions and outlines directions for future work.

2. Materials and Methods

We developed an end-to-end system that generates SOAP clinical notes directly from encounter transcripts. The method combines an ontology-guided KG representation of the conversation with LLM generation of the SOAP sections. The KG serves as the structured intermediate layer that organizes transcript evidence into typed clinical entities, attributes, and relationships, which are then used to produce the Subjective, Objective, and Assessment and Plan components of the note. This section first describes the KG construction pipeline, beginning with mention extraction and entity typing, followed by mention consolidation into canonical entities, ontology-based node classification and attribute assignment, and relationship identification. We then describe how the resulting KG is used to generate each SOAP section separately, including the division of the note into three generation stages to improve control and consistency.

2.1. System Overview

Our approach generates a SOAP clinical note from an encounter transcript by constructing an ontology-based KG and then using the KG, optionally combined with the transcript, to produce a single-pass SOAP note. The system takes as input a transcript segmented into ordered turns and produces two primary artifacts. The first is a structured KG containing typed nodes with attributes and semantic edges. Figure 2 illustrates a simple encounter KG where typed entities such as patients, conditions, tests, and medications are connected through clinically meaningful relationships like diagnosed and ordered_test. The second is a SOAP note whose statements include explicit provenance links to their supporting inputs.
All LLM-based stages in the pipeline used GPT-5.2-2025-12-11, including transcript preprocessing, mention extraction and typing, candidate normalization, attribute completion, relation labeling, and SOAP note generation. Generation settings were held fixed across conditions to ensure comparability. We set the temperature to 0.0 and the maximum output length to 128,000 tokens. All prompts, configurations, and code needed to reproduce the pipeline are released in our public repository on GitHub at https://github.com/ryanwaynehenry/KOSMOS (accessed on 15 January 2026).
Figure 3 summarizes the KG construction workflow. Each stage box shows the operation performed at the top and the intermediate artifact created or updated at the bottom. Starting from the transcript, KOSMOS first extracts span-level mentions and assigns each mention a coarse entity type such as person, medication, or lab test. These typed mentions are then consolidated into contextual candidates that group occurrences based on clinical context and documentation requirements rather than abstract conceptual identity. For example, dose and frequency mentions are grouped with the specific medication regimen instance they describe, while measurement values are grouped with the corresponding laboratory test. Each candidate is then passed through UMLS-based ontology grounding to map the group to a normalized concept when possible. Finally, grounded concepts are instantiated as KG nodes with class-consistent attributes, yielding typed nodes whose fields are populated from the mentions grouped under that node.
After node creation, the system identifies candidate relationships between nodes using transcript structure. Candidate node pairs are proposed when the supporting mentions for two nodes occur within one neighboring turn of each other, reflecting local discourse adjacency. In addition, the patient node and the clinician node are treated as potential hubs. The system considers potential relationships between these hub nodes and every other node in the graph to capture global patterns that may not be localized to adjacent turns. For each candidate pair, the system selects a typed relationship using an LLM under an explicit constraint set, including “no relationship”. The allowable relationship labels are determined by the ontology classes of the two nodes, which limits outputs to clinically plausible relation types for the involved concepts.
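The local-adjacency and hub heuristics described above can be sketched as follows, assuming each node is represented only by its set of supporting turn IDs (a simplification of the actual pipeline, which carries richer node objects):

```python
def propose_candidate_pairs(nodes, hub_ids):
    """Propose candidate node pairs for relation labeling.

    nodes:   dict mapping node_id -> set of supporting transcript turn IDs.
    hub_ids: node IDs treated as hubs (e.g. the patient and clinician),
             which are paired with every other node.

    Non-hub pairs qualify when any two supporting turns lie within one
    neighboring turn of each other (local discourse adjacency).
    """
    ids = sorted(nodes)
    pairs = set()
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if a in hub_ids or b in hub_ids:
                pairs.add((a, b))                       # hub: always a candidate
            elif any(abs(ta - tb) <= 1
                     for ta in nodes[a] for tb in nodes[b]):
                pairs.add((a, b))                       # adjacent-turn evidence
    return pairs
```

Each proposed pair would then be passed to the LLM relation labeler, whose allowable labels (including "no relationship") are constrained by the ontology classes of the two nodes.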
Finally, the system generates the SOAP note using the completed KG as structured input, optionally combined with the transcript itself. We generate the note in one pass using two-shot prompting under three context conditions. The first uses KG only, the second uses KG plus transcript, and the third uses KG node summaries plus transcript. In all settings, the generated note includes explicit provenance for each statement. When transcript context is provided, statements are annotated with the supporting transcript turns. When only KG context is provided, statements are annotated with the supporting node and relationship identifiers.

2.2. Transcript Preprocessing and Turn Segmentation

The pipeline begins by converting each raw encounter transcript into a structured, turn-based representation that can be used consistently across downstream extraction and graph construction. Transcripts are first normalized to a consistent speaker prefixed format so clinician and patient identifiers follow a shared convention. The transcript is then segmented into an ordered sequence of turns by merging consecutive lines from the same speaker, removing empty lines, and assigning sequential turn IDs to preserve encounter order.
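A minimal sketch of this segmentation step, assuming lines have already been normalized to a "SPEAKER: text" convention (the dictionary layout is illustrative, not the pipeline's actual data model):

```python
def segment_turns(raw_lines):
    """Merge consecutive lines from the same speaker into ordered turns,
    drop empty lines, and assign sequential turn IDs."""
    turns = []
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue                                  # remove empty lines
        speaker, _, text = line.partition(":")
        speaker, text = speaker.strip(), text.strip()
        if turns and turns[-1]["speaker"] == speaker:
            turns[-1]["text"] += " " + text           # same-speaker continuation
        else:
            turns.append({"speaker": speaker, "text": text})
    for i, turn in enumerate(turns):
        turn["turn_id"] = i                           # preserve encounter order
    return turns
```

The resulting turn IDs are what every later stage cites as provenance, so they must be assigned once here and never renumbered downstream.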
Because clinical conversations frequently rely on pronouns and deictic references, we include a pronoun-resolution pass to improve grounding and reduce ambiguity in subsequent mention extraction and relation inference. In this step, an LLM rewrites each turn by replacing pronouns with explicit noun phrases when the referent is reasonably clear from context. This rewrite is constrained to preserve the original structure, including the number and order of turns and the speaker label for each turn. The rewrite targets pronouns referring to both participants and clinical concepts, such as conditions, medications, tests, results, and plans. When a referent is ambiguous, the original pronoun is preserved to avoid introducing unsupported assumptions. This preprocessing produces a turn sequence that remains faithful to the transcript while making core referents more explicit for downstream span extraction and graph construction.

2.3. Mention Extraction and Typing

After preprocessing, the system identifies clinically relevant entity mentions in the transcript and assigns each mention a coarse semantic type from a predefined set of options. We perform both tasks using an LLM-based extraction procedure because, in our preliminary testing, it provided substantially better mention coverage than an alternative based on the scispaCy en_core_sci_lg NER model. This extraction stage prioritizes comprehensive capture of all clinically relevant information. Every reference is retained, including repeated mentions across turns, negated findings, and routine contextual spans that later support normalization and relation inference.
Mentions are extracted as short, atomic spans that correspond to a single concept at a time. The extraction instructions emphasize minimal spans that still uniquely identify the entity and explicitly prohibit bundling multiple concepts into a single mention. For example, medication name, dose amount, and frequency are extracted as separate spans, and measurement labels and numeric values are extracted separately. Time expressions are extracted as their own spans to preserve temporal grounding without collapsing them into event phrases. This atomicity is important because later stages rely on recombining spans into structured entities with consistent attributes.
To maximize mention coverage and maintain faithfulness to the transcript, turns are processed in bounded batches. Preliminary testing showed that extracting mentions from smaller sections of the transcript and then merging the resulting lists yields more total mentions than running extraction over the full transcript in a single pass. Each extracted mention is tagged with the identifier of the turn in which it appears, preserving provenance and enabling subsequent grouping and relation-candidate selection based on discourse locality.
Each extracted mention is assigned exactly one entity type from a predefined set covering core clinical categories and supporting context. The type system includes problems, medications, laboratory tests, procedures, observed values, units, dose amounts, frequencies, time expressions, activities, people, and other. The typing step is constrained so that entity types must be selected from the provided list and each mention must be preserved without merging, splitting, or filtering. Spans that do not match a specific clinical category but remain contextually relevant are assigned a general fallback type of “other” rather than being removed.
Because many downstream relations are anchored on the participants, the system ensures explicit mentions corresponding to the patient and clinician are available to the KG construction stage. These participant mentions act as stable anchors for connecting symptoms, medications, tests, and care-plan items to the appropriate actor, even when the transcript primarily uses pronouns or implicit references. The output of this stage is a turn-aligned list of typed mention spans that preserves the original conversational structure while making clinical concepts explicit and machine-actionable for later consolidation into canonical entities.

2.4. Mention Grouping into Candidates

The typed mention list from the previous stage contains many redundant and partial references that must be consolidated into canonical entity candidates before node construction. We perform this consolidation using an LLM-driven normalization procedure that groups mentions referring to the same underlying concept into a single entity object. This stage is structurally exhaustive. Every input mention is assigned to exactly one entity, and no new mention spans are introduced.
To remain within model context limits while maintaining discourse coherence, mentions are processed in batches that respect transcript turn boundaries. Batches target a fixed size range, but splitting only occurs when the mention stream transitions to a new turn. This prevents a single turn’s mentions from being split across batches, which would otherwise increase the risk of incorrect partitioning for tightly coupled items such as medication instructions, test names with values, or short anaphoric references.
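The turn-boundary batching rule described above can be sketched as follows (a simplified illustration, not the released implementation):

```python
def batch_mentions(mentions, target_size=20):
    """Split a turn-ordered mention stream into batches of roughly
    `target_size` mentions, splitting only at transitions between turns
    so that one turn's mentions are never divided across batches."""
    batches, current = [], []
    for m in mentions:
        # Only start a new batch once the target is reached AND the
        # mention stream has moved on to a new turn.
        if (len(current) >= target_size
                and current[-1]["turn_id"] != m["turn_id"]):
            batches.append(current)
            current = []
        current.append(m)
    if current:
        batches.append(current)
    return batches
```

Under this rule a batch may exceed the target size when a single turn is long, which is the intended trade-off: discourse coherence over strict size limits.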
For each batch, the model receives the full preprocessed transcript for context along with the batch mention list containing identifiers, turn provenance, mention text, and entity type. The model outputs a set of entity candidates where each entity contains a canonical name, an entity type, the set of turn identifiers in which it is referenced, and the list of original mention objects assigned to it. The procedure enforces a strict one-to-one assignment constraint where every mention identifier must appear exactly once across all output entities. This ensures that later stages can trace all structured information back to explicit transcript spans.
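The one-to-one assignment constraint can be checked mechanically; a minimal validator might look like this (entity layout is an assumption for illustration):

```python
from collections import Counter

def check_one_to_one(input_mention_ids, entities):
    """Verify structural exhaustiveness: every input mention id is assigned
    to exactly one entity, with no duplicates and no invented mentions."""
    assigned = Counter(
        m_id for ent in entities for m_id in ent["mention_ids"])
    missing = set(input_mention_ids) - set(assigned)       # dropped mentions
    duplicated = {m for m, n in assigned.items() if n > 1}  # double-assigned
    invented = set(assigned) - set(input_mention_ids)       # new spans
    return not (missing or duplicated or invented)
```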
Grouping applies rule-based constraints to align entities with clinical documentation conventions and to support reliable attribute extraction later in the pipeline. Temporal expressions (dates, durations, time spans) are treated as modifiers rather than standalone concepts, so they are attached to the clinical concepts they qualify and entity names are never primarily temporal. Medication mentions are grouped by regimen instance rather than drug name alone, so the same drug can be separated into distinct groups for the patient’s current regimen versus a newly prescribed, updated, or previous regimen. Laboratory tests and measurements are grouped so that test names and their associated values form a single test entity, ensuring results remain anchored to the correct test concept.
The grouping step is designed to avoid both over-splitting and over-merging. For symptoms and exam findings, we reduce fragmentation by combining mentions only when they clearly refer to the same concept, such as descriptive variants that share the same anatomical site term, e.g., chest pain and chest discomfort. In contrast, we prohibit merges that would erase clinically meaningful distinctions. Diagnoses are kept separate from symptom complaints even when related, symptoms from different anatomical sites are never merged, and distinct symptom categories are not merged merely because they appear in the same turn. This separation is important for downstream SOAP generation because symptoms primarily reflect what the patient reports (subjective complaints), whereas diagnoses reflect the clinician’s assessment. If these are merged, the generator can blur reported experience with inferred or confirmed conditions, making it harder to place content correctly and to distinguish symptoms the patient actually endorses from symptoms that are merely plausible consequences of a diagnosis. Relatedness is represented through explicit relationships, allowing the note to express connections without collapsing patient-reported and clinician-assessed content into the same group.
After initial batching, we add a temporal context label for medication and laboratory test candidates to prevent invalid merges in the final consolidation step. Using the full preprocessed transcript as context, the system assigns each medication or lab test entity a coarse temporal label indicating whether it refers to prior history, the current encounter, a future planned event, or an uncertain timeframe. Because entity normalization is performed in batches, a final cross-batch consolidation step then identifies candidates that should be merged across batch boundaries. This step proposes merge sets over candidates from different batches while preserving the original mention assignments and preventing overlapping merges. Proposed merges are filtered to exclude those that violate core constraints, such as combining medications across regimen categories or laboratory tests with incompatible temporal contexts. The output is a globally consolidated set of entity candidates, each with a canonical name, an entity-level type, and a bundle of grounded mention spans that retain their individual types.

2.5. Ontology Grounding and Concept Normalization

After grouping mentions, entity candidates are grounded to standardized biomedical concepts where applicable to support ontology-aware node attributes and downstream constraints. We performed concept normalization against the UMLS Metathesaurus using a local MySQL deployment of UMLS tables populated from the MRCONSO source file [26]. MRCONSO serves as the primary lexical index in UMLS, linking surface strings (STR) to Concept Unique Identifiers (CUIs) and providing metadata such as the source vocabulary (SAB), term type (TTY), and language (LAT).

2.5.1. Database-Backed Normalization (Primary Lookup)

For each non-person entity candidate, the system first attempts a direct lexicon lookup by querying MRCONSO in the local MySQL instance. Candidates are normalized by matching the candidate string against MRCONSO strings and restricting retrieval to English, non-suppressed entries. The lookup is additionally constrained to clinically relevant source vocabularies used in this work, including SNOMED CT for problems and findings, RxNorm for medications, and LOINC for laboratory tests.
When multiple MRCONSO rows match or near-match a candidate, we rank candidates using conservative lexical similarity with light preference for terminology-preferred strings when available. The highest-scoring concept is selected if its similarity score exceeds a minimum threshold and is stored with its CUI and source vocabulary so that the graph node is tied to a standardized identifier rather than only a free-text label. Candidates whose entity types are inherently local to the encounter (e.g., patient and clinician entities) are excluded from ontology grounding because they do not correspond to global biomedical concepts in UMLS.
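A toy stand-in for the primary lookup, using Python's sqlite3 in place of the MySQL deployment and an MRCONSO-style table (the ranking logic here is reduced to an exact case-insensitive match for brevity; the column names follow the UMLS MRCONSO schema):

```python
import sqlite3

def lookup_concept(conn, surface, sabs=("SNOMEDCT_US", "RXNORM", "LNC")):
    """Query an MRCONSO-style table for English, non-suppressed rows in the
    allowed source vocabularies, matching the candidate string
    case-insensitively. Returns (CUI, SAB) or None."""
    placeholders = ",".join("?" * len(sabs))
    row = conn.execute(
        f"SELECT CUI, SAB FROM MRCONSO "
        f"WHERE LAT='ENG' AND SUPPRESS='N' AND SAB IN ({placeholders}) "
        f"AND LOWER(STR)=LOWER(?) LIMIT 1",
        (*sabs, surface),
    ).fetchone()
    return row
```

In the real pipeline, near-matches are additionally scored by lexical similarity with a preference for terminology-preferred term types, and the top concept is kept only above a similarity threshold.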
We also normalize measurement units when present by mapping unit strings to Unified Code for Units of Measure (UCUM) representations. This reduces spurious mismatches caused by spelling and formatting variations in units and supports consistent downstream handling of laboratory values and vital signs.

2.5.2. Embedding-Based Fallback Retrieval Using FAISS

Exact or near-exact string matching is not always sufficient in conversational transcripts due to abbreviations, paraphrases, partial strings, and informal phrasing. To improve robustness, we implemented a semantic fallback retrieval method based on dense embeddings and nearest-neighbor search. We constructed a vector index from MRCONSO strings and their associated CUIs and source vocabularies, then used similarity search to retrieve the closest ontology strings when the primary database lookup was low-confidence or empty.
For embedding generation, we used SapBERT, a biomedical encoder trained to align UMLS synonyms in embedding space, which is well suited to mapping surface forms to ontology concepts. We embedded MRCONSO strings offline and stored them in a FAISS index. FAISS is a vector similarity search library designed for efficient nearest-neighbor retrieval over large collections of dense vectors, supporting exact and approximate search strategies for high-dimensional embeddings. In our implementation, we used inner-product search over normalized embeddings so that nearest neighbors correspond to cosine-similar terms.
At inference time, when fallback retrieval is triggered, the entity candidate string is embedded with the same SapBERT encoder and queried against the FAISS index to obtain the top semantic matches. The highest-scoring match above a fixed similarity threshold is used to propose a grounded concept; otherwise the entity remains ungrounded and is carried forward as a surface-form node.
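Because inner-product search over L2-normalized vectors is equivalent to cosine similarity, the fallback scoring can be sketched without the FAISS dependency as follows (a brute-force illustration; `index` stands in for the offline SapBERT/FAISS index):

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def nearest_concept(query_vec, index, threshold=0.7):
    """Inner product over normalized vectors equals cosine similarity,
    mirroring inner-product search over normalized embeddings. `index`
    maps (CUI, string) pairs to embedding vectors."""
    q = normalize(query_vec)
    best, best_score = None, -1.0
    for concept, vec in index.items():
        score = sum(a * b for a, b in zip(q, normalize(vec)))
        if score > best_score:
            best, best_score = concept, score
    # Below the threshold the entity remains ungrounded and is carried
    # forward as a surface-form node.
    return best if best_score >= threshold else None
```

The threshold value here is illustrative; the paper uses a fixed similarity threshold without specifying its value in this section.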

2.6. Node Construction and Attribute Assignment

After entity candidates are consolidated and grounded to ontology concepts when possible, the system converts each candidate into a KG node with a schema-guided set of attributes. This step transforms a variable collection of transcript mentions into a standardized node representation that can be validated, searched, and used as structured context for SOAP generation.
Each candidate is mapped to a node class that defines which attribute fields are allowed, ensuring structurally consistent nodes across the graph. Attributes are then populated using a constrained extraction procedure rather than free-form generation. For each candidate, the system builds a local evidence window consisting of the turns containing any of the candidate’s mentions plus the immediately adjacent turns. This evidence window is provided along with the candidate’s canonical name, its class, and the allowed attribute keys for that class. The LLM is instructed to populate only those keys and to omit any attribute not supported by the evidence, producing structured, grounded attribute dictionaries.
These class-specific attribute schemas encode clinical documentation conventions. For example, medication nodes capture regimen-level attributes, including whether the regimen is current, newly prescribed, updated, or discontinued, so dosing details attach to the correct regimen instance. Laboratory test nodes treat the test name as the anchor and attach observed values and units as attributes of that test entity. Person nodes store demographic descriptors as attributes of the referenced individual rather than as standalone nodes.
Each node stores its class label, canonical name, attributes, mentions, and the transcript provenance needed to trace it back to the source turns of its constituent mentions. After attribute extraction, we clean and standardize the node by keeping only schema-allowed attribute keys and filling minimal required defaults when needed, such as deriving a person name from the canonical name. Finally, we validate the node against the schema so invalid keys or incompatible attribute formats are removed before relationship extraction and SOAP generation.
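The cleanup-and-defaults step might be sketched as below, where the `schema` structure and its `defaults` mapping are hypothetical simplifications of the node classes described above:

```python
def clean_node(node, schema):
    """Keep only schema-allowed attribute keys for the node's class and
    fill minimal required defaults, e.g. deriving a person name from the
    canonical name when absent."""
    spec = schema[node["class"]]
    # Drop any attribute key the class schema does not allow.
    attrs = {k: v for k, v in node.get("attributes", {}).items()
             if k in spec["allowed_keys"]}
    # Fill required defaults from other node fields when missing.
    for key, source_field in spec.get("defaults", {}).items():
        attrs.setdefault(key, node[source_field])
    return {**node, "attributes": attrs}
```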

2.7. Relationship Generation

Given the finalized set of nodes, the pipeline constructs candidate node pairs and assigns a single directed clinical relation (or no relation) to each pair using an LLM with schema-constrained outputs. This step produces the typed, directed edges that connect entities into an encounter graph, as illustrated in Figure 4. We generate candidates in two stages. First, a pair of nodes is proposed for relationship consideration if at least one mention from each node occurs in the same turn or in adjacent turns. Second, we enforce coverage for the core person entities by forming node pairs with the patient and clinician nodes and every other node, even if they are not adjacent in the transcript. Because the patient and clinician are the primary participants in the encounter, they are likely to have meaningful relations with entities throughout the graph regardless of mention proximity. This ensures that relations such as diagnoses, endorses, or has_condition are always considered for them, which is reflected in the dense patient and clinician connections in Figure 4.
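The two-stage candidate construction can be sketched as follows (a simplified illustration; the node and turn structures are hypothetical):

```python
def propose_pairs(nodes, participant_ids):
    """Stage 1: propose a pair when the two nodes have mentions in the
    same or adjacent turns. Stage 2: force coverage by pairing the patient
    and clinician nodes with every other node regardless of proximity."""
    def near(a, b):
        return any(abs(ta - tb) <= 1
                   for ta in a["turns"] for tb in b["turns"])

    pairs = set()
    ids = sorted(nodes)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if near(nodes[a], nodes[b]):
                pairs.add((a, b))
    for p in participant_ids:
        for other in ids:
            if other != p:
                pairs.add(tuple(sorted((p, other))))
    return pairs
```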
Relation labeling is performed within a fixed list of clinically meaningful relation types. Each relation definition includes an English description and explicit constraints over allowable source and target node classes. These constraints are treated as hard requirements during inference and post-processing.
For each batch of candidate pairs, we construct a structured payload that includes the relation options, a set of relevant transcript turns, and per-pair structured node summaries. The turn context includes all turns in which either node is mentioned, along with immediate neighboring turns to preserve local discourse cues. Each node summary includes its canonical name, node class, constituent mentions, and extracted attributes. The LLM is instructed to assign at most one relation per candidate pair, selecting only from the provided relation list and otherwise returning no relation. The system prompt enforces an evidence-first standard and requires directionality to be explicit. Each output includes the chosen relation, direction, a brief explanation, and the set of supporting turn identifiers.
After labeling, each proposed relation is validated against the hard source and target constraints associated with that relation type. If the relation is incompatible with the node classes or entity types under the selected direction, it is replaced with the null relation to preserve schema validity. The final output is a list of relationship candidates augmented with the model-selected relation, direction, and evidence turn identifiers, which are then consumed by downstream SOAP section generation. The scale of the resulting encounter graphs across the evaluation sets is summarized in Table 1.
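The post hoc constraint check reduces to nulling out incompatible proposals; a minimal sketch under an assumed constraint table:

```python
def validate_relation(edge, nodes, constraints):
    """Replace a proposed relation with the null relation if its source or
    target node class violates the hard constraints for that relation type."""
    rel = edge["relation"]
    if rel is None:
        return edge
    allowed_src, allowed_tgt = constraints[rel]
    src_cls = nodes[edge["source"]]["class"]
    tgt_cls = nodes[edge["target"]]["class"]
    if src_cls not in allowed_src or tgt_cls not in allowed_tgt:
        # Preserve schema validity rather than keep an invalid edge.
        return {**edge, "relation": None}
    return edge
```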

2.8. SOAP Document Generation

After constructing the ontology-grounded KG for an encounter, we generate the clinical note in SOAP format by prompting an LLM to produce a single integrated note containing the standard SOAP sections. The model is instructed to follow the SOAP structure and to ground content in the provided inputs using explicit citations. Our prompting template is based on the SOAP-note prompt and two-shot prompting strategy published by DocLens [3]. We adapt the prompts to specify how the KG should be weighted relative to the transcript when both are provided. When the transcript is not included, the two-shot examples are modified to show only the example outputs without the transcript inputs. The overall instruction framing remains consistent across conditions.
To study how structured graph context interacts with the raw transcript and to quantify the utility of different KG components, we evaluate three prompting variants that differ in which sources are supplied to the LLM.

2.8.1. Variant 1: KG-Only

In the KG-only setting, the KG serves as the sole source of truth. The LLM receives the node list and relationship list and is instructed to produce SOAP content using only information supported by the graph. Each sentence includes citations referencing graph provenance through node and relationship identifiers, enabling evaluation of whether statements are grounded in the structured representation. This variant measures how effectively a structured representation supports note generation without direct access to the underlying dialogue and isolates the impact of graph completeness and correctness on documentation quality.

2.8.2. Variant 2: Transcript + KG

In the transcript + KG setting, the raw transcript is provided alongside the full node and relationship lists, with the transcript explicitly designated as the authoritative source. The graph serves as secondary context to support entity tracking, consolidation of repeated mentions, and organization, but any conflicts are instructed to be resolved in favor of the transcript. The LLM is instructed to ground each sentence in specific transcript turns by attaching turn-index citations rather than citing graph identifiers. This variant tests whether structured context improves organization and coverage when the model can still verify and anchor facts directly to the encounter dialogue.

2.8.3. Variant 3: Transcript + Nodes Only

In the transcript + nodes only setting, the LLM receives the transcript together with the node list, but the relationship list is omitted. This variant addresses two practical considerations. First, relationship lists can be lengthy and may cause the model to focus disproportionately on the structured relationships rather than attending adequately to the transcript itself. Second, relationships are a higher-variance signal than entity inventories because they depend on correct directionality and relation-label selection, which can introduce noise even when the entity set is accurate.
By supplying only the node list, this variant preserves an explicit inventory of encounter concepts while keeping the transcript available for factual grounding and narrative phrasing. It acts as an intermediate condition between transcript-only generation and full transcript + KG generation, helping isolate whether the primary benefit of the KG comes from entity consolidation or explicit relational structure.

2.9. Evaluation Framework

We evaluate note generation using two complementary frameworks. ACI-BENCH provides a widely used shared benchmark for generating SOAP-style notes from clinician–patient dialogue and reports multiple automatic metrics for full notes and section-level outputs [20]. However, similarity-based scores can fail to reflect clinically salient omissions and unsupported detail. We therefore complement ACI-BENCH with DocLens-style claim and citation evaluation, which decomposes notes into claims, checks whether each claim is supported by the dialogue, quantifies omissions and unsupported additions relative to a reference note, and evaluates citation support when evidence links are provided [3]. This combination follows reliability-oriented evaluation work that treats unsupported additions, omissions, and misrepresentations as distinct error types with different safety implications [17,18]. Together, these frameworks allow us to assess benchmark-level note quality, claim-level factual coverage, and provenance quality under a shared experimental setup.

2.9.1. ACI-BENCH Benchmark and Automatic Metrics

The ACI-BENCH benchmark is a shared setting for evaluating SOAP generation from clinician–patient conversations [20]. It provides three official test sets that reflect different shared task sources and task definitions. Each test set contains 40 transcript–note pairs that we use for evaluation. Test set 1 is drawn from ACL ClinicalNLP MEDIQA-Chat 2023 Task B, test set 2 is drawn from ACL ClinicalNLP MEDIQA-Chat 2023 Task C, and test set 3 is drawn from CLEF MEDIQA-SUM 2023 Task C. We evaluate each reported DocLens and KOSMOS configuration on all three test sets. Results for ACI-BENCH’s original models are taken from the benchmark publication and GitHub repository rather than regenerated in our experiments. Missing values in our tables therefore reflect omissions in those published artifacts. In contrast, all ACI-BENCH metrics reported for the DocLens and KOSMOS models are computed from notes generated in our own experiments using the DocLens and KOSMOS codebases.
ACI-BENCH selected complementary automatic metrics because clinical note quality is not well captured by any single score [20]. ROUGE-1 measures unigram overlap, rewarding reuse of salient clinical terms while remaining relatively tolerant to local reordering and paraphrase. ROUGE-2 measures bigram overlap, making it stricter about local phrasing and more likely to penalize semantically similar rewrites that change word pairs. ROUGE-L and ROUGE-Lsum are both based on the longest common subsequence (LCS) but differ in how matches are aggregated across sentences. ROUGE-L averages sentence-level LCS matches, whereas ROUGE-Lsum computes an LCS-based score over the full multi-sentence output, typically treating sentence boundaries as line breaks. This makes ROUGE-Lsum more sensitive to whole-note structure and sentence segmentation. BERTScore takes a different approach by comparing generated and reference notes using contextual token embeddings from a pretrained model. It aligns tokens by computing maximum cosine-similarity matches across the two texts. Precision averages the best matches from generated to reference tokens, recall averages the best matches from reference to generated tokens, and F1 combines both as their harmonic mean. BLEURT goes further as a learned similarity metric trained to predict candidate–reference quality beyond n-gram overlap. However, it still reflects general-purpose matching rather than clinical correctness. Finally, MEDCON targets clinical content agreement by extracting UMLS concepts from both the generated and reference notes using QuickUMLS string matching, restricting to clinically relevant UMLS semantic groups, and then computing precision, recall, and F1 over the resulting concept sets [20].
Following ACI-BENCH, we report the ROUGE and equal-weight Average score used in that benchmark to support direct comparison with prior work [20]. Although the ACI-BENCH Average is useful for benchmark comparison, its equal weighting is a convenience for standardized reporting rather than a clinically calibrated measure of note quality. The ACI-BENCH aggregate is computed by first averaging the lexical ROUGE metrics as
ROUGE = (ROUGE-1 + ROUGE-2 + ROUGE-Lsum) / 3,
then averaging that value with the semantic and concept metrics as
Average = (ROUGE + BERTScore-F1 + BLEURT + MEDCON) / 4.
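In code, the aggregate reduces to two nested means; for instance (key names are illustrative):

```python
def aci_bench_average(scores):
    """Compute the ACI-BENCH aggregate: the lexical ROUGE metrics are
    averaged first, then that mean is averaged equally with the semantic
    and concept metrics."""
    rouge = (scores["rouge1"] + scores["rouge2"] + scores["rougeLsum"]) / 3
    return (rouge + scores["bertscore_f1"] + scores["bleurt"]
            + scores["medcon"]) / 4
```

Note that the nesting means each ROUGE variant contributes 1/12 of the final Average, while BERTScore-F1, BLEURT, and MEDCON each contribute 1/4.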

2.9.2. DocLens Claim and Citation Evaluation

DocLens addresses a limitation of coarse similarity scores for medical text, where fluent outputs can score well even when they contain unsupported statements or omit clinically important details [3]. Rather than treating the note as a single block of text, DocLens decomposes the reference and generated notes into individual claims. An LLM is then prompted to evaluate each claim for support against the source dialogue and for agreement with the reference note. This yields separate measurements for factual coverage and unsupported additions, distinguishing recall failures from precision failures. DocLens also evaluates provenance by requiring each generated sentence to include citations to specific transcript turns. For each sentence, the framework gathers the cited turns as supporting context and uses the same LLM prompting approach to judge whether the cited evidence fully accounts for the sentence’s content. Sentences whose information is not completely supported by the referenced transcript turns are flagged as unsupported.
DocLens proposes a two-shot LLM prompting strategy for SOAP generation that uses the raw transcript for context and was originally evaluated with GPT-4 on ACI-BENCH test set 1 [3]. To enable comparison across a broader range of encounters, we reimplemented this baseline and evaluated it on all three ACI-BENCH test sets. Because our experiments use newer long-context models, we ran the DocLens prompt with both GPT-5.2 and GPT-4-turbo at temperature 0.0. The latter provides a closer proxy to the original GPT-4 setting while retaining sufficient context capacity for longer encounters. The released DocLens claim extraction stage iteratively generates 1 to 30 claims per note to cover salient content [3]. In our replication, we observed that many notes reached this maximum claim count. We therefore increased the extraction cap to 40 claims to reduce potential truncation of salient content and stabilize claim-level recall estimates.
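Under a simplified reading of these definitions (not the exact released DocLens implementation), the claim-level scores reduce to fractions over per-claim LLM support judgments:

```python
def claim_metrics(ref_claims_matched, gen_claims_matched, gen_claims_grounded):
    """Claim-level scores in the spirit of DocLens: recall is the fraction
    of reference claims covered by the generated note, precision the
    fraction of generated claims agreeing with the reference, and
    hallucination rate the fraction of generated claims unsupported by the
    source dialogue. Inputs are boolean lists of per-claim judgments."""
    return {
        "recall": sum(ref_claims_matched) / len(ref_claims_matched),
        "precision": sum(gen_claims_matched) / len(gen_claims_matched),
        "hallucination": 1 - sum(gen_claims_grounded) / len(gen_claims_grounded),
    }
```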

3. Results

This section reports performance on ACI-BENCH and complements those benchmark scores with DocLens-style claim and citation evaluation focused on factual support and provenance.

3.1. ACI-BENCH Evaluation Metrics

Table 2, Table 3 and Table 4 report ACI-BENCH scores for the three official test sets, each containing 40 transcript–note pairs, and include ROUGE, BERTScore, BLEURT, and MEDCON, along with an aggregate Average score when the benchmark artifacts provide all component metrics required to compute it. Figure 5 summarizes per-model ROUGE-Lsum, and Figure 6 summarizes the Average score aggregated across the three test sets, where missing points occur because several baseline rows in the published ACI-BENCH artifacts do not report the full set of evaluation metrics. The ACI-BENCH baselines include BART-Large sequence-to-sequence summarizers, including variants adapted with PubMed pretraining or SAMSum fine-tuning, and Longformer–Encoder–Decoder (LED) models. Rows marked “(Division)” indicate section-wise generation where each SOAP segment is generated separately and then concatenated for evaluation. In addition to these baselines, we report DocLens results for two OpenAI chat models under the DocLens prompting setup, and we report our KOSMOS results under three context conditions that vary the inputs provided to the generator: KG-only, Transcript + Nodes, and Transcript + KG.
Table 2 reports Test Set 1 results across all reported metrics. The strongest ROUGE scores come from the section-wise BART Large SAMSum (Division) baseline, which leads ROUGE-1, ROUGE-2, and ROUGE-Lsum by a clear margin over the other ROUGE baselines. In contrast, the best ROUGE-L is achieved by KOSMOS GPT-5.2 (Transcript + KG), edging out the next closest models, KOSMOS GPT-5.2 (Transcript + Nodes) and BART Large SAMSum (Division). For semantic similarity, BERTScore shows a split pattern. BART Large SAMSum (Division) leads precision and F1 while narrowly losing to KOSMOS GPT-5.2 (Transcript + KG) on recall. For BLEURT and MEDCON, KOSMOS GPT-5.2 (Transcript + KG) is highest, and the GPT-based models substantially outperform the non-GPT baselines on MEDCON. KOSMOS GPT-5.2 (Transcript + KG) attains the highest Average score, followed by KOSMOS GPT-5.2 (Transcript + Nodes), indicating a tight race between the two best-performing KOSMOS configurations.
Table 3 reports Test Set 2 results across all reported ACI-BENCH metrics. The KOSMOS GPT-5.2 configurations lead the ROUGE metrics overall, with Transcript + KG achieving the highest ROUGE-1, ROUGE-L, and ROUGE-Lsum, while the section-wise BART Large SAMSum (Division) baseline attains the best ROUGE-2. Several division-based baselines omit ROUGE-L and the embedding-based metrics, so comparisons on BERTScore, BLEURT, and the aggregate are limited to the fully reported rows. Within the semantic similarity metrics, DocLens GPT-4-turbo leads BERTScore precision and F1 and ties for the top BLEURT score, whereas KOSMOS GPT-5.2 (Transcript + KG) narrowly achieves the best BERTScore recall. On clinical content agreement, KOSMOS GPT-5.2 (Transcript + KG) yields the highest MEDCON, with DocLens GPT-5.2 improving over GPT-4-turbo on MEDCON. The overall pattern is reflected in the aggregate, where KOSMOS GPT-5.2 (Transcript + KG) attains the highest Average score, followed by KOSMOS GPT-5.2 (Transcript + Nodes), indicating consistent gains from adding transcript context and incorporating the full KG.
Table 4 reports Test Set 3 results across all reported ACI-BENCH metrics. The baseline rows representing the published ACI-BENCH data report only ROUGE-1, ROUGE-2, ROUGE-Lsum, and MEDCON scores, with ROUGE-L and all BERTScore, BLEURT, and Average values missing, so comparisons on embedding-based metrics and the aggregate are only meaningful for DocLens and the KOSMOS variants. For ROUGE, the section-wise BART Large SAMSum (Division) baseline attains the best ROUGE-1 and ROUGE-2, continuing its advantage on lexical overlap metrics. In contrast, within the fully reported rows, KOSMOS GPT-5.2 (Transcript + KG) is strongest overall, achieving the best ROUGE-L and ROUGE-Lsum and also leading BERTScore recall, BLEURT, MEDCON, and the Average score. DocLens GPT-4-turbo again leads BERTScore precision and F1, while DocLens GPT-5.2 improves MEDCON and Average relative to GPT-4-turbo. Across the KOSMOS variants, adding transcript context continues to improve over KG-only, and incorporating the full KG yields moderate, consistent gains over Transcript + Nodes across most metrics.
Figure 5 and Figure 6 visualize cross-test-set performance trends for all baselines on two summary metrics. Figure 5 plots the average ROUGE score for each system across Test Sets 1 through 3, enabling direct comparison of lexical overlap trends across the evaluation splits. Figure 6 provides the analogous view for the reported Average score, summarizing overall ACI-BENCH metric performance. Figure 6 contains fewer plotted points than Figure 5 because the Average value is only defined for rows where all component metrics are present. In Table 3 and Table 4, several baseline rows include missing entries for BERTScore, BLEURT, and MEDCON, which are part of the reported aggregate, preventing computation of the Average for those rows.
Across test sets, Figure 5 shows three fairly stable tiers. The division-based BART baseline is consistently highest, the KOSMOS Transcript variants (especially Transcript + KG) form the next tier and stay close to each other, and the remaining non-GPT baselines sit noticeably lower, with LED (Division) lowest. The DocLens models are consistently competitive but sit below the top KOSMOS and division-based BART lines on average ROUGE. DocLens GPT-5.2 is slightly above GPT-4-turbo in all three tests.
In Figure 6, the set of points is narrower because only fully reported rows appear, but the ordering is consistent across tests. KOSMOS Transcript + KG performs best in every split, Transcript + Nodes is a close second, and KG-only trails both by a wide margin. DocLens GPT-5.2 again edges out GPT-4-turbo, but both remain below the best KOSMOS configurations on the aggregate and above the KG-only KOSMOS and non-LLM results.

3.2. DocLens Evaluation Metrics

We next report DocLens-style, claim-level evaluation for coverage and faithfulness. All claim extraction, matching, and transcript grounding judgments were performed using GPT-4o (2024-08-06) with temperature = 0.0 to maximize scoring determinism.
In Table 5, DocLens GPT-4-turbo consistently exhibits the lowest recall across all test sets, while the KG-only KOSMOS model falls between it and the remaining transcript-based methods. Among the top performers, KOSMOS GPT-5.2 (Transcript + Nodes) achieves marginally higher recall than DocLens GPT-5.2 across all three test sets. KOSMOS GPT-5.2 (Transcript + KG) performs comparably to DocLens GPT-5.2, exceeding it slightly on two test sets and matching it on the third. The performance hierarchy remains consistent across test sets: Transcript + Nodes ranks highest, followed closely by Transcript + KG and DocLens GPT-5.2 in near parity, then KG-only, with GPT-4-turbo trailing behind.
In Table 6, precision scores are tightly clustered, with all methods falling between 69 and 73%. The KG-only KOSMOS variant achieves the lowest precision on Tests 2 and 3 and ranks second lowest on Test 1, resulting in the weakest overall average. The highest-performing model varies by test set: Transcript + KG leads on Test 1, Transcript + Nodes on Test 2, and DocLens GPT-4-turbo on Test 3. Averaged across all tests, Transcript + KG ranks first and Transcript + Nodes second, with both barely outperforming the two DocLens baselines.
In Table 7, hallucination rates more clearly distinguish the lower performing models than precision metrics do. DocLens GPT-4-turbo exhibits the highest hallucination rate across all test sets, averaging 4–5%. The KG-only KOSMOS variant consistently outperforms GPT-4-turbo but remains substantially worse than transcript-conditioned systems, with rates in the mid-2% range. By contrast, all transcript-conditioned GPT-5.2 methods maintain hallucination rates below 1% on every test set. Within this high-performing cluster, KOSMOS GPT-5.2 (Transcript + Nodes) achieves the lowest rate on each test set and the best overall average, demonstrating more consistent performance than KOSMOS GPT-5.2 (Transcript + KG) and DocLens GPT-5.2, both of which exhibit a spike on one test set.
Figure 7 summarizes Table 5, Table 6 and Table 7 by plotting each model’s average recall, precision, and grounded rate across the three test sets. Grounded rate, the complement of hallucination rate, was selected for visualization clarity. The two transcript-conditioned KOSMOS variants nearly overlap across all three metrics, indicating comparable performance regardless of whether structured context is provided as extracted nodes or as a full KG. Relative to DocLens GPT-5.2, the two KOSMOS transcript variants achieve similar precision and grounded rates while demonstrating a modest advantage in recall. The KG-only KOSMOS model performs lower across all three metrics, with DocLens GPT-4-turbo ranking lowest in overall claim-level support.
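The grounded rate plotted in Figure 7 is, as stated above, simply the complement of the per-set hallucination rate, macro-averaged across test sets. A minimal sketch of that computation (function names are illustrative, not part of the KOSMOS codebase):

```python
def grounded_rate(hallucination_rate_pct: float) -> float:
    """Grounded rate: the complement of the hallucination rate,
    with both expressed as percentages."""
    return 100.0 - hallucination_rate_pct

def macro_average(per_set_values) -> float:
    """Average a metric across the test sets for plotting."""
    return sum(per_set_values) / len(per_set_values)
```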
Across the pooled evaluation of 120 transcript-note pairs, the paired Wilcoxon signed-rank results in Table 8, Table 9 and Table 10 show a clear separation between recall and hallucination behavior versus precision. For recall in Table 8, DocLens GPT-4-turbo and the KOSMOS GPT-5.2 KG-only condition are both significantly lower than DocLens GPT-5.2, with p-values under 5% and confidence intervals that are entirely negative, indicating consistent underperformance. The two transcript-conditioned KOSMOS variants show positive mean differences and confidence intervals that lean positive, though the differences do not reach statistical significance at the 5% threshold. Precision behaves differently in Table 9, where none of the methods show a statistically significant difference from DocLens GPT-5.2 at the 5% threshold and all confidence intervals include zero. Hallucination rate in Table 10 also exhibits strong effects. DocLens GPT-4-turbo and KOSMOS GPT-5.2 KG-only again show significantly higher hallucination rates than DocLens GPT-5.2, with p-values under 5% and fully positive intervals, while the transcript-conditioned KOSMOS variants show no significant differences and intervals tightly centered around zero. Overall, Table 8, Table 9 and Table 10 suggest that the largest statistically reliable differences relative to DocLens GPT-5.2 arise in recall and hallucination rate, while precision remains comparatively stable across methods.
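The paired analysis described above can be sketched as follows, assuming per-pair metric scores for two systems over the same 120 transcript–note pairs. This is an illustrative reconstruction, not the paper's actual analysis script: it runs a two-sided Wilcoxon signed-rank test on the paired differences and estimates a bootstrap percentile 95% confidence interval on the mean difference.

```python
import numpy as np
from scipy.stats import wilcoxon

def paired_comparison(scores_a, scores_b, n_boot=5000, seed=0):
    """Paired Wilcoxon signed-rank test on per-pair differences (a - b),
    plus a bootstrap percentile 95% CI on the mean difference."""
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    stat, p = wilcoxon(diffs)  # two-sided by default
    rng = np.random.default_rng(seed)
    boot_means = [
        rng.choice(diffs, size=diffs.size, replace=True).mean()
        for _ in range(n_boot)
    ]
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return {"p": p, "mean_diff": diffs.mean(), "ci95": (lo, hi)}
```

A CI that straddles zero, together with p above the 5% threshold, corresponds to the non-significant recall trends reported for the transcript-conditioned variants.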
Table 11 presents citation accuracy results across the three ACI-BENCH test sets, evaluated using the DocLens citation evaluation script. For each generated SOAP sentence, the evaluator determines whether the cited transcript turns provide sufficient evidence to fully support the clinical content. We report the percentage of adequately supported sentences for each test set, along with the average across all sets. GPT-4o served as the evaluation model with temperature set to 0.0. The results reveal two distinct performance clusters. DocLens GPT-5.2 and both transcript-conditioned KOSMOS variants, Transcript + Nodes and Transcript + KG, consistently achieve high citation accuracy across all three test sets, with the KOSMOS conditions slightly exceeding DocLens GPT-5.2 overall. In contrast, DocLens GPT-4 Turbo exhibits lower citation accuracy that declines progressively from Test 1 through Tests 2 and 3. The KG-only KOSMOS condition also underperforms on average, showing greater variability across test sets and reaching its lowest accuracy on Test 3.
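The citation metric in Table 11 reduces to a per-sentence support judgment aggregated in two steps. The judge itself is the DocLens evaluation model (GPT-4o here) and is abstracted below as a list of booleans; the function names are illustrative:

```python
def citation_accuracy(supported_flags) -> float:
    """Percentage of generated SOAP sentences whose cited transcript
    turns were judged sufficient to fully support their clinical
    content. One boolean per generated sentence."""
    if not supported_flags:
        return 0.0
    return 100.0 * sum(supported_flags) / len(supported_flags)

def average_accuracy(per_test_set) -> float:
    """Macro-average of citation accuracy across the test sets."""
    return sum(per_test_set) / len(per_test_set)
```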

4. Discussion

Across all three ACI-BENCH test sets, the strongest and most consistent performance comes from augmenting the ontology-typed knowledge graph with transcript context. Transcript plus KG ranks first on the Average score for all three test sets, with Transcript plus Nodes following closely behind. Both transcript-conditioned KG settings substantially outperform KG-only across every split. This pattern holds across heterogeneous test sets from different shared tasks, suggesting that structured, ontology-grounded context works best when it complements rather than replaces the original dialogue. The transcript appears to supply details that are easily lost during extraction and consolidation, including temporal qualifiers, attribution cues, and other encounter-specific nuances that help match expert notes. These benchmark-level results are consistent with H1b, H2a, and H3a. Transcript plus KG ranks first across test sets, Transcript plus Nodes remains close behind, and KG-only underperforms the transcript-conditioned settings.
Among the transcript-conditioned settings, the performance gap between Transcript plus Nodes and Transcript plus KG is consistently small compared to the large gap between KG-only and any transcript-based condition. This suggests that most of the benefit comes from providing a clean, ontology-typed entity inventory, while relationship structure offers smaller incremental gains. The ACI-BENCH metric suite is primarily sensitive to whether the right clinical concepts and phrases appear in the generated note, not whether the underlying representation captures correct inter-entity structure. If the node inventory already highlights the correct problems, medications, findings, and plans, then overlap-based and concept-based metrics can approach saturation without requiring explicit use of relationships. Additionally, relation prediction noise can introduce distractors or incorrect constraints that reduce the expected advantage of including edges.
A second pattern is that the division-based BART baseline often achieves strong ROUGE scores while performing less competitively on MEDCON. In contrast, the transcript-conditioned KOSMOS settings frequently lead on MEDCON and BLEURT, indicating stronger agreement on clinically salient concepts and overall semantic alignment even when exact n-gram overlap is slightly lower. This distinction matters because our ontology-guided structure is intended to improve content fidelity and clinical consistency, and concept-level agreement is closer to that objective than surface-form overlap alone. However, these automatic metrics still do not separate supported content from unsupported additions, so they cannot establish whether improved benchmark scores reflect stronger grounding.
The baseline families represented in ACI-BENCH each targeted a different limitation in clinical note generation. BART-Large provides a strong general-purpose sequence-to-sequence summarization backbone, LED extends that paradigm to much longer inputs, and BioBART adds biomedical domain pretraining to the same model family. When evaluated on ACI-BENCH, division-based generation generally outperformed full-note generation for the BART and LED families, suggesting that output-side decomposition helps these models manage SOAP structure and long conversational context [20]. DocLens takes a different approach, showing that strong transcript-conditioned prompting with in-context examples provides a competitive non-finetuned baseline [3]. KOSMOS addresses yet another bottleneck by restructuring encounter evidence before generation through an ontology-typed intermediate representation rather than changing the base sequence-to-sequence architecture, extending the context window, or relying solely on prompt design. Relative to these baselines, the main advantage of KOSMOS is not stronger lexical overlap, since division-based BART remains competitive or superior on ROUGE, but stronger concept fidelity and semantic alignment, as reflected more clearly in MEDCON and BLEURT.
Because ACI-BENCH captures reference alignment but not direct claim support, we complement it with claim-level and citation-based analyses against the transcript evidence. DocLens GPT-5.2 serves as our primary baseline for claim-level evaluation, representing a strong transcript-conditioned prompting approach without KG augmentation. A dominant trend in the claim-level results is that model capability strongly influences both coverage and faithfulness. DocLens GPT-4 Turbo performs substantially worse, achieving the lowest recall at 62.75% on average and the highest hallucination rate at 4.37% on average, including cases where the model fabricates patient last names that do not appear in the transcript. Upgrading to GPT-5.2 with the same prompting strategy yields large improvements, increasing average recall to 81.48% while reducing hallucination to 0.58%. This gap highlights a practical advantage of API-served foundation models, as meaningful reliability gains can often be obtained by adopting stronger models without rebuilding training data or retraining task-specific systems.
Against this baseline, KG-only context does not yet function as a reliable substitute for transcript access in our current pipeline. The KG-only setting achieves 76.65% average recall and a 2.56% average hallucination rate, which is worse than any transcript-conditioned GPT-5.2 condition, though it still outperforms the older DocLens GPT-4 Turbo baseline. This pattern suggests that extracting and consolidating an encounter into a graph can discard clinically important qualifiers and attribution cues that remain available when the generator can directly consult the dialogue. Precision results reinforce this limitation, since KG-only achieves the lowest average claim precision (69.59%) and does not improve over transcript-only prompting, indicating that restricting context to the distilled graph does not reduce unnecessary or unsupported additions in practice.
Within the transcript-conditioned settings, Transcript plus Nodes and Transcript plus KG perform similarly across all three test sets and are not statistically distinguishable from the DocLens GPT-5.2 baseline under paired Wilcoxon analysis. At the level of pooled recall, Transcript plus Nodes shows the strongest trend toward improvement with a 95% confidence interval of [−0.16, 2.75]% and p = 13.16%, while Transcript plus KG shows a weaker trend with a 95% confidence interval of [−0.51, 2.33]% and p = 28.17%. The raw averages match this pattern. Transcript plus Nodes improves recall by roughly 1 to 2 percentage points and achieves the lowest hallucination rate at 0.41% on average. Claim precision is also highest for Transcript plus KG (71.90%) and remains comparable for Transcript plus Nodes (71.44%) and DocLens GPT-5.2 (71.10%), suggesting a possible coverage benefit without an obvious precision penalty. Taken together, these results provide suggestive evidence that supplying a clean ontology-typed node inventory alongside the transcript can improve coverage without increasing unsupported content, while relationship information offers smaller incremental gains that may be attenuated or undermined by relation prediction noise. Because the pooled evaluation includes 120 transcript-note pairs and the confidence intervals include zero, additional samples are required to determine whether these differences reflect a statistically reliable effect rather than sampling variability. These claim-level findings provide only partial support for H2b and H3b. For H2b, Transcript plus KG improves citation accuracy over transcript-only prompting and maintains comparable precision and low hallucination, but it does not provide the clearest recall advantage, since Transcript plus Nodes shows the strongest recall trend. For H3b, the node-only condition does not behave as a strictly intermediate setting on every claim-level metric.
Instead, it often matches or slightly exceeds Transcript plus KG on recall and hallucination while remaining close on precision and provenance. This pattern suggests that a clean ontology-typed node inventory captures most of the claim-level benefit of structured context, whereas relation information provides smaller and less consistent gains under the current extraction quality.
Provenance results further clarify when structured context improves reliability. Citation accuracy is consistently high when the transcript remains available, with Transcript plus Nodes and Transcript plus KG achieving approximately 91 to 92% accuracy on average compared to 88.59% for the DocLens GPT-5.2 baseline. This roughly three percentage point improvement suggests that adding an ontology-typed entity inventory can improve evidence selection and reduce citation drift, strengthening provenance in addition to content quality. This provides the clearest support for H2b, while the near parity between Transcript plus Nodes and Transcript plus KG indicates that H3b is only partially supported for provenance. In contrast, KG-only performs substantially worse and introduces an evaluation mismatch because its provenance is expressed as links to nodes and relations rather than transcript turns. To score KG-only under the DocLens evaluator, these links must be mapped back to mention-level turn indices, which often yields overly broad evidence sets when nodes aggregate multiple mentions or when relations are supported across dispersed dialogue segments. This loss of citation specificity can penalize otherwise correct outputs and contributes to lower and more variable citation accuracy for KG-only in this evaluation setting.
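The evaluation mismatch described above can be made concrete. To score KG-only outputs with the DocLens evaluator, node- and relation-level provenance must be flattened into transcript turn indices. The sketch below assumes hypothetical data structures (the field names `turns`, `source`, and `target` are illustrative, not the actual KOSMOS schema) and shows why the mapping over-cites: a node contributes the turns of every mention it consolidates, and a relation contributes the turns of both endpoints.

```python
def kg_provenance_to_turns(cited_ids, nodes, relations):
    """Flatten node/relation provenance links into transcript
    turn indices for the DocLens citation evaluator.

    nodes:     {node_id: {"turns": [int, ...]}}  turns of all mentions
               the node consolidates
    relations: {rel_id: {"source": node_id, "target": node_id}}

    Returns sorted, de-duplicated turn indices. Because nodes
    aggregate mentions, the evidence set can include turns that are
    unrelated to the specific claim being cited.
    """
    turns = set()
    for cid in cited_ids:
        if cid in nodes:
            turns.update(nodes[cid]["turns"])
        elif cid in relations:
            rel = relations[cid]
            turns.update(nodes[rel["source"]]["turns"])
            turns.update(nodes[rel["target"]]["turns"])
    return sorted(turns)
```

Citing a single relation whose endpoints were mentioned in dispersed dialogue segments already yields a broad evidence set, which illustrates the loss of citation specificity noted above.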
Taken together, the ACI-BENCH and DocLens evaluations support different parts of the revised hypothesis set with varying degrees of strength. At the benchmark level, the results are consistent with H1b, H2a, and H3a, in that KG-only reduces completeness and alignment, Transcript plus KG yields the strongest overall benchmark performance, and Transcript plus Nodes remains close to the full KG condition while clearly outperforming KG-only. These benchmark-level conclusions are driven by consistent ordering across test sets rather than by formal significance testing. At the claim level, the results do not support H1a, since KG-only increases unsupported content relative to transcript-only prompting. H2b receives partial support, with the clearest evidence coming from the statistically significant citation gains. H3b also receives only partial support, since Transcript plus Nodes often behaves as a near-match to Transcript plus KG rather than as a strictly intermediate condition, and the remaining differences are small, not statistically distinguishable, and associated with confidence intervals that include zero. Overall, the current evidence supports transcript-augmented ontology-guided context most clearly at the benchmark level and for provenance, while the finer-grained claim-level advantages remain suggestive rather than definitive.
At the same time, the marginal differences between Transcript plus KG and Transcript plus Nodes remain consistently small. This suggests that much of the measurable benefit comes from exposing a clean ontology-grounded inventory of clinically meaningful entities, while explicit relationship structure adds only modest gains in the current pipeline. If this pattern holds in larger studies, the node-only setting may be the more attractive near-term configuration when efficiency and engineering complexity matter. That practical asymmetry matters because relationship construction is also the most time-consuming stage of the KOSMOS pipeline. In deployment settings, this makes ontology-grounded nodes with provenance the highest-yield structured context to pair with the transcript, while relationship extraction is better treated as an optional refinement until higher-precision methods can provide more consistent gains.
From a workflow perspective, the current results suggest a modest and supportive role for KOSMOS rather than a transformative change in documentation practice. The improvements observed here are generally small, so the system should be understood as a potential aid to transcript-conditioned note generation and review, not as a replacement for clinician judgment or a guarantee of substantially lower documentation burden. In particular, any workflow that required clinicians to directly inspect or edit the knowledge graph itself could add complexity and reduce usability. The more plausible near-term value is that the ontology-grounded structure operates in the background to organize encounter evidence for generation, while provenance is still presented back to the clinician through links to the source transcript rather than through the graph representation itself. This design minimizes the need for direct interaction with the KG and may make verification somewhat easier when reviewing generated notes, but the current evidence supports only modest practical gains and continued clinician oversight.

Limitations

These findings should be interpreted in light of four limitations. The KG construction pipeline is not independently evaluated, all conditions rely on a single strong model, the observed improvements are generally modest and inconsistent across comparisons, and the practical workflow value of KOSMOS remains untested. Together, these constraints mean that while the results are encouraging, they do not yet establish that the full pipeline is robust across model settings, evaluation conditions, or real clinical workflows.
The KG construction pipeline, the central methodological contribution of KOSMOS, is not independently evaluated. No separate precision or recall measures are reported for entity extraction, relation extraction, attribute assignment, or UMLS grounding, so the quality of the intermediate representation is only observed indirectly through downstream note generation results. This makes it difficult to determine how much of the observed performance reflects the value of the ontology-guided architecture itself versus errors introduced during extraction, consolidation, or grounding. The question is sharpened by the downstream results, which suggest that ontology-grounded nodes provide more reliable gains than explicit relationships, potentially indicating that relation quality remains a limiting factor in the current pipeline.
All KOSMOS conditions are evaluated with GPT-5.2, making it difficult to separate the contribution of the proposed intermediate representation from the capabilities of the underlying model. Strong foundation models may already recover or infer some of the structure that KOSMOS makes explicit, while weaker models may depend more heavily on organized intermediate context to maintain coverage and consistency. Without evaluating the same KOSMOS conditions across a broader range of model families or capability levels, the study cannot determine how robust these effects are beyond this particular generation setting. The results therefore support the usefulness of ontology-guided context in combination with a strong transcript-conditioned model, but stop short of establishing how well the approach generalizes to less capable or differently trained generators.
The observed improvements over the strongest baseline are generally modest and do not consistently reach statistical significance across all comparisons. Several trends favor the transcript-conditioned KOSMOS settings, but the current sample size limits statistical power and makes it difficult to determine whether the smaller differences reflect stable advantages or sampling variability. This uncertainty is sharpest for the comparison between Transcript plus Nodes and Transcript plus KG, where the observed gaps are small and the confidence intervals include zero. Some findings are more consistent, however, including the benchmark-level improvements under ACI-BENCH and the gains in citation accuracy. The present results therefore support the value of ontology-guided intermediate context while leaving the size and consistency of its benefits not fully resolved across every evaluation setting.
The practical workflow value of KOSMOS is also inferred rather than directly tested. Although the transcript-conditioned settings show modest improvements in benchmark quality and citation accuracy, the study does not measure whether clinicians would review notes faster, trust them more, or make fewer corrections in practice. This gap matters most because any workflow requiring clinicians to directly inspect or edit the knowledge graph could increase complexity rather than reduce burden. The current design aims to minimize this risk by surfacing provenance through links back to the source transcript rather than through the graph itself, but the usability of that interaction has not been evaluated. The present results therefore support KOSMOS as a plausible assistive layer for documentation, while leaving its value as a demonstrated improvement to clinical workflow an open question.

5. Conclusions

This paper examined whether an encounter-specific, ontology-grounded knowledge graph can improve the reliability of SOAP note generation from clinical dialogue. We introduced KOSMOS, a transcript-to-SOAP framework that converts long, heterogeneous conversations into a schema-constrained, ontology-typed representation of clinically meaningful entities, attributes, and relations, grounded to UMLS concepts and tied to supporting evidence. As intermediate context between the dialogue and the final note, this representation is designed to make salient clinical content easier for an LLM to access and organize while preserving the traceability needed for verification in high-stakes settings.
Across ACI-BENCH and DocLens-style evaluation, the strongest gains come from using the KG to supplement the transcript rather than replace it. The transcript-conditioned KOSMOS variants consistently outperform KG-only and achieve the highest raw benchmark averages in our comparisons. Relative to the strongest transcript-only baseline, they maintain very low hallucination rates, show an approximately three percentage point citation accuracy gain over DocLens GPT-5.2, and show only suggestive, non-significant advantages on some claim-level measures, including recall. Overall, these findings suggest that ontology-guided structure is most effective as support for transcript-conditioned generation rather than as a substitute for dialogue access.
A key direction for future work is improving KG construction quality, especially for relationships, through automated error detection and correction. We plan to develop a graph error detector that identifies likely inconsistencies, omissions, and unsupported extractions, then triggers targeted analysis and correction prompts. Improving KG fidelity should strengthen both node and relation quality so that structured context can contribute more decisively to downstream generation and more reliably translate into improved SOAP note quality.

Author Contributions

Conceptualization, R.H. and J.G.; methodology, R.H.; software, R.H.; validation, R.H. and J.G.; formal analysis, R.H.; investigation, R.H.; resources, J.G.; data curation, R.H.; writing—original draft preparation, R.H.; writing—review and editing, J.G.; visualization, R.H.; supervision, J.G.; project administration, J.G.; funding acquisition, J.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Science Foundation (NSF) under Award No. 2333836. The views expressed in this article are those of the authors and do not necessarily reflect those of the NSF.

Data Availability Statement

The full implementation of KOSMOS, including the source code, executable, and evaluation scripts, is available at https://github.com/ryanwaynehenry/KOSMOS (accessed on 15 January 2026). This study did not generate new datasets.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ACI-BENCH: Ambient Clinical Intelligence Benchmark
ATC: Agent Telemetry and Control
BART: Bidirectional and Auto-Regressive Transformers
BioBART: Biomedical BART
BLEURT: Bilingual Evaluation Understudy with Representations from Transformers
CLEF: Conference and Labs of the Evaluation Forum
EHR: Electronic Health Record
GPT: Generative Pre-trained Transformer
IV&V: Independent Verification and Validation
KG: Knowledge Graph
KG-RAG: Knowledge Graph-based Retrieval-Augmented Generation
KOSMOS: Knowledge Graph Ontology Supported Medical Output System
LAT: Language (UMLS MRCONSO field)
LCS: Longest Common Subsequence
LED: Longformer–Encoder–Decoder
LLM: Large Language Model
LOINC: Logical Observation Identifiers Names and Codes
MDDT: Medical Device Development Tool
MEDCON: Medical Concept Overlap metric (UMLS concept agreement metric in ACI-BENCH)
MEDIQA: Medical Information Query Answering
MRCONSO: Metathesaurus Concept Names and Sources (UMLS table/source file)
NSF: National Science Foundation
OpenIE: Open Information Extraction
ROUGE: Recall-Oriented Understudy for Gisting Evaluation
RxNorm: Normalized Names for Clinical Drugs
SAB: Source Abbreviation (UMLS MRCONSO field)
SaMD: Software as a Medical Device
SAMSum: Summarization of Dialogues dataset
SNOMED CT: Systematized Nomenclature of Medicine—Clinical Terms
SOAP: Subjective, Objective, Assessment, and Plan
TA1/TA2/TA3: Technical Area 1/2/3
TTY: Term Type (UMLS MRCONSO field)
UCUM: Unified Code for Units of Measure
UMLS: Unified Medical Language System

References

  1. Noy, S.; Zhang, W. Experimental evidence on the productivity effects of generative artificial intelligence. Science 2023, 381, 187–192. [Google Scholar] [CrossRef] [PubMed]
  2. Magesh, V.; Surani, F.; Ryan, G.; Ritter, A.; Bansal, M. Hallucinations, Omissions, and Reality: Evaluating the Reliability of Large Language Models for Legal Summarization. J. Empir. Leg. Stud. 2025, 22, 216–242. [Google Scholar] [CrossRef]
  3. Xie, Y.; Zhang, S.; Cheng, H.; Liu, P.; Gero, Z.; Wong, C.; Naumann, T.; Poon, H.; Rose, C. DocLens: Multi-aspect Fine-grained Evaluation for Medical Text Generation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 649–679. [Google Scholar] [CrossRef]
  4. Neupane, S.; Tripathi, H.; Mitra, S.; Bozorgzad, S.; Mittal, S.; Rahimi, S.; Amirlatifi, A. CLINICSUM: Utilizing Language Models for Generating Clinical Summaries from Patient-Doctor Conversations. In Proceedings of the 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 15–18 December 2024; pp. 5050–5059. [Google Scholar] [CrossRef]
  5. Dhuliawala, S.; Komeili, M.; Xu, J.; Raileanu, R.; Li, X.; Celikyilmaz, A.; Weston, J. Chain-of-Verification Reduces Hallucination in Large Language Models. arXiv 2023, arXiv:2309.11495v2. [Google Scholar] [CrossRef]
  6. Shanafelt, T.D.; West, C.P.; Sinsky, C.; Trockel, M.; Tutty, M.; Wang, H.; Carlasare, L.; Dyrbye, L.N. Changes in Burnout and Satisfaction with Work-Life Integration in Physicians and the General US Working Population Between 2011 and 2023. Mayo Clin. Proc. 2025, 100, 1142–1158. [Google Scholar] [CrossRef]
  7. Centers for Medicare & Medicaid Services. Electronic Health Records Provider Fact Sheet; Documentation Matters Toolkit; 2015. Available online: https://www.cms.gov/medicare-medicaid-coordination/fraud-prevention/medicaid-integrity-education/downloads/docmatters-ehr-providerfactsheet.pdf (accessed on 9 March 2026).
  8. American Medical Association. What Is Physician Burnout? 2025. Available online: https://www.ama-assn.org/practice-management/physician-health/what-physician-burnout (accessed on 9 January 2026).
  9. Arndt, B.G.; Beasley, J.W.; Watkinson, M.D.; Temte, J.L.; Tuan, W.J.; Sinsky, C.A.; Gilchrist, V.J. Tethered to the EHR: Primary Care Physician Workload Assessment Using EHR Event Log Data and Time-Motion Observations. Ann. Fam. Med. 2017, 15, 419–426. [Google Scholar] [CrossRef]
  10. Young, R.A.; Burge, S.K.; Kumar, K.A.; Wilson, J.M.; Ortiz, D.F. A Time-Motion Study of Primary Care Physicians’ Work in the Electronic Health Record Era. Fam. Med. 2018, 50, 91–99. [Google Scholar] [CrossRef]
  11. Rizvi, R.F.; Harder, K.A.; Hultman, G.M.; Adam, T.J.; Kim, M.; Pakhomov, S.V.S.; Melton, G.B. A comparative observational study of inpatient clinical note-entry and reading/retrieval styles adopted by physicians. Int. J. Med. Inform. 2016, 90, 1–11. [Google Scholar] [CrossRef]
  12. Mess, S.A.; Mackey, A.J.; Yarowsky, D.E. Artificial Intelligence Scribe and Large Language Model Technology in Healthcare Documentation: Advantages, Limitations, and Recommendations. Plast. Reconstr. Surg. Glob. Open 2025, 13, e6450. [Google Scholar] [CrossRef]
  13. Quiroz, J.C.; Laranjo, L.; Kocaballi, A.B.; Berkovsky, S.; Rezazadegan, D.; Coiera, E. Challenges of developing a digital scribe to reduce clinical documentation burden. npj Digit. Med. 2019, 2, 114. [Google Scholar] [CrossRef]
  14. Krishna, K.; Khosla, S.; Bigham, J.; Lipton, Z.C. Generating SOAP Notes from Doctor-Patient Conversations Using Modular Summarization Techniques. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers); Zong, C., Xia, F., Li, W., Navigli, R., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 4958–4972. [Google Scholar] [CrossRef]
  15. Ben Abacha, A.; Yim, W.w.; Fan, Y.; Lin, T. An Empirical Study of Clinical Note Generation from Doctor-Patient Encounters. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics; Vlachos, A., Augenstein, I., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 2291–2302. [Google Scholar] [CrossRef]
  16. Michalopoulos, G.; Williams, K.; Singh, G.; Lin, T. MedicalSum: A Guided Clinical Abstractive Summarization Model for Generating Medical Reports from Patient-Doctor Conversations. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022; Goldberg, Y., Kozareva, Z., Zhang, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 4741–4749. [Google Scholar] [CrossRef]
  17. Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.; Madotto, A.; Fung, P. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. 2023, 55, 248. [Google Scholar] [CrossRef]
18. Asgari, E.; Montaña-Brown, N.; Dubois, M.; Khalil, S.; Balloch, J.; Au Yeung, J.; Pimenta, D. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. npj Digit. Med. 2025, 8, 274.
19. Ramprasad, S.; Ferracane, E.; Selvaraj, S.P. Generating more faithful and consistent SOAP notes using attribute-specific parameters. Proc. Mach. Learn. Res. 2023, 219, 631–649.
20. Yim, W.w.; Fu, Y.; Ben Abacha, A.; Snider, N.; Lin, T.; Yetisgen, M. Aci-bench: A Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation. Sci. Data 2023, 10, 586.
21. Hogan, A.; Blomqvist, E.; Cochez, M.; d’Amato, C.; de Melo, G.; Gutiérrez, C.; Kirrane, S.; Labra Gayo, J.E.; Navigli, R.; Neumaier, S.; et al. Knowledge Graphs. ACM Comput. Surv. 2021, 54, 71.
22. Donnelly, K. SNOMED-CT: The advanced terminology and coding system for eHealth. Int. J. Med. Inform. 2006, 75, 279–290.
23. Nelson, S.J.; Zeng, K.; Kilbourne, J.; Powell, T.; Moore, R. Normalized names for clinical drugs: RxNorm at 6 years. J. Am. Med. Inform. Assoc. 2011, 18, 441–448.
24. McDonald, C.J.; Huff, S.M.; Suico, J.G.; Hill, G.; Leavelle, D.; Aller, R.; Forrey, A.; Mercer, K.; DeMoor, G.; Hook, J.; et al. LOINC, a universal standard for identifying laboratory observations: A 5-year update. Clin. Chem. 2003, 49, 624–633.
25. Bodenreider, O. The Unified Medical Language System (UMLS): Integrating biomedical terminology. Nucleic Acids Res. 2004, 32, D267–D270.
26. National Library of Medicine (US). UMLS Knowledge Sources [Dataset on the Internet]; National Library of Medicine (US): Bethesda, MD, USA, 2024. Available online: http://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html (accessed on 15 July 2024).
27. Cimino, J.J. Desiderata for controlled medical vocabularies in the twenty-first century. Methods Inf. Med. 1998, 37, 394–403.
28. Davidson, R.; Hardman, W.; Amit, G.; Bilu, Y.; Della Mea, V.; Galaida, A.; Girshovitz, I.; Kulyabin, M.; Popescu, M.H.; Roitero, K.; et al. SNOMED CT entity linking challenge. J. Am. Med. Inform. Assoc. 2025, 32, 1397–1406.
29. Soman, K.; Rose, P.W.; Morris, J.H.; Akbas, R.E.; Smith, B.; Peetoom, B.; Villouta-Reyes, C.; Cerono, G.; Shi, Y.; Rizk-Jackson, A.; et al. Biomedical knowledge graph-optimized prompt generation for large language models. Bioinformatics 2024, 40, btae560.
30. Guo, R.; Devereux, B.; Farnan, G.; McLaughlin, N. LAB-KG: A retrieval-augmented generation method with knowledge graphs for medical lab test interpretation. In Proceedings of Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning @ COLING 2025; European Language Resources Association: Paris, France, 2025; pp. 40–50.
31. Jain, S.; Agrawal, A.; Saporta, A.; Truong, S.Q.; Duong, D.N.; Bui, T.; Chambon, P.; Zhang, Y.; Lungren, M.P.; Ng, A.Y.; et al. RadGraph: Extracting Clinical Entities and Relations from Radiology Reports. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks, Virtual, 7–10 December 2021.
32. Delbrouck, J.B.; Chambon, P.; Bluethgen, C.; Tsai, E.; Almusa, O.; Langlotz, C. Improving the Factual Correctness of Radiology Report Generation with Semantic Rewards. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 4348–4360.
33. Wang, H.; Niu, J.; Liu, X.; Wang, Y. Embracing Uniqueness: Generating Radiology Reports via a Transformer with Graph-Based Distinctive Attention. In Proceedings of the 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); IEEE: Piscataway, NJ, USA, 2022; pp. 581–588.
34. Delbrouck, J.B.; Chambon, P.; Chen, Z.; Varma, M.; Johnston, A.; Blankemeier, L.; Van Veen, D.; Bui, T.; Truong, S.; Langlotz, C. RadGraph-XL: A Large-Scale Expert-Annotated Dataset for Entity and Relation Extraction from Radiology Reports. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 12902–12915.
35. Mo, B.; Yu, K.; Kazdan, J.; Cabezas, J.; Mpala, P.; Yu, L.; Cundy, C.; Kanatsoulis, C.; Koyejo, S. KGGen: Extracting Knowledge Graphs from Plain Text with Language Models. arXiv 2025, arXiv:2502.09956v2.
36. Agrawal, G.; Kumarage, T.; Alghamdi, Z.; Liu, H. Can Knowledge Graphs Reduce Hallucinations in LLMs? A Survey. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers); Duh, K., Gomez, H., Bethard, S., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 3947–3960.
37. Choi, E.; Palomaki, J.; Lamm, M.; Kwiatkowski, T.; Das, D.; Collins, M. Decontextualization: Making Sentences Stand-Alone. Trans. Assoc. Comput. Linguist. 2021, 9, 447–461.
38. Angeli, G.; Johnson Premkumar, M.J.; Manning, C.D. Leveraging Linguistic Structure For Open Domain Information Extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers); Zong, C., Strube, M., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2015; pp. 344–354.
39. Chen, B.; Bertozzi, A.L. AutoKG: Efficient Automated Knowledge Graph Generation for Language Models. In Proceedings of the 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, 15–18 December 2023; pp. 3117–3126.
40. Edge, D.; Trinh, H.; Cheng, N.; Bradley, J.; Chao, A.; Mody, A.; Truitt, S.; Metropolitansky, D.; Ness, R.O.; Larson, J. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv 2024, arXiv:2404.16130.
41. Li, G.; Xu, Z.; Shang, Z.; Liu, J.; Ji, K.; Guo, Y. Empirical analysis of dialogue relation extraction with large language models. arXiv 2024, arXiv:2404.17802.
42. Cao, L.; Sun, J.; Cross, A. An Automatic and End-to-End System for Rare Disease Knowledge Graph Construction Based on Ontology-Enhanced Large Language Models: Development Study. JMIR Med. Inform. 2024, 12, e60665.
43. Haberle, T.; Cleveland, C.; Snow, G.L.; Barber, C.; Stookey, N.; Thornock, C.; Younger, L.; Mullahkhel, B.; Ize-Ludlow, D. The impact of nuance DAX ambient listening AI documentation: A cohort study. J. Am. Med. Inform. Assoc. 2024, 31, 975–979.
44. Morey, J.; Jones, D.; Walker, L.; Lindor, R.; Schupbach, J.; Mullan, A.; Heaton, H. Ambient Artificial Intelligence Versus Human Scribes in the Emergency Department. Ann. Emerg. Med. 2025.
45. Anderson, T.N.; Mohan, V.; Dorr, D.A.; Ratwani, R.M.; Biro, J.M.; Gold, J.A. Evaluating the Quality and Safety of Ambient Digital Scribe Platforms Using Simulated Ambulatory Encounters. Mayo Clin. Proc. Digit. Health 2025, 3, 100292.
Figure 1. Overview of the Knowledge graph Ontology Supported Medical Output System (KOSMOS) process. A raw transcript is first pre-processed to add turn labels and clarify the language. The system then identifies clinically relevant mentions and assigns each a type (e.g., problem, activity, medication, lab test, measurement). Mentions are then consolidated into canonical concepts that aggregate evidence across turns. Finally, the resulting concepts are converted into nodes and connected into a knowledge graph using typed, directed relationships to represent the encounter as an explicit set of entities and relations. Square brackets “[]” indicate the transcript turn a turn or mention comes from; curly braces “{}” indicate the set of transcript turns a concept is derived from.
Figure 2. Example encounter knowledge graph representing a simple doctor–patient visit. Nodes depict typed entities (patient, clinician, condition, symptom, lab test, medication, procedure, and activity) and directed edges encode typed relationships (e.g., evaluated_by, diagnosed, ordered_test, has_medication) that connect the patient to clinically relevant facts.
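The typed-node, typed-edge structure in Figure 2 can be sketched as a minimal data structure. The relation labels follow the figure; the `Node`/`EncounterKG` classes and the example entities are illustrative assumptions, not KOSMOS's internal representation:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    node_id: str
    node_type: str  # e.g., "patient", "condition", "medication"
    label: str

@dataclass
class EncounterKG:
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: list = field(default_factory=list)  # (source_id, relation, target_id)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, source: str, relation: str, target: str) -> None:
        # Only connect nodes that already exist, keeping the graph well-formed.
        if source in self.nodes and target in self.nodes:
            self.edges.append((source, relation, target))

# Hypothetical mini-encounter mirroring the figure's style.
kg = EncounterKG()
kg.add_node(Node("p1", "patient", "Patient"))
kg.add_node(Node("c1", "condition", "Type 2 diabetes"))
kg.add_node(Node("m1", "medication", "Metformin"))
kg.add_edge("p1", "diagnosed", "c1")
kg.add_edge("p1", "has_medication", "m1")
```

Because edges are typed and directed, the graph can be serialized as a flat list of (subject, relation, object) triples when passed to the LLM as context.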
Figure 3. KG construction pipeline used in KOSMOS. Each box shows a pipeline stage, with the stage name at the top and the artifact produced or updated at the bottom. The process transforms a raw transcript into a structured KG by segmenting and normalizing turns, rewriting pronouns for clarity, extracting and typing mentions, grouping mentions into candidate entities, grounding candidates to ontology concepts, constructing typed KG nodes with attributes, proposing relationship pair candidates, and selecting typed relationships to form the final graph.
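The early stages of the Figure 3 pipeline can be sketched as artifact-to-artifact functions. The real KOSMOS stages are LLM- and UMLS-backed; the function names, keyword vocabulary, and dictionary fields below are illustrative placeholders that only show how mentions carry turn provenance into candidate entities:

```python
def segment_turns(raw_transcript: str) -> list[str]:
    """Split the transcript into speaker turns and normalize whitespace."""
    return [turn.strip() for turn in raw_transcript.splitlines() if turn.strip()]

def extract_mentions(turns: list[str]) -> list[dict]:
    """Tag clinically relevant spans with a coarse type (placeholder keyword match)."""
    vocab = {"cough": "problem", "metformin": "medication", "a1c": "lab_test"}
    mentions = []
    for i, turn in enumerate(turns):
        for word, mtype in vocab.items():
            if word in turn.lower():
                mentions.append({"text": word, "type": mtype, "turn": i})
    return mentions

def group_mentions(mentions: list[dict]) -> list[dict]:
    """Consolidate repeated mentions into candidate entities with turn provenance."""
    entities = {}
    for m in mentions:
        key = (m["text"], m["type"])
        entities.setdefault(key, set()).add(m["turn"])
    return [{"label": k[0], "type": k[1], "turns": sorted(v)}
            for k, v in entities.items()]

transcript = ("Doctor: Any cough lately?\n"
              "Patient: Yes, and I still take metformin.\n"
              "Doctor: We will recheck your A1C.")
entities = group_mentions(extract_mentions(segment_turns(transcript)))
```

The `turns` sets on each candidate entity are what become the curly-brace provenance annotations shown in Figure 1; later stages would ground each candidate to a UMLS concept and propose typed relations between node pairs.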
Figure 4. Sample knowledge graph constructed from an ACI-BENCH encounter transcript. Nodes represent concepts extracted from the dialogue, and directed edges denote schema-constrained clinical relations predicted over candidate node pairs.
Figure 5. Average ROUGE scores across three ACI-BENCH test sets. Each value represents the mean of ROUGE-1, ROUGE-2, and ROUGE-Lsum, which measure n-gram overlap and sequence-level similarity between generated and reference notes. BART Large SAMSum (Division) achieves the highest overall ROUGE scores across all test sets. Our structured-context variants—KOSMOS GPT-5.2 (Transcript + Nodes) and KOSMOS GPT-5.2 (Transcript + KG)—consistently outperform the transcript-only DocLens baselines, demonstrating that incorporating extracted nodes or KGs improves summary fidelity.
Figure 6. Average ACI-BENCH aggregate scores across test sets. Each point summarizes overall note quality by first averaging ROUGE-1, ROUGE-2, and ROUGE-Lsum, then averaging that ROUGE mean with BERTScore F1, BLEURT, and MEDCON. Not all model series contain values for every test set because ACI-BENCH did not report the full metric bundle for every baseline between their paper and the released repository, so missing points are left blank rather than imputed. Across the reported test sets, the strongest results are achieved by KOSMOS GPT-5.2 (Transcript + KG) and KOSMOS GPT-5.2 (Transcript + Nodes), which form the top tier and sit above the DocLens GPT-5.2 and DocLens GPT-4-turbo baselines. DocLens models are superior to the non-LLM baselines, while KOSMOS (KG only) improves over several baselines but trails the transcript-conditioned DocLens and KOSMOS variants.
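The two-stage average described in the Figure 6 caption is plain arithmetic and can be written down directly. The function below is a sketch of that aggregation only; the metric key names are assumptions, and it is not claimed to reproduce the Average column of Tables 2–4:

```python
def aci_aggregate(scores: dict) -> float:
    """Figure 6 aggregate: average ROUGE-1/2/Lsum first, then average that
    ROUGE mean with BERTScore F1, BLEURT, and MEDCON."""
    rouge_mean = (scores["rouge1"] + scores["rouge2"] + scores["rougeLsum"]) / 3
    return (rouge_mean + scores["bert_f1"] + scores["bleurt"] + scores["medcon"]) / 4

# Illustrative values only (not taken from the paper's tables).
example = {"rouge1": 0.6, "rouge2": 0.3, "rougeLsum": 0.6,
           "bert_f1": 0.7, "bleurt": 0.5, "medcon": 0.7}
aggregate = aci_aggregate(example)
```

Averaging the three ROUGE variants first keeps the n-gram metrics from dominating the aggregate: they count as one component alongside the three learned/concept-based metrics rather than three.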
Figure 7. DocLens-style claim evaluation averages across models. Recall is the percentage of claims in the gold (reference) SOAP note that also appear in the generated SOAP note. Precision is the percentage of claims in the generated SOAP note that also appear in the gold SOAP note. Grounded Rate* is the percentage of generated claims that can be fully justified by the transcript without unsupported additions. KOSMOS GPT-5.2 (Transcript + Nodes) and KOSMOS GPT-5.2 (Transcript + KG) are nearly identical across all three metrics. Both are very close to DocLens GPT-5.2 in precision and grounding rate, while showing a slight edge in recall, suggesting improved coverage of gold claims without sacrificing support.
Table 1. Encounter KG size statistics across the three ACI-BENCH test sets.
Statistic | Min | Mean | Max
Number of Nodes | 25 | 55.21 | 98
Number of Relationships | 29 | 70.33 | 133
Table 2. Test Set 1 ACI Benchmark Results. The section-wise BART Large SAMSum (Division) baseline achieves the strongest ROUGE-1, ROUGE-2, and ROUGE-Lsum scores and also leads BERTScore precision and F1, reflecting the highest lexical overlap among baselines. The GPT-based systems are most competitive overall, particularly on MEDCON. Among the KOSMOS variants, adding transcript context improves over the KG-only setting, and the Transcript + KG configuration achieves the best ROUGE-L, BERTScore recall, BLEURT, MEDCON, and Average score. Gray-highlighted, bold values indicate the highest result in each metric column.
Model | Rouge-1 | Rouge-2 | Rouge-L | Rouge-Lsum | BERT-Precision | BERT-Recall | BERT-F1 | BLEURT | MEDCON | Average
BART Large | 0.4176 | 0.1920 | 0.2370 | 0.3470 | 0.6399 | 0.5707 | 0.6029 | 0.4105 | 0.4373 | 0.4255
BART Large SAMSum | 0.4087 | 0.1896 | 0.2302 | 0.3460 | 0.6432 | 0.5692 | 0.6034 | 0.4177 | 0.4207 | 0.4220
BART Large SAMSum (Division) | 0.5346 | 0.2508 | 0.2963 | 0.4862 | 0.6675 | 0.6828 | 0.6746 | 0.3852 | 0.4884 | 0.4754
BioBART | 0.3909 | 0.1724 | 0.2151 | 0.3319 | 0.6407 | 0.5694 | 0.6025 | 0.3844 | 0.4285 | 0.4089
BioBART (Division) | 0.4953 | 0.2247 | 0.2726 | 0.4492 | 0.6582 | 0.6704 | 0.6636 | 0.3573 | 0.4333 | 0.4411
LED (Division) | 0.3046 | 0.0693 | 0.1121 | 0.2666 | 0.5251 | 0.5847 | 0.5530 | 0.1859 | 0.3234 | 0.2773
DocLens GPT-4-turbo | 0.4915 | 0.1830 | 0.2731 | 0.4459 | 0.6453 | 0.6721 | 0.6579 | 0.4143 | 0.5766 | 0.4816
DocLens GPT-5.2 | 0.4970 | 0.1864 | 0.2908 | 0.4598 | 0.6277 | 0.6775 | 0.6513 | 0.4129 | 0.6003 | 0.4873
KOSMOS GPT-5.2 (KG only) | 0.4594 | 0.1790 | 0.2539 | 0.4235 | 0.5313 | 0.6313 | 0.5762 | 0.3998 | 0.5994 | 0.4608
KOSMOS GPT-5.2 (Transcript + Nodes) | 0.5217 | 0.2051 | 0.3036 | 0.4816 | 0.6280 | 0.6836 | 0.6542 | 0.4204 | 0.6055 | 0.4989
KOSMOS GPT-5.2 (Transcript + KG) | 0.5251 | 0.2117 | 0.3085 | 0.4852 | 0.6297 | 0.6847 | 0.6558 | 0.4243 | 0.6180 | 0.5049
Table 3. Test Set 2 ACI Benchmark Results. The KOSMOS GPT-5.2 variants outperform the baseline summarization models on most ROUGE metrics, with Transcript + KG achieving the best Rouge-1, Rouge-L, and Rouge-Lsum, while the division-based BART model attains the best Rouge-2. The OpenAI models remain strong, with DocLens GPT-4-turbo leading BERTScore precision and F1 and tying for the top BLEURT score, while DocLens GPT-5.2 improves MEDCON over GPT-4-turbo. Among the KOSMOS variants, adding the transcript context consistently improves over KG-only, and Transcript + KG yields the best BERTScore recall, MEDCON, and average score. Gray-highlighted, bold values indicate the highest result in each metric column.
Model | Rouge-1 | Rouge-2 | Rouge-L | Rouge-Lsum | BERT-Precision | BERT-Recall | BERT-F1 | BLEURT | MEDCON | Average
BART Large | 0.4190 | 0.1987 | 0.2303 | 0.3456 | 0.6408 | 0.5647 | 0.6000 | 0.4040 | 0.4454 | 0.4265
BART Large SAMSum | 0.4037 | 0.1886 | 0.2241 | 0.3426 | 0.6417 | 0.5607 | 0.5981 | 0.4099 | 0.4432 | 0.4237
BART Large SAMSum (Division) | 0.5208 | 0.2437 | - | 0.4716 | - | - | - | - | 0.4812 | -
BioBART | 0.3900 | 0.1844 | 0.2208 | 0.3340 | 0.6417 | 0.5631 | 0.5995 | 0.3983 | 0.4319 | 0.4153
BioBART (Division) | 0.5080 | 0.2270 | - | 0.4613 | - | - | - | - | 0.4476 | -
LED (Division) | 0.3514 | 0.0857 | 0.1265 | 0.3084 | 0.5301 | 0.5892 | 0.5580 | 0.2521 | 0.3424 | 0.3172
DocLens GPT-4-turbo | 0.4980 | 0.1796 | 0.2621 | 0.4510 | 0.6564 | 0.6739 | 0.6647 | 0.4158 | 0.5639 | 0.4808
DocLens GPT-5.2 | 0.4993 | 0.1914 | 0.2880 | 0.4611 | 0.6300 | 0.6735 | 0.6508 | 0.4083 | 0.5881 | 0.4847
KOSMOS GPT-5.2 (KG only) | 0.4584 | 0.1785 | 0.2445 | 0.4226 | 0.5344 | 0.6258 | 0.5757 | 0.3943 | 0.5898 | 0.4570
KOSMOS GPT-5.2 (Transcript + Nodes) | 0.5190 | 0.2109 | 0.2982 | 0.4803 | 0.6316 | 0.6806 | 0.6550 | 0.4148 | 0.6016 | 0.4974
KOSMOS GPT-5.2 (Transcript + KG) | 0.5237 | 0.2119 | 0.3038 | 0.4825 | 0.6355 | 0.6837 | 0.6586 | 0.4158 | 0.6126 | 0.5014
Table 4. Test Set 3 ACI Benchmark Results. Several baseline rows report only ROUGE-1, ROUGE-2, Rouge-Lsum, and MEDCON, with Rouge-L and all BERTScore, BLEURT, and average values missing, so comparisons on embedding-based and learned metrics are only meaningful for DocLens and the KOSMOS variants. Within the metrics that are fully reported, the KOSMOS GPT-5.2 variants are strongest overall, with Transcript + KG achieving the best Rouge-L and Rouge-Lsum and also leading BERTScore recall, BLEURT, MEDCON, and the overall average. The division-based BART model attains the best Rouge-1 and Rouge-2 among all models, continuing its advantage on n-gram overlap metrics. The OpenAI models remain competitive, with DocLens GPT-4-turbo once again leading BERTScore precision and F1. Adding transcript context improves over KG-only for KOSMOS, and incorporating the full KG yields consistent gains over Transcript + Nodes across most metrics. Gray-highlighted, bold values indicate the highest result in each metric column.
Model | Rouge-1 | Rouge-2 | Rouge-L | Rouge-Lsum | BERT-Precision | BERT-Recall | BERT-F1 | BLEURT | MEDCON | Average
BART Large | 0.4054 | 0.1852 | - | 0.3462 | - | - | - | - | 0.4492 | -
BART Large SAMSum | 0.3938 | 0.1838 | - | 0.3389 | - | - | - | - | 0.4601 | -
BART Large SAMSum (Division) | 0.5277 | 0.2438 | - | 0.4803 | - | - | - | - | 0.4756 | -
BioBART | 0.3832 | 0.1739 | - | 0.3339 | - | - | - | - | 0.4306 | -
BioBART (Division) | 0.5028 | 0.2295 | - | 0.4609 | - | - | - | - | 0.4321 | -
LED (Division) | 0.3471 | 0.0803 | - | 0.3077 | - | - | - | - | 0.3379 | -
DocLens GPT-4-turbo | 0.5020 | 0.1846 | 0.2761 | 0.4625 | 0.6664 | 0.6715 | 0.6685 | 0.4153 | 0.5758 | 0.4863
DocLens GPT-5.2 | 0.5018 | 0.1920 | 0.2906 | 0.4670 | 0.6323 | 0.6735 | 0.6521 | 0.4128 | 0.6058 | 0.4907
KOSMOS GPT-5.2 (KG only) | 0.4630 | 0.1833 | 0.2414 | 0.4308 | 0.5451 | 0.6291 | 0.5836 | 0.3875 | 0.6190 | 0.4643
KOSMOS GPT-5.2 (Transcript + Nodes) | 0.5216 | 0.2084 | 0.3021 | 0.4881 | 0.6347 | 0.6792 | 0.6561 | 0.4159 | 0.6228 | 0.5027
KOSMOS GPT-5.2 (Transcript + KG) | 0.5241 | 0.2146 | 0.3070 | 0.4896 | 0.6386 | 0.6825 | 0.6597 | 0.4192 | 0.6294 | 0.5073
Table 5. Claim recall percentage across three test sets. Recall is computed by extracting a set of reference claims from the gold SOAP notes and reporting the percentage of those reference claims that are also present in the SOAP notes generated by each method, penalizing omissions of important information.
Model | Test 1 | Test 2 | Test 3 | Avg
DocLens GPT-4-turbo | 60.42% | 63.18% | 64.73% | 62.75%
DocLens GPT-5.2 | 80.71% | 82.70% | 81.04% | 81.48%
KOSMOS GPT-5.2 (KG only) | 75.74% | 76.82% | 77.39% | 76.65%
KOSMOS GPT-5.2 (Transcript + Nodes) | 82.94% | 83.07% | 82.25% | 82.75%
KOSMOS GPT-5.2 (Transcript + KG) | 81.66% | 82.62% | 82.90% | 82.39%
Table 6. Claim precision percentage across three test sets. Precision is computed by extracting a set of claims from each generated SOAP note and reporting the percentage of those generated claims that are also present in the corresponding gold SOAP note, penalizing unsupported or unnecessary additions.
Model | Test 1 | Test 2 | Test 3 | Avg
DocLens GPT-4-turbo | 69.07% | 72.25% | 72.27% | 71.18%
DocLens GPT-5.2 | 71.75% | 71.25% | 70.31% | 71.10%
KOSMOS GPT-5.2 (KG only) | 69.32% | 69.30% | 70.14% | 69.59%
KOSMOS GPT-5.2 (Transcript + Nodes) | 70.02% | 73.07% | 71.25% | 71.44%
KOSMOS GPT-5.2 (Transcript + KG) | 71.91% | 72.57% | 71.23% | 71.90%
Table 7. Hallucination rate percentage across three test sets. The hallucination rate is computed by extracting a set of claims from each generated SOAP note and reporting the percentage of those claims that lack supporting evidence in the original transcript.
Model | Test 1 | Test 2 | Test 3 | Avg
DocLens GPT-4-turbo | 5.11% | 4.09% | 3.90% | 4.37%
DocLens GPT-5.2 | 0.59% | 0.90% | 0.26% | 0.58%
KOSMOS GPT-5.2 (KG only) | 2.34% | 2.79% | 2.56% | 2.56%
KOSMOS GPT-5.2 (Transcript + Nodes) | 0.42% | 0.58% | 0.24% | 0.41%
KOSMOS GPT-5.2 (Transcript + KG) | 0.44% | 0.67% | 0.42% | 0.51%
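The three claim metrics in Tables 5–7 reduce to simple set arithmetic once claims have been matched. The sketch below assumes claims are already normalized to comparable strings; in the actual DocLens protocol, claim extraction and matching are performed by an LLM judge, not exact string comparison:

```python
def claim_metrics(gold: set, generated: set, transcript_supported: set):
    """Set-based sketch of the claim metrics.

    recall: fraction of gold claims present in the generated note (Table 5)
    precision: fraction of generated claims present in the gold note (Table 6)
    hallucination: fraction of generated claims with no transcript support (Table 7)
    """
    recall = len(gold & generated) / len(gold) if gold else 0.0
    precision = len(gold & generated) / len(generated) if generated else 0.0
    hallucination = (len(generated - transcript_supported) / len(generated)
                     if generated else 0.0)
    return recall, precision, hallucination

# Hypothetical claim sets for one encounter.
gold = {"cough for two weeks", "takes metformin", "ordered A1C"}
generated = {"cough for two weeks", "takes metformin", "denies fever"}
supported = {"cough for two weeks", "takes metformin"}
recall, precision, hallucination = claim_metrics(gold, generated, supported)
```

Note that precision penalizes claims absent from the gold note even when they are transcript-supported, while the hallucination rate penalizes only claims the transcript cannot justify; the two can therefore diverge substantially, as Tables 6 and 7 show.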
Table 8. Recall significance statistics using the paired Wilcoxon signed-rank test, computed by pooling Test Sets 1, 2, and 3 for a total of 120 transcript-note pairs. Each method is compared against DocLens GPT-5.2 using aligned per-note scores. Δ represents the mean difference in percentage points (method minus DocLens GPT-5.2) with a 95% paired bootstrap confidence interval. W is the test statistic. The p-value is reported as a percentage, indicating the probability of observing the measured difference under the null hypothesis of no true difference between methods.
Method | Δ Mean 95% CI | W | p Value
DocLens GPT-4-turbo | [−21.26, −16.18]% | 83 | 3.73 × 10⁻¹⁷%
KOSMOS GPT-5.2 (KG only) | [−6.62, −3.10]% | 1255 | 1.04 × 10⁻⁴%
KOSMOS GPT-5.2 (Transcript + Nodes) | [−0.16, 2.75]% | 1792 | 13.16%
KOSMOS GPT-5.2 (Transcript + KG) | [−0.51, 2.33]% | 1780 | 28.17%
Table 9. Precision significance statistics using the paired Wilcoxon signed-rank test, computed by pooling Test Sets 1, 2, and 3 for a total of 120 transcript-note pairs. Each method is compared against DocLens GPT-5.2 using aligned per-note scores. Δ represents the mean difference in percentage points (method minus DocLens GPT-5.2) with a 95% paired bootstrap confidence interval. W is the test statistic. The p-value is reported as a percentage, indicating the probability of observing the measured difference under the null hypothesis of no true difference between methods.
Method | Δ Mean 95% CI | W | p Value
DocLens GPT-4-turbo | [−2.28, 2.44]% | 3612.5 | 96.34%
KOSMOS GPT-5.2 (KG only) | [−3.40, 0.37]% | 3016.5 | 10.81%
KOSMOS GPT-5.2 (Transcript + Nodes) | [−1.30, 1.97]% | 3049.5 | 51.92%
KOSMOS GPT-5.2 (Transcript + KG) | [−0.79, 2.31]% | 3171 | 44.55%
Table 10. Hallucination rate significance statistics using the paired Wilcoxon signed-rank test, computed by pooling Test Sets 1, 2, and 3 for a total of 120 transcript-note pairs. Each method is compared against DocLens GPT-5.2 using aligned per-note scores. Δ represents the mean difference in percentage points (method minus DocLens GPT-5.2) with a 95% paired bootstrap confidence interval. W is the test statistic. The p-value is reported as a percentage, indicating the probability of observing the measured difference under the null hypothesis of no true difference between methods.
Method | Δ Mean 95% CI | W | p Value
DocLens GPT-4-turbo | [2.82, 4.81]% | 191.5 | 2.89 × 10⁻⁹%
KOSMOS GPT-5.2 (KG only) | [1.32, 2.64]% | 218.5 | 2.38 × 10⁻⁶%
KOSMOS GPT-5.2 (Transcript + Nodes) | [−0.48, 0.12]% | 153.5 | 39.36%
KOSMOS GPT-5.2 (Transcript + KG) | [−0.43, 0.28]% | 197.5 | 90.03%
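The significance procedure described in the captions of Tables 8–10 (paired Wilcoxon signed-rank test plus a paired bootstrap CI on the mean per-note difference) can be sketched as follows. This is an assumed reimplementation, not the paper's code: the resampling count, seed, and example scores are placeholders, and the tables report p as a percentage while `scipy` returns a proportion:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

def paired_comparison(method_scores, baseline_scores, n_boot=2000):
    """Paired Wilcoxon signed-rank test and a 95% percentile bootstrap CI on the
    mean per-note difference (method minus baseline)."""
    diffs = np.asarray(method_scores) - np.asarray(baseline_scores)
    res = wilcoxon(diffs)  # two-sided by default
    boot_means = [rng.choice(diffs, size=diffs.size, replace=True).mean()
                  for _ in range(n_boot)]
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return res.statistic, res.pvalue, (lo, hi)

# Hypothetical per-note recall scores for two systems on the same encounters.
method = [0.82, 0.85, 0.80, 0.88, 0.79, 0.84, 0.81, 0.86]
baseline = [0.80, 0.81, 0.79, 0.83, 0.80, 0.82, 0.80, 0.84]
stat, p, ci = paired_comparison(method, baseline)
```

Pairing by encounter is what makes the test appropriate here: each transcript is summarized by every system, so differencing per note removes encounter-level difficulty before testing.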
Table 11. Citation accuracy across the three ACI-BENCH test sets. Values report the percentage of generated SOAP sentences whose cited transcript turns fully support the clinical content of the sentence, meaning the cited evidence is sufficient to justify the statement. Avg is the mean of Test 1, Test 2, and Test 3.
Method | Test 1 | Test 2 | Test 3 | Avg
DocLens GPT-4-turbo | 70.62% | 62.26% | 60.54% | 64.33%
DocLens GPT-5.2 | 90.42% | 88.65% | 86.74% | 88.59%
KOSMOS GPT-5.2 (KG only) | 68.60% | 72.40% | 58.30% | 66.16%
KOSMOS GPT-5.2 (Transcript + Nodes) | 90.31% | 92.60% | 92.36% | 91.75%
KOSMOS GPT-5.2 (Transcript + KG) | 90.68% | 91.78% | 92.17% | 91.54%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Henry, R.; Gong, J. KOSMOS: Ontology-Based Knowledge Graph Scaffolding for Medical Documentation Generation. Information 2026, 17, 355. https://doi.org/10.3390/info17040355
