Figure 1.
Overview of the Knowledge graph Ontology Supported Medical Output System (KOSMOS) process. A raw transcript is first pre-processed to add turn labels and clarify the language. The system then identifies clinically relevant mentions and assigns each a type (e.g., problem, activity, medication, lab test, measurement). Mentions are then consolidated into canonical concepts that aggregate evidence across turns. Finally, the resulting concepts are converted into nodes and connected into a knowledge graph using typed, directed relationships, representing the encounter as an explicit set of entities and relations. Square brackets “[]” indicate the transcript turn a turn or mention comes from; curly braces “{}” indicate the set of transcript turns a concept is derived from.
Figure 2.
Example encounter knowledge graph representing a simple doctor–patient visit. Nodes depict typed entities (patient, clinician, condition, symptom, lab test, medication, procedure, and activity) and directed edges encode typed relationships (e.g., evaluated_by, diagnosed, ordered_test, has_medication) that connect the patient to clinically relevant facts.
Figure 3.
KG construction pipeline used in KOSMOS. Each box shows a pipeline stage, with the stage name at the top and the artifact produced or updated at the bottom. The process transforms a raw transcript into a structured KG by segmenting and normalizing turns, rewriting pronouns for clarity, extracting and typing mentions, grouping mentions into candidate entities, grounding candidates to ontology concepts, constructing typed KG nodes with attributes, proposing relationship pair candidates, and selecting typed relationships to form the final graph.
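The typed nodes and typed, directed relationships produced by the pipeline can be captured in a very small data model. The sketch below is illustrative only: the class and field names are ours, not KOSMOS's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical minimal data model for an encounter KG; names are
# illustrative, not the actual KOSMOS schema.
@dataclass
class KGNode:
    node_id: str
    node_type: str                 # e.g., "condition", "medication"
    label: str                     # canonical concept name
    source_turns: set = field(default_factory=set)  # transcript turns cited

@dataclass
class KGEdge:
    source: str                    # node_id of the head entity
    relation: str                  # e.g., "has_medication", "ordered_test"
    target: str                    # node_id of the tail entity

# A graph is then just typed nodes plus typed, directed edges.
patient = KGNode("n1", "patient", "Patient", {0})
drug = KGNode("n2", "medication", "lisinopril", {4, 7})
edge = KGEdge("n1", "has_medication", "n2")
```

Keeping the originating turn indices on each node is what lets downstream note generation cite its evidence back to the transcript.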
Figure 4.
Sample knowledge graph constructed from an ACI-BENCH encounter transcript. Nodes represent concepts extracted from the dialogue, and directed edges denote schema-constrained clinical relations predicted over candidate node pairs.
Figure 5.
Average ROUGE scores across three ACI-BENCH test sets. Each value represents the mean of ROUGE-1, ROUGE-2, and ROUGE-Lsum, which measure n-gram overlap and sequence-level similarity between generated and reference notes. BART Large SAMSum (Division) achieves the highest overall ROUGE scores across all test sets. Our structured-context variants—KOSMOS GPT-5.2 (Transcript + Nodes) and KOSMOS GPT-5.2 (Transcript + KG)—consistently outperform the transcript-only DocLens baselines, demonstrating that incorporating extracted nodes or KGs improves summary fidelity.
Figure 6.
Average ACI-BENCH aggregate scores across test sets. Each point summarizes overall note quality by first averaging ROUGE-1, ROUGE-2, and ROUGE-Lsum, then averaging that ROUGE mean with BERTScore F1, BLEURT, and MEDCON. Not all model series contain values for every test set because ACI-BENCH did not report the full metric bundle for every baseline between their paper and the released repository, so missing points are left blank rather than imputed. Across the reported test sets, the strongest results are achieved by KOSMOS GPT-5.2 (Transcript + KG) and KOSMOS GPT-5.2 (Transcript + Nodes), which form the top tier and sit above the DocLens GPT-5.2 and DocLens GPT-4-turbo baselines. DocLens models are superior to the non-LLM baselines, while KOSMOS (KG only) improves over several baselines but trails the transcript-conditioned DocLens and KOSMOS variants.
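The aggregate described in the Figure 6 caption is plain arithmetic; a minimal sketch of the two-stage averaging (the function name is ours):

```python
def aggregate_score(rouge1, rouge2, rouge_lsum, bert_f1, bleurt, medcon):
    """Two-stage average: mean the three ROUGE variants first, then
    average that ROUGE mean with BERTScore F1, BLEURT, and MEDCON."""
    rouge_mean = (rouge1 + rouge2 + rouge_lsum) / 3
    return (rouge_mean + bert_f1 + bleurt + medcon) / 4
```

Averaging ROUGE first keeps the three n-gram metrics from outweighing the single embedding-based and learned metrics in the final mean.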
Figure 7.
DocLens-style claim evaluation averages across models. Recall is the percentage of claims in the gold (reference) SOAP note that also appear in the generated SOAP note. Precision is the percentage of claims in the generated SOAP note that also appear in the gold SOAP note. Grounded Rate* is the percentage of generated claims that can be fully justified by the transcript without unsupported additions. KOSMOS GPT-5.2 (Transcript + Nodes) and KOSMOS GPT-5.2 (Transcript + KG) are nearly identical across all three metrics. Both are very close to DocLens GPT-5.2 in precision and grounding rate, while showing a slight edge in recall, suggesting improved coverage of gold claims without sacrificing support.
Table 1.
Encounter KG size statistics across the three ACI-BENCH test sets.
| Statistic | Min | Mean | Max |
|---|---|---|---|
| Number of Nodes | 25 | 55.21 | 98 |
| Number of Relationships | 29 | 70.33 | 133 |
Table 2.
Test Set 1 ACI Benchmark Results. The section-wise BART Large SAMSum (Division) baseline achieves the strongest ROUGE-1, ROUGE-2, and ROUGE-Lsum scores and also leads BERTScore precision and F1, reflecting the highest lexical overlap among baselines. The GPT-based systems are most competitive overall, particularly on MEDCON. Among the KOSMOS variants, adding transcript context improves over the KG-only setting, and the Transcript + KG configuration achieves the best ROUGE-L, BERTScore recall, BLEURT, MEDCON, and Average score. Gray-highlighted, bold values indicate the highest result in each metric column.
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum | BERT-Precision | BERT-Recall | BERT-F1 | BLEURT | MEDCON | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| BART Large | 0.4176 | 0.1920 | 0.2370 | 0.3470 | 0.6399 | 0.5707 | 0.6029 | 0.4105 | 0.4373 | 0.4255 |
| BART Large SAMSum | 0.4087 | 0.1896 | 0.2302 | 0.3460 | 0.6432 | 0.5692 | 0.6034 | 0.4177 | 0.4207 | 0.4220 |
| BART Large SAMSum (Division) | 0.5346 | 0.2508 | 0.2963 | 0.4862 | 0.6675 | 0.6828 | 0.6746 | 0.3852 | 0.4884 | 0.4754 |
| BioBART | 0.3909 | 0.1724 | 0.2151 | 0.3319 | 0.6407 | 0.5694 | 0.6025 | 0.3844 | 0.4285 | 0.4089 |
| BioBART (Division) | 0.4953 | 0.2247 | 0.2726 | 0.4492 | 0.6582 | 0.6704 | 0.6636 | 0.3573 | 0.4333 | 0.4411 |
| LED (Division) | 0.3046 | 0.0693 | 0.1121 | 0.2666 | 0.5251 | 0.5847 | 0.5530 | 0.1859 | 0.3234 | 0.2773 |
| DocLens GPT-4-turbo | 0.4915 | 0.1830 | 0.2731 | 0.4459 | 0.6453 | 0.6721 | 0.6579 | 0.4143 | 0.5766 | 0.4816 |
| DocLens GPT-5.2 | 0.4970 | 0.1864 | 0.2908 | 0.4598 | 0.6277 | 0.6775 | 0.6513 | 0.4129 | 0.6003 | 0.4873 |
| KOSMOS GPT-5.2 (KG only) | 0.4594 | 0.1790 | 0.2539 | 0.4235 | 0.5313 | 0.6313 | 0.5762 | 0.3998 | 0.5994 | 0.4608 |
| KOSMOS GPT-5.2 (Transcript + Nodes) | 0.5217 | 0.2051 | 0.3036 | 0.4816 | 0.6280 | 0.6836 | 0.6542 | 0.4204 | 0.6055 | 0.4989 |
| KOSMOS GPT-5.2 (Transcript + KG) | 0.5251 | 0.2117 | 0.3085 | 0.4852 | 0.6297 | 0.6847 | 0.6558 | 0.4243 | 0.6180 | 0.5049 |
Table 3.
Test Set 2 ACI Benchmark Results. The KOSMOS GPT-5.2 variants outperform the baseline summarization models on most ROUGE metrics, with Transcript + KG achieving the best ROUGE-1, ROUGE-L, and ROUGE-Lsum, while the division-based BART model attains the best ROUGE-2. The OpenAI models remain strong, with DocLens GPT-4-turbo leading BERTScore precision and F1 and tying for the top BLEURT score, while DocLens GPT-5.2 improves MEDCON over GPT-4-turbo. Among the KOSMOS variants, adding transcript context consistently improves over KG-only, and Transcript + KG yields the best BERTScore recall, MEDCON, and average score. Gray-highlighted, bold values indicate the highest result in each metric column.
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum | BERT-Precision | BERT-Recall | BERT-F1 | BLEURT | MEDCON | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| BART Large | 0.4190 | 0.1987 | 0.2303 | 0.3456 | 0.6408 | 0.5647 | 0.6000 | 0.4040 | 0.4454 | 0.4265 |
| BART Large SAMSum | 0.4037 | 0.1886 | 0.2241 | 0.3426 | 0.6417 | 0.5607 | 0.5981 | 0.4099 | 0.4432 | 0.4237 |
| BART Large SAMSum (Division) | 0.5208 | 0.2437 | - | 0.4716 | - | - | - | - | 0.4812 | - |
| BioBART | 0.3900 | 0.1844 | 0.2208 | 0.3340 | 0.6417 | 0.5631 | 0.5995 | 0.3983 | 0.4319 | 0.4153 |
| BioBART (Division) | 0.5080 | 0.2270 | - | 0.4613 | - | - | - | - | 0.4476 | - |
| LED (Division) | 0.3514 | 0.0857 | 0.1265 | 0.3084 | 0.5301 | 0.5892 | 0.5580 | 0.2521 | 0.3424 | 0.3172 |
| DocLens GPT-4-turbo | 0.4980 | 0.1796 | 0.2621 | 0.4510 | 0.6564 | 0.6739 | 0.6647 | 0.4158 | 0.5639 | 0.4808 |
| DocLens GPT-5.2 | 0.4993 | 0.1914 | 0.2880 | 0.4611 | 0.6300 | 0.6735 | 0.6508 | 0.4083 | 0.5881 | 0.4847 |
| KOSMOS GPT-5.2 (KG only) | 0.4584 | 0.1785 | 0.2445 | 0.4226 | 0.5344 | 0.6258 | 0.5757 | 0.3943 | 0.5898 | 0.4570 |
| KOSMOS GPT-5.2 (Transcript + Nodes) | 0.5190 | 0.2109 | 0.2982 | 0.4803 | 0.6316 | 0.6806 | 0.6550 | 0.4148 | 0.6016 | 0.4974 |
| KOSMOS GPT-5.2 (Transcript + KG) | 0.5237 | 0.2119 | 0.3038 | 0.4825 | 0.6355 | 0.6837 | 0.6586 | 0.4158 | 0.6126 | 0.5014 |
Table 4.
Test Set 3 ACI Benchmark Results. Several baseline rows report only ROUGE-1, ROUGE-2, ROUGE-Lsum, and MEDCON, with ROUGE-L and all BERTScore, BLEURT, and average values missing, so comparisons on embedding-based and learned metrics are only meaningful for DocLens and the KOSMOS variants. Within the metrics that are fully reported, the KOSMOS GPT-5.2 variants are strongest overall, with Transcript + KG achieving the best ROUGE-L and ROUGE-Lsum and also leading BERTScore recall, BLEURT, MEDCON, and the overall average. The division-based BART model attains the best ROUGE-1 and ROUGE-2 among all models, continuing its advantage on n-gram overlap metrics. The OpenAI models remain competitive, with DocLens GPT-4-turbo once again leading BERTScore precision and F1. Adding transcript context improves over KG-only for KOSMOS, and incorporating the full KG yields consistent gains over Transcript + Nodes across most metrics. Gray-highlighted, bold values indicate the highest result in each metric column.
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum | BERT-Precision | BERT-Recall | BERT-F1 | BLEURT | MEDCON | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| BART Large | 0.4054 | 0.1852 | - | 0.3462 | - | - | - | - | 0.4492 | - |
| BART Large SAMSum | 0.3938 | 0.1838 | - | 0.3389 | - | - | - | - | 0.4601 | - |
| BART Large SAMSum (Division) | 0.5277 | 0.2438 | - | 0.4803 | - | - | - | - | 0.4756 | - |
| BioBART | 0.3832 | 0.1739 | - | 0.3339 | - | - | - | - | 0.4306 | - |
| BioBART (Division) | 0.5028 | 0.2295 | - | 0.4609 | - | - | - | - | 0.4321 | - |
| LED (Division) | 0.3471 | 0.0803 | - | 0.3077 | - | - | - | - | 0.3379 | - |
| DocLens GPT-4-turbo | 0.5020 | 0.1846 | 0.2761 | 0.4625 | 0.6664 | 0.6715 | 0.6685 | 0.4153 | 0.5758 | 0.4863 |
| DocLens GPT-5.2 | 0.5018 | 0.1920 | 0.2906 | 0.4670 | 0.6323 | 0.6735 | 0.6521 | 0.4128 | 0.6058 | 0.4907 |
| KOSMOS GPT-5.2 (KG only) | 0.4630 | 0.1833 | 0.2414 | 0.4308 | 0.5451 | 0.6291 | 0.5836 | 0.3875 | 0.6190 | 0.4643 |
| KOSMOS GPT-5.2 (Transcript + Nodes) | 0.5216 | 0.2084 | 0.3021 | 0.4881 | 0.6347 | 0.6792 | 0.6561 | 0.4159 | 0.6228 | 0.5027 |
| KOSMOS GPT-5.2 (Transcript + KG) | 0.5241 | 0.2146 | 0.3070 | 0.4896 | 0.6386 | 0.6825 | 0.6597 | 0.4192 | 0.6294 | 0.5073 |
Table 5.
Claim recall percentage across three test sets. Recall is computed by extracting a set of reference claims from the gold SOAP notes and reporting the percentage of those reference claims that are also present in the SOAP notes generated by each method, penalizing omissions of important information.
| Model | Test 1 | Test 2 | Test 3 | Avg |
|---|---|---|---|---|
| DocLens GPT-4-turbo | 60.42% | 63.18% | 64.73% | 62.75% |
| DocLens GPT-5.2 | 80.71% | 82.70% | 81.04% | 81.48% |
| KOSMOS GPT-5.2 (KG only) | 75.74% | 76.82% | 77.39% | 76.65% |
| KOSMOS GPT-5.2 (Transcript + Nodes) | 82.94% | 83.07% | 82.25% | 82.75% |
| KOSMOS GPT-5.2 (Transcript + KG) | 81.66% | 82.62% | 82.90% | 82.39% |
Table 6.
Claim precision percentage across three test sets. Precision is computed by extracting a set of claims from each generated SOAP note and reporting the percentage of those generated claims that are also present in the corresponding gold SOAP note, penalizing unsupported or unnecessary additions.
| Model | Test 1 | Test 2 | Test 3 | Avg |
|---|---|---|---|---|
| DocLens GPT-4-turbo | 69.07% | 72.25% | 72.27% | 71.18% |
| DocLens GPT-5.2 | 71.75% | 71.25% | 70.31% | 71.10% |
| KOSMOS GPT-5.2 (KG only) | 69.32% | 69.30% | 70.14% | 69.59% |
| KOSMOS GPT-5.2 (Transcript + Nodes) | 70.02% | 73.07% | 71.25% | 71.44% |
| KOSMOS GPT-5.2 (Transcript + KG) | 71.91% | 72.57% | 71.23% | 71.90% |
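Once claims have been extracted and matched, the recall and precision definitions in Tables 5 and 6 reduce to set comparisons. The sketch below uses exact set intersection as a simplifying stand-in; in the DocLens-style evaluation the claim matching itself is model-based rather than literal string equality, and the example claims are invented.

```python
def claim_recall(gold_claims, generated_claims):
    """Percentage of gold-note claims also present in the generated note."""
    if not gold_claims:
        return 0.0
    return 100.0 * len(gold_claims & generated_claims) / len(gold_claims)

def claim_precision(gold_claims, generated_claims):
    """Percentage of generated-note claims also present in the gold note."""
    if not generated_claims:
        return 0.0
    return 100.0 * len(gold_claims & generated_claims) / len(generated_claims)

# Hypothetical matched-claim sets for one note pair.
gold = {"bp elevated", "started lisinopril", "follow up in 2 weeks"}
generated = {"bp elevated", "started lisinopril", "ordered a1c"}
# Recall: 2 of 3 gold claims covered; precision: 2 of 3 generated supported.
```

Recall penalizes omissions of gold information, while precision penalizes unsupported or unnecessary additions; the hallucination rate in Table 7 applies the precision-style check against the transcript instead of the gold note.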
Table 7.
Hallucination rate percentage across three test sets. The hallucination rate is computed by extracting a set of claims from each generated SOAP note and reporting the percentage of those claims that lack supporting evidence in the original transcript.
| Model | Test 1 | Test 2 | Test 3 | Avg |
|---|---|---|---|---|
| DocLens GPT-4-turbo | 5.11% | 4.09% | 3.90% | 4.37% |
| DocLens GPT-5.2 | 0.59% | 0.90% | 0.26% | 0.58% |
| KOSMOS GPT-5.2 (KG only) | 2.34% | 2.79% | 2.56% | 2.56% |
| KOSMOS GPT-5.2 (Transcript + Nodes) | 0.42% | 0.58% | 0.24% | 0.41% |
| KOSMOS GPT-5.2 (Transcript + KG) | 0.44% | 0.67% | 0.42% | 0.51% |
Table 8.
Recall significance statistics using the paired Wilcoxon signed-rank test, computed by pooling Test Sets 1, 2, and 3 for a total of 120 transcript-note pairs. Each method is compared against DocLens GPT-5.2 using aligned per-note scores. The mean-difference column reports a 95% paired bootstrap confidence interval on the mean difference in percentage points (method minus DocLens GPT-5.2). W is the test statistic. The p-value is reported as a percentage and gives the probability of observing a difference at least as extreme under the null hypothesis of no true difference between methods.
| Method | Mean Difference 95% CI | W | p Value |
|---|---|---|---|
| DocLens GPT-4-turbo | [−21.26, −16.18]% | 83 | 3.73 × 10^−17% |
| KOSMOS GPT-5.2 (KG only) | [−6.62, −3.10]% | 1255 | 1.04 × 10^−4% |
| KOSMOS GPT-5.2 (Transcript + Nodes) | [−0.16, 2.75]% | 1792 | 13.16% |
| KOSMOS GPT-5.2 (Transcript + KG) | [−0.51, 2.33]% | 1780 | 28.17% |
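The procedure behind Tables 8-10 can be sketched directly with SciPy. This is a minimal illustration under assumed inputs: the function name is ours, and the per-note score arrays in the usage are synthetic, not the paper's data.

```python
import numpy as np
from scipy.stats import wilcoxon

def paired_significance(method_scores, baseline_scores, n_boot=2000, seed=0):
    """Paired Wilcoxon signed-rank test on per-note score differences
    (method minus baseline), plus a percentile bootstrap CI on the mean
    difference, mirroring the setup described for Tables 8-10."""
    diffs = np.asarray(method_scores, float) - np.asarray(baseline_scores, float)
    w_stat, p_value = wilcoxon(diffs)  # two-sided by default
    rng = np.random.default_rng(seed)
    boot = [rng.choice(diffs, size=diffs.size, replace=True).mean()
            for _ in range(n_boot)]
    ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
    return (ci_low, ci_high), w_stat, p_value
```

Because the test is paired, each note serves as its own control, which is why a small but consistent per-note difference can still reach significance.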
Table 9.
Precision significance statistics using the paired Wilcoxon signed-rank test, computed by pooling Test Sets 1, 2, and 3 for a total of 120 transcript-note pairs. Each method is compared against DocLens GPT-5.2 using aligned per-note scores. The mean-difference column reports a 95% paired bootstrap confidence interval on the mean difference in percentage points (method minus DocLens GPT-5.2). W is the test statistic. The p-value is reported as a percentage and gives the probability of observing a difference at least as extreme under the null hypothesis of no true difference between methods.
| Method | Mean Difference 95% CI | W | p Value |
|---|---|---|---|
| DocLens GPT-4-turbo | [−2.28, 2.44]% | 3612.5 | 96.34% |
| KOSMOS GPT-5.2 (KG only) | [−3.40, 0.37]% | 3016.5 | 10.81% |
| KOSMOS GPT-5.2 (Transcript + Nodes) | [−1.30, 1.97]% | 3049.5 | 51.92% |
| KOSMOS GPT-5.2 (Transcript + KG) | [−0.79, 2.31]% | 3171 | 44.55% |
Table 10.
Hallucination rate significance statistics using the paired Wilcoxon signed-rank test, computed by pooling Test Sets 1, 2, and 3 for a total of 120 transcript-note pairs. Each method is compared against DocLens GPT-5.2 using aligned per-note scores. The mean-difference column reports a 95% paired bootstrap confidence interval on the mean difference in percentage points (method minus DocLens GPT-5.2). W is the test statistic. The p-value is reported as a percentage and gives the probability of observing a difference at least as extreme under the null hypothesis of no true difference between methods.
| Method | Mean Difference 95% CI | W | p Value |
|---|---|---|---|
| DocLens GPT-4-turbo | [2.82, 4.81]% | 191.5 | 2.89 × 10^−9% |
| KOSMOS GPT-5.2 (KG only) | [1.32, 2.64]% | 218.5 | 2.38 × 10^−6% |
| KOSMOS GPT-5.2 (Transcript + Nodes) | [−0.48, 0.12]% | 153.5 | 39.36% |
| KOSMOS GPT-5.2 (Transcript + KG) | [−0.43, 0.28]% | 197.5 | 90.03% |
Table 11.
Citation accuracy across the three ACI-BENCH test sets. Values report the percentage of generated SOAP sentences whose cited transcript turns fully support the clinical content of the sentence, meaning the cited evidence is sufficient to justify the statement. Avg is the mean of Test 1, Test 2, and Test 3.
| Method | Test 1 | Test 2 | Test 3 | Avg |
|---|---|---|---|---|
| DocLens GPT-4-turbo | 70.62% | 62.26% | 60.54% | 64.33% |
| DocLens GPT-5.2 | 90.42% | 88.65% | 86.74% | 88.59% |
| KOSMOS GPT-5.2 (KG only) | 68.60% | 72.40% | 58.30% | 66.16% |
| KOSMOS GPT-5.2 (Transcript + Nodes) | 90.31% | 92.60% | 92.36% | 91.75% |
| KOSMOS GPT-5.2 (Transcript + KG) | 90.68% | 91.78% | 92.17% | 91.54% |