1. Introduction
Surgical pathology reports contain the most granular descriptions of cancer diagnosis, staging, margin status, lymph node involvement, and biomarker findings, making them indispensable for cancer surveillance, quality assessment, and secondary clinical research [1,2,3,4,5,6,7,8,9,10,11,12,13]. Yet, these data are still documented predominantly as narrative free text. This creates a translational gap between what pathologists report and what registries, analytics pipelines, and downstream clinical systems can computationally reuse. As a result, high-value pathological detail often must be manually re-entered, simplified, or discarded before it becomes available for structured surveillance.
The bottleneck is therefore not only extraction accuracy but also representation. A durable abstraction system must preserve organ-specific semantics, nested relationships, and variable-length structures such as lymph node groups, specimen margins, and biomarker panels. Flat field lists or ad hoc prompt outputs are insufficient because they frequently collapse clinically meaningful context, making the results harder to validate, compare across institutions, and integrate into longitudinal data infrastructures.
Recent large language model (LLM) studies have shown encouraging performance for oncology information extraction, but most have focused on narrow tasks, limited variable sets, or model-specific demonstrations [14,15,16,17,18,19,20,21,22,23,24,25,26,27,28]. These advances are important, yet model capabilities evolve rapidly, and implementation details can age quickly. In contrast, the clinical ontology that governs what should be abstracted, how values are typed, and how repeating structures are organized is the more durable scientific contribution. A schema-first design can therefore outlast any single generation of models while improving reproducibility and interoperability.
The CAP protocols [29] provide an appropriate foundation for such a design because they define stable clinically governed data elements for cancer reporting. Translating these standards into strictly typed hierarchical schemas makes it possible to explicitly encode registry logic, programmatically constrain outputs, and separate clinical representation from the inference engine itself. This separation is especially important for privacy-preserving deployment, because institutions can retain on-premises control of protected health information while updating local models over time without rewriting the clinical abstraction layer.
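To make the idea of a strictly typed hierarchical schema concrete, the following minimal Python sketch uses only the standard library. The class names, field names, and category values are hypothetical placeholders and are not the Digital Registrar schema or the CAP data elements themselves. The sketch only illustrates the pattern described above: scalar fields with constrained values, variable-length lists of typed records, and a registry-style consistency rule encoded in the schema rather than in the model.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class MarginName(str, Enum):
    # Hypothetical margin categories; a real schema would follow the CAP protocol.
    PROXIMAL = "proximal"
    DISTAL = "distal"
    RADIAL = "radial"


@dataclass
class Margin:
    name: MarginName
    involved: bool
    distance_mm: Optional[float] = None  # None when the report is silent


@dataclass
class LymphNodeGroup:
    station: str
    examined: int
    involved: int

    def __post_init__(self) -> None:
        # Consistency rule enforced by the schema, independent of any model.
        if not 0 <= self.involved <= self.examined:
            raise ValueError("involved nodes must lie between 0 and examined")


@dataclass
class ColorectalRecord:
    histologic_grade: Optional[str] = None
    margins: List[Margin] = field(default_factory=list)
    lymph_node_groups: List[LymphNodeGroup] = field(default_factory=list)

    def any_margin_involved(self) -> bool:
        return any(m.involved for m in self.margins)

    def total_positive_nodes(self) -> int:
        return sum(g.involved for g in self.lymph_node_groups)


# Illustrative values only; they do not come from any study report.
record = ColorectalRecord(
    histologic_grade="G2",
    margins=[Margin(MarginName.PROXIMAL, False, 45.0), Margin(MarginName.RADIAL, True)],
    lymph_node_groups=[LymphNodeGroup("pericolic", 12, 2), LymphNodeGroup("mesenteric", 5, 0)],
)
```

Because the nested lists are ordinary typed records, report-level endpoints such as any-margin positivity can be derived from the structure rather than extracted as separate free-text fields.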
Accordingly, we present Digital Registrar, a schema-first framework for privacy-preserving pathology abstraction using local LLMs. Rather than treating structured extraction solely as a model-performance problem, the framework implements CAP-aligned clinical ontologies across ten major cancer types and 192 per-organ scalar field cells (60 unique field names when same-name fields shared across organs are de-duplicated). The schema includes eight clinically acted-on biomarker sub-fields (breast ER/PR/HER2/Ki-67 and colorectal MLH1/MSH2/MSH6/PMS2) and two variable-length nested structures (regional lymph-node groups and surgical margin enumerations), all within a model-agnostic DSPy pipeline executed on a single 48 GB GPU. The full schema composition is summarized in Section 2.2, with the per-organ enumeration in Supplementary Tables S1–S10. We evaluate the framework on 893 internal pathology reports under a thirty-run multi-seed protocol and on 242 external TCGA reports under a twenty-two-run protocol, demonstrating registry-grade accuracy while establishing a portable foundation for automated cancer surveillance, institutional data reuse, and future multimodal clinical systems [30,31,32,33,34].
3. Results
3.1. Assessment of Model Agnosticism and Computational Feasibility
The proposed framework is architecturally model-agnostic: the DSPy 3.2.1 layer compiles each dspy.Signature declaration into a JSON schema, which the inference backend converts to a grammar-constrained sampling mask. The extraction logic—per-organ module enumeration, cascade gating, and schema typing—is independent of the model weights; an LLM enters the pipeline as an interchangeable inference engine. To validate this architectural property in practice, we screened three open-weight LLMs end to end on the 893-report CMUH cohort in a single-run pilot model-selection study against the v1 working gold standard (i.e., prior to the committee-resolved gold standard adjudication step described in
Section 2.1.2 and
Section 3.7). All three models executed an identical pipeline structure on the same dedicated NVIDIA RTX A6000 Ada Generation GPU (48 GB VRAM).
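The separation between extraction logic and model weights can be sketched as follows. This is a minimal illustration under stated assumptions, not the DSPy API or the authors' implementation: the `Engine` callable signature, the schema dictionaries, the mapping of Stage A/B/C to eligibility, organ routing, and extraction, and the keyword-based `stub_engine` are all hypothetical. The point is that the cascade and the schema validation never inspect which model produced the output.

```python
from typing import Callable, Dict, Optional

# An "engine" is any callable mapping (report text, output schema) to a dict.
# This signature is an assumption made for the sketch.
Engine = Callable[[str, Dict[str, type]], Dict[str, object]]

ELIGIBILITY_SCHEMA: Dict[str, type] = {"eligible": bool}
ORGAN_SCHEMA: Dict[str, type] = {"organ": str}
# Hypothetical per-organ extraction schemas (field name -> expected type).
ORGAN_SCHEMAS: Dict[str, Dict[str, type]] = {
    "breast": {"histologic_grade": str, "tumor_size_mm": float},
    "prostate": {"grade_group": int},
}


def validate(output: Dict[str, object], schema: Dict[str, type]) -> Dict[str, object]:
    """Reject outputs whose keys or value types do not match the schema."""
    if set(output) != set(schema):
        raise ValueError(f"unexpected keys: {sorted(output)}")
    for key, expected in schema.items():
        if not isinstance(output[key], expected):
            raise TypeError(f"{key} should be {expected.__name__}")
    return output


def run_cascade(report: str, engine: Engine) -> Optional[Dict[str, object]]:
    """Stage A: eligibility gate; Stage B: organ routing; Stage C: extraction."""
    stage_a = validate(engine(report, ELIGIBILITY_SCHEMA), ELIGIBILITY_SCHEMA)
    if not stage_a["eligible"]:
        return None
    organ = validate(engine(report, ORGAN_SCHEMA), ORGAN_SCHEMA)["organ"]
    if organ not in ORGAN_SCHEMAS:
        return {"organ": "others"}
    fields = validate(engine(report, ORGAN_SCHEMAS[organ]), ORGAN_SCHEMAS[organ])
    return {"organ": organ, **fields}


def stub_engine(report: str, schema: Dict[str, type]) -> Dict[str, object]:
    """Keyword-based stand-in for an LLM backend, used only to exercise the cascade."""
    text = report.lower()
    canned = {
        "eligible": "carcinoma" in text,
        "organ": "breast" if "breast" in text else "unknown",
        "histologic_grade": "G2",
        "tumor_size_mm": 18.0,
        "grade_group": 2,
    }
    return {key: canned[key] for key in schema}
```

Swapping models in this sketch means passing a different `engine` callable to `run_cascade`; the gating order and the schemas stay untouched.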
As shown in Table 3, gpt-oss-20b offered the best speed–accuracy balance and was selected as the production inference engine: it achieved the highest pilot exact-match accuracy and ran 2–3× faster per report than the alternatives. All headline accuracies reported in Sections 3.3–3.10 of this paper use gpt-oss-20b against the committee-resolved gold standard, under the 30-run multi-seed protocol for the internal cohort and the 22-run protocol for the external TCGA cohort (Section 3.9).
Under the 30-run multi-seed protocol against the committee-resolved gold standard, the corresponding gpt-oss-20b per-organ macro-mean accuracy on the 192-field scope is 90.9% (Section 3.3). The lower headline value reflects four protocol differences from this pilot: (i) 30-run macro average vs. single run; (ii) 192-field paper scope vs. 193-field v10 scope; (iii) committee-resolved gold standard vs. v1 working gold standard; and (iv) per-organ macro mean vs. dataset-pooled mean. The architectural claim, that LLMs serve as interchangeable inference engines under DSPy, stands on this pilot; the headline performance is what the selected production pipeline delivers. A detailed architectural rationale (per-organ DSPy module enumeration, sparse-MoE routing overhead and bandwidth notes, decoding parameters) is provided in Supplementary Section S1.4.
3.2. Automated Report Triage and Organ Classification
Across the fixed 893-report validation cohort, repeated over 30 independent inference runs, the eligibility classifier achieved a report-level majority accuracy of 95.74% [95% Wilson CI: 94.21, 96.88], using independent reports as the interval denominator. The report-level sensitivity was 99.85% [99.18, 99.97] (684/685 gold-eligible reports), the specificity was 82.21% [76.44, 86.81] (171/208 gold-ineligible reports), and Cohen’s κ was 0.873 (
Table 4). No report tied under the 30-run majority rule. The corresponding pooled case-run accuracy was 95.68% (25,632/26,790 case-runs), retained only as a repeated-inference performance summary rather than as an independent clinical sample size. The run-to-run reliability is reported separately in
Section 3.8 and
Supplementary Table S35. The false positives clustered in clinically ambiguous instances, such as large metastatic deposits or extensive biopsy specimens. Detailed statistics on the accuracy and stability of organ classification are reported in Supplementary Tables S11–S13 and S26, and a heatmap summary of the organ-classification confusion matrix is provided in Supplementary Figure S5.
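For readers who wish to check the report-level intervals, a Wilson score interval can be computed directly from the counts given above. The sketch below is a standard-library implementation of the textbook formula; the function name is ours, and the counts (684 true positives plus 171 true negatives out of 893 reports) are taken from the sensitivity and specificity figures in this section.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Two-sided Wilson score interval for a binomial proportion (default 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half


# Report-level eligibility accuracy: (684 + 171) correct reports out of 893.
low, high = wilson_interval(684 + 171, 893)
print(f"{100 * low:.2f}, {100 * high:.2f}")
```

Using independent reports (n = 893) rather than case-runs (n = 26,790) as the denominator is what keeps the interval from being artificially narrow.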
For organ classification, the report-level endpoint was restricted to gold-eligible reports that also passed the majority eligibility gate (n = 684; one gold-eligible report failed the majority eligibility gate). Majority organ assignment reached 96.93% [95.35, 97.98] accuracy, with macro-F1 = 0.942 and Cohen’s κ = 0.966. The corresponding pooled Stage-B case-run summary was 97.11% (19,924/20,517 case-runs) within this report-level denominator; the chapter-wide pooled summary was 19,938/20,531 when all Stage-B case-runs were counted. These pooled summaries are retained only as repeated-inference descriptions. F1 was ≥0.95 across the ten in-scope organs in the pooled per-class table, and the dominant residual error pattern involved the gold-“others” class, quantified separately in Section Multi-Primary and Out-of-Scope Triage.
Multi-Primary and Out-of-Scope Triage
Across the 30 multi-seed runs, 1138 gold-standard “others” ledger entries were evaluated. At Stage B, the dispositions were as follows: 449 (39.5%) were correctly triaged to “others”; 631 (55.4%) were misrouted into a single in-scope organ; and 58 (5.1%) were over-flagged as “others”. Decomposing the 631 misroutes by the free-text annotation yielded 180 (28.5%) synchronous multi-primaries within an in-scope organ system, 163 (25.8%) out-of-scope sites approximated to a neighboring organ, and 288 (45.6%) phrasings that the regex could not confidently classify; lung was the dominant misroute attractor (n = 150), followed by pancreas (n = 108). A full disposition table is provided in Figure 2/Supplementary Table S30; per-organ misroute counts are shown in Supplementary Figure S8/Table S31.
3.3. Per-Organ Extraction Accuracy Across Ontology Fields
After classification, each cancer-surgery report was processed through the corresponding organ-specific extraction modules. Per-organ schemas are listed in
Supplementary Tables S1–S10, and per-organ per-field accuracy tables are provided in
Supplementary Tables S14–S23. For the
Section 3.3 main-text evaluation scope, 192 registry field cells across ten organ systems were evaluated after collapsing repeated inferences to case-field-level majority correctness labels before Wilson intervals were computed.
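The collapse from repeated inferences to one correctness label per case-field can be sketched as follows. This is a minimal illustration, assuming that each run is scored against the gold value and that a case-field counts as correct when more than half of its runs are correct; the tie rule and all names are assumptions of the sketch, and the toy values are illustrative only.

```python
from typing import Dict, Hashable, List, Tuple

CaseField = Tuple[str, str]  # (report id, field name)


def majority_correct(predictions: List[Hashable], gold: Hashable) -> bool:
    """True when more than half of the repeated runs match the gold value.

    Treating an exact tie as incorrect is an assumption of this sketch.
    """
    hits = sum(pred == gold for pred in predictions)
    return 2 * hits > len(predictions)


def summarize(
    runs: Dict[CaseField, List[Hashable]], gold: Dict[CaseField, Hashable]
) -> Tuple[float, float]:
    """Return (majority accuracy over case-fields, pooled accuracy over case-runs)."""
    labels = [majority_correct(preds, gold[key]) for key, preds in runs.items()]
    pooled_hits = sum(pred == gold[key] for key, preds in runs.items() for pred in preds)
    pooled_total = sum(len(preds) for preds in runs.values())
    return sum(labels) / len(labels), pooled_hits / pooled_total


# Toy data: two case-fields, three runs each.
toy_runs = {("r1", "grade"): ["G2", "G2", "G3"], ("r2", "grade"): ["G1", "G1", "G3"]}
toy_gold = {("r1", "grade"): "G2", ("r2", "grade"): "G1"}
majority_acc, pooled_acc = summarize(toy_runs, toy_gold)
```

In the toy data both case-fields are correct by majority even though one run in three is wrong for each, which is why majority-collapsed and pooled case-run accuracies are reported as separate quantities.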
The per-organ majority accuracy ranged from 89.35% [95% Wilson CI: 86.09, 91.92] in the cervix to 95.36% [94.43, 96.14] in the prostate, with a per-organ macro mean of 92.03% (
Table 5;
Figure 3;
Supplementary Table S27). The corresponding pooled case-run-field macro mean was 90.75% and is retained only as a descriptive repeated-inference summary. The bottom-decile field error attribution underlying this gradient is decomposed in
Section 3.3.2 using a six-mechanism taxonomy: M1, source-bound qualitative phrasing; M2, heterogeneous reporting/glossary-rule fixable; M3, anatomic ontology gap; M4, AJCC convention enforcement; M5, schema shape; and M6, unresolved model-side semantic conflation. The breast pathologic-versus-anatomic stage-group comparison is reported in
Section 3.3.1. The run-to-run stability is reported separately in
Section 3.8; no field exhibited a parse error or schema-validation failure across the thirty runs.
Note for readers comparing to Section 3.1/Table 3. The Section 3.1/Table 3 pilot value for gpt-oss-20b (94.30%, single run against the v1 working gold standard over the 193-field v10 evaluation scope) is not directly comparable to the Section 3.3/Table 5 headline of 92.03%. The Section 3.3 protocol differs from the pilot on four axes: (i) case-field-level majority aggregation across thirty repeated inference runs versus a single run; (ii) 192-field paper scope versus 193-field v10 scope; (iii) committee-resolved gold standard versus v1 working gold standard prior to committee adjudication; and (iv) per-organ macro mean versus dataset-pooled mean. The values in Section 3.3/Table 5 therefore represent the headline performance of the production pipeline under the revised committee-gold protocol.
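The fourth axis, per-organ macro mean versus dataset-pooled mean, can shift a headline value even when every underlying prediction is unchanged. The following sketch uses invented counts for two hypothetical organs purely to show the arithmetic; none of the numbers come from the study.

```python
from typing import Dict, Tuple

Counts = Dict[str, Tuple[int, int]]  # organ -> (correct case-fields, total case-fields)


def macro_mean(counts: Counts) -> float:
    """Average of per-organ accuracies; every organ carries equal weight."""
    return sum(correct / total for correct, total in counts.values()) / len(counts)


def pooled_mean(counts: Counts) -> float:
    """Accuracy over all case-fields; larger organs carry more weight."""
    correct = sum(c for c, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return correct / total


# Invented example: a small, harder organ and a large, easier organ.
toy_counts: Counts = {"organ_a": (90, 100), "organ_b": (480, 500)}
macro = macro_mean(toy_counts)    # (0.90 + 0.96) / 2
pooled = pooled_mean(toy_counts)  # 570 / 600
```

Here the macro mean is lower than the pooled mean because the smaller, harder organ is weighted equally with the larger one.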
3.3.1. Anatomic vs. Pathologic Stage Group Disambiguation in Breast Reports
The breast schema separates two stage-group fields, anatomic_stage_group and pathologic_stage_group, evaluated on the same 75 independent breast reports. Repeated inferences were collapsed to case-field-level majority correctness before interval estimation. The two fields showed markedly different performance: anatomic_stage_group reached 98.67% [95% Wilson CI: 92.83, 99.76], whereas pathologic_stage_group reached 80.00% [69.59, 87.49], an 18.67 percentage-point gap (
Figure 4;
Supplementary Table S28). The corresponding pooled repeated-inference accuracies were 97.95% and 74.87%, respectively, and are retained only as descriptive repeated-inference summaries.
Manual review indicated that the residual errors were structural rather than random noise. Three AJCC 8-related mechanisms accounted for most discrepancies: (i) post-treatment y-descriptor cases, for which AJCC 8 directs assignment to clinical prognostic stage, while the institutional source reports often provide only anatomic_stage_group; (ii) HER2 2+ equivocal cases pending fluorescence in situ hybridization (FISH), for which prognostic stage is genuinely indeterminate from the pathology report alone; and (iii) residual narrative ambiguity in non-y reports, where the model tended to default to anatomic staging. Two complementary structural mitigations—AJCC 8 stage-table injection and a multi-valued pathologic_stage_group representation for indeterminate cases—are proposed in
Section 4.
3.3.2. Bottom-Decile Field Error Attribution
Nine registry fields fell below 90% majority accuracy after repeated inferences were collapsed at the case-field level. We attributed these residual errors using a six-mechanism taxonomy that separates likely engineering remedies: M1, source-bound qualitative phrasing; M2, heterogeneous reporting or null-on-no-clue behavior; M3, anatomic ontology gap; M4, AJCC convention enforcement; M5, schema shape; and M6, unresolved model-side semantic conflation. M1 did not surface under the current majority-aggregated denominator.
Four fields fell under M2: surgical_technique (83.5%), tumor_necrosis (87.5%), cancer_clock (88.0%), and cancer_quadrant (76.0%). The surgical-technique residual was driven largely by heterogeneous prefix vocabulary, such as 3D or robotic descriptors, that the current schema does not fully disambiguate. One M4 case was distant_metastasis, with 16.4% [95% Wilson CI: 13.8, 19.4] majority accuracy, reflecting an AJCC convention problem in which absent evidence of metastasis is inconsistently represented between the source narrative and registry gold.
M5 cases included pathologic_stage_group (80.0%; Section 3.3.1) and the two breast biomarker fields HER2 (77.3%) and Ki-67 (86.7%), where the schema over-specifies the biomarker sub-record. Specifically, HER2 should expose only clinically meaningful score/status information, whereas Ki-67 should expose only the percentage; extraneous schema slots invite hallucinated or mismatched values. The pooled repeated-inference companion table still flagged colorectal tumor_invasion as a model-side semantic-confusion residual, but after case-field majority aggregation its accuracy rose above the bottom-decile threshold. Similarly, the global procedure field did not fall below 90% after majority aggregation, although liver procedure remains the canonical M3 example because mapping segmental liver resections onto CAP partial-hepatectomy categories requires anatomic segment-counting. This decomposition reframes the residual from a generic “low-performing field” problem into specific components that are closable by glossary/rule fixes, schema revision, ontology expansion, or further model-side semantic improvement. The bottom-decile field accuracies and their error-attribution mechanisms are illustrated in Figure 5 and detailed in Supplementary Tables S29 and S64.
3.4. Lymph Node and Surgical Margin Status
Across the nine organ systems for which the schema enumerates explicit surgical-margin categories, excluding prostate because its CAP protocol uses prostate-specific scalar margin fields, the pipeline identified margin involvement, defined as any-margin positivity, with high report-level majority accuracy. After repeated inferences were collapsed to report-level majority results, any-margin positivity ranged from 93.33% [95% Wilson CI: 85.32, 97.12] in breast to 100.00% in cervix, colorectum, lung, and thyroid. The residual list-level errors were dominated by hallucinated margin entries, defined as predicted margin elements with no gold counterpart, rather than missed margin elements. Hallucination rates were highest in lung, thyroid, liver, esophagus, and stomach, whereas the miss rates remained uniformly low. These hallucination and miss rates are retained as pooled list-level descriptive summaries rather than report-level Wilson endpoints. Because hallucinated entries generally defaulted to margin_involved = false in the gold-aligned scoring, this list-level hallucination pattern did not materially degrade the clinically critical any-margin-positive endpoint. Per-organ margin summaries are provided in
Supplementary Table S32, Supplementary Figures S12 and S13, with the prostate scalar-field margin callout reported in
Supplementary Table S33.
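The distinction between the list-level error rates and the any-margin-positive endpoint can be illustrated with a short sketch. The denominators chosen here (predicted entries for hallucination, gold entries for misses) and all names are assumptions, since the exact definitions are given in the supplementary material rather than in this section.

```python
from typing import Dict


def margin_metrics(pred: Dict[str, bool], gold: Dict[str, bool]) -> Dict[str, object]:
    """Compare predicted and gold margin lists keyed by margin name.

    Values are the margin_involved flags. Rate denominators are assumptions.
    """
    hallucinated = set(pred) - set(gold)  # predicted entries with no gold counterpart
    missed = set(gold) - set(pred)        # gold entries the prediction omitted
    return {
        "any_positive_match": any(pred.values()) == any(gold.values()),
        "hallucination_rate": len(hallucinated) / len(pred) if pred else 0.0,
        "miss_rate": len(missed) / len(gold) if gold else 0.0,
    }


# Toy example: one extra predicted margin, flagged as uninvolved.
toy_gold_margins = {"proximal": False, "distal": False}
toy_pred_margins = {"proximal": False, "distal": False, "radial": False}
toy_margin_result = margin_metrics(toy_pred_margins, toy_gold_margins)
```

In the toy case one third of the predicted entries are hallucinated, yet the any-margin-positive endpoint still matches because the extra entry is marked uninvolved.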
Any-positive nodal status, the clinically critical endpoint asking whether any regional lymph node was involved, was near ceiling after report-level majority aggregation. Nine of ten organ systems reached 100.00% majority correctness, and thyroid reached 98.61% [92.54, 99.75]. More granular lymph-node endpoints remained more challenging. In the pooled repeated-inference summary, total-count concordance with ±1 tolerance ranged from 96.0% in colorectum to 45.5% in esophagus. The harder per-station group-recall endpoint showed a recall–precision asymmetry: prostate, cervix, and esophagus had low recall but comparatively high precision, consistent with station-name canonicalization gaps rather than node fabrication. Because any-positive nodal correctness, the endpoint most directly affecting downstream stage assignment, was near ceiling in nine of ten organs, the remaining station-level errors are best interpreted as recoverable engineering and ontology-mapping issues rather than immediate clinical-safety failures. Per-organ lymph-node summaries are provided in
Supplementary Table S34 and Supplementary Figure S9. A single-run snapshot result is reported in
Supplementary Table S24.
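The two granular lymph-node endpoints can be sketched in the same spirit. The alias table, the normalization rule, and the function names below are hypothetical; the sketch only shows how a ±1 count tolerance is applied and how station-name canonicalization changes set-based precision and recall.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical alias table; a real mapping would come from a curated station ontology.
ALIASES: Dict[str, str] = {"lt obturator": "left obturator", "rt obturator": "right obturator"}


def count_concordant(pred_total: int, gold_total: int, tol: int = 1) -> bool:
    """Total-count concordance within an absolute tolerance."""
    return abs(pred_total - gold_total) <= tol


def canonical(name: str) -> str:
    """Lowercase, collapse whitespace, then apply the alias table."""
    key = " ".join(name.lower().split())
    return ALIASES.get(key, key)


def group_precision_recall(
    pred: Iterable[str], gold: Iterable[str], canonicalize: bool = True
) -> Tuple[float, float]:
    """Set-based precision and recall over lymph-node station names."""
    norm = canonical if canonicalize else (lambda s: s)
    p = {norm(x) for x in pred}
    g = {norm(x) for x in gold}
    tp = len(p & g)
    precision = tp / len(p) if p else 1.0
    recall = tp / len(g) if g else 1.0
    return precision, recall


toy_gold_stations = ["left obturator", "right obturator"]
toy_pred_stations = ["Lt obturator", "right obturator"]
raw_pr = group_precision_recall(toy_pred_stations, toy_gold_stations, canonicalize=False)
canon_pr = group_precision_recall(toy_pred_stations, toy_gold_stations)
```

The same prediction scores 0.5/0.5 before canonicalization and 1.0/1.0 after it, which is the sense in which station-level errors of this kind are recoverable by ontology mapping.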
3.5. Breast Biomarker Extraction Performance
Breast biomarker extraction was evaluated on 75 independent breast reports for the four clinically actionable receptor markers, with repeated inferences collapsed to report-biomarker majority correctness before Wilson intervals were computed. ER and PR each reached 98.67% [95% Wilson CI: 92.83, 99.76] majority accuracy (74/75), Ki-67 reached 86.67% [77.17, 92.59] (65/75), and HER2 was the lowest of the four at 77.33% [66.66, 85.34] (58/75). The corresponding pooled repeated-inference summaries were similar in point estimate—ER 98.67%, PR 98.31%, Ki-67 86.25%, and HER2 77.31%—and are retained only descriptively.
The bottom-decile field attribution in
Section 3.3.2 identifies HER2 and Ki-67 as M5 schema-shape residuals. HER2 has a single clinically meaningful immunohistochemistry score/status dimension, whereas the current schema exposes score, percentage, and positivity slots; the extraneous slots have no clinical referent and invite hallucinated values. Ki-67 is the complementary case: the clinically meaningful value is the staining percentage, whereas the additional score and positivity slots create unnecessary error surfaces. A schema revision that keeps score/status only for HER2 and percentage only for Ki-67 was identified during re-audit and is queued for the next pipeline iteration. ER and PR sit near the upper performance limit, because their schemas are already appropriately narrow.
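A narrowed biomarker sub-record of the kind described above might look like the following sketch. The class and field names are hypothetical, and the score-to-status mapping follows the common immunohistochemistry convention (0 and 1+ negative, 2+ equivocal, 3+ positive), which is an assumption of the sketch rather than a statement of the revised schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Her2Score(str, Enum):
    ZERO = "0"
    ONE_PLUS = "1+"
    TWO_PLUS = "2+"
    THREE_PLUS = "3+"


@dataclass(frozen=True)
class Her2Result:
    """HER2 exposes a single score dimension; status is derived, not extracted."""

    score: Optional[Her2Score]  # None when the report does not document HER2

    @property
    def status(self) -> Optional[str]:
        if self.score is None:
            return None
        if self.score is Her2Score.THREE_PLUS:
            return "positive"
        if self.score is Her2Score.TWO_PLUS:
            return "equivocal"  # typically pending FISH
        return "negative"


@dataclass(frozen=True)
class Ki67Result:
    """Ki-67 exposes only the staining percentage."""

    percentage: Optional[float]  # None when not documented

    def __post_init__(self) -> None:
        if self.percentage is not None and not 0.0 <= self.percentage <= 100.0:
            raise ValueError("Ki-67 percentage must lie in [0, 100]")
```

With no percentage slot on HER2 and no score or positivity slot on Ki-67, the output has no field in which a value without a clinical referent could be placed.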
Colorectal mismatch-repair biomarkers—MSH2, MSH6, MLH1, and PMS2—were evaluated on 72 independent colorectal reports after case-field majority aggregation. The majority accuracy was 91.67% [82.99, 96.12] for MSH2 (66/72) and 94.44% [86.57, 97.82] for MSH6, MLH1, and PMS2 (68/72 each). The MSH2 accuracy is partly bounded by the source coverage, because a minority of CMUH colorectal reports do not document MMR immunohistochemistry; these gold-null cases occasionally elicit a default model output, the same “silent source to LLM hallucination” pattern catalogued for TCGA in
Section 3.9, but at a lower rate in the internal cohort where MMR immunohistochemistry is usually part of the routine workup.
The run-to-run reliability across the thirty runs was at the noise floor for the four breast biomarkers: the within-case accuracy SD was below 0.01 for ER, PR, and HER2 categorical positivity, and the accuracy flip rate was below 1% across all four. These observations indicate that the remaining biomarker errors are primarily source- or schema-bound rather than run-stochastic.
For the same breast cohort, BRCA1, BRCA2, and TP53 mutation status were not present at meaningful rates in the source surgical-pathology narratives. These markers are typically reported in separate molecular-pathology workups rather than in surgical-resection reports. Their integration is therefore deferred to the planned multimodal genomics extension, rather than added to the current schema without a consistent source-data layer.
3.6. Component Ablations
To quantify the contribution of the engineering choices that distinguish Digital Registrar from a naïve free-text or single-prompt approach, we performed a five-cell component-ablation study on the internal CMUH cohort against the same committee-resolved gold standard. The ablation tested progressively stripped-down variants of the proposed pipeline by removing per-organ decomposition, the ReportJsonize pre-pass, the DSPy structured-decoding framework, and prompt-level schema discipline, as defined in
Section 2.4.3.
Using the majority/natural case-field sensitivity analysis for the chapter-3 scalar-field subset, the proposed dspy_modular pipeline achieved 88.26% [95% Wilson CI: 87.75, 88.75] accuracy. The performance decreased to 84.36% [83.78, 84.91] for dspy_monolithic_no_jsonize, 83.74% [83.16, 84.31] for dspy_monolithic, and 81.53% [80.92, 82.12] for raw_json. The schema-blind free_text_regex baseline fell to 18.17% [17.57, 18.79] (
Figure 6;
Supplementary Tables S45–S54). Modular advantage per field and seed consistency of each field are reported in
Supplementary Figures S1 and S2. These ablation results are not directly comparable to the
Section 3.3 per-organ macro mean of 92.03%, because the aggregation target and run protocol differ.
The main qualitative conclusion was robust across the ablation grid: the prompt-level schema discipline was load-bearing, as the schema-blind free-text baseline showed a large performance drop relative to the proposed pipeline. The DSPy framework was especially important for the chapter-5 lymph-node group extraction task, where removal of the framework in the raw_json variant markedly reduced the micro-F1 and increased the hallucination and miss rates. In contrast, several scalar field subsets showed smaller differences among the DSPy- or JSON-constrained variants. Detailed field-type stratification, nested-list error modes, matched-pair analyses, and the single-seed interpretation caveat for the stripped-down variants are provided in
Supplementary Tables S44–S54 and Extended Methods S1.3.1.
3.7. Inter-Annotator Agreement and Pre-Annotation Effect
The committee-resolved gold standard was constructed by two annotating pathologists, Kai-Po Chang and Nan-Haw Chow, with adjudication by a senior third pathologist, Han Chang. Under the proposed clinical workflow, in which both annotators edited the LLM-pre-annotated JSON draft, the two pathologists achieved a weighted-mean Cohen’s κ of 0.844 across 41 categorical fields (n = 9598 case-field pairings). At the highest-level report-triage organ-class decision across 12 categories, Cohen’s κ was 0.936 [0.919, 0.950], indicating high inter-annotator agreement (Figure 7; Supplementary Tables S56–S58). A secondary nominal Krippendorff’s α cross-check was concordant at 0.968. Detailed inter-annotator summaries are reported in Supplementary Figures S3, S10 and S11.
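For reference, unweighted Cohen’s κ for a single categorical field can be computed as in the sketch below, which uses only the standard library. The function name and the toy labels are ours; how per-field values were weighted into the reported mean is not restated here, so the sketch covers the per-field statistic only.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement from the product of each annotator's label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy labels: the annotators agree on three of four items.
annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
toy_kappa = cohen_kappa(annotator_1, annotator_2)
```

In the toy example the observed agreement is 0.75 and the chance agreement is 0.5, so κ is 0.5; raw agreement alone would overstate concordance.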
To test whether LLM pre-annotation systematically anchored the annotators, both annotators independently re-annotated a stratified 196-case subset without the LLM seed. Across 93 annotator × organ × field cells with defined Δκ, the mean Δκ was +0.015 and the median was 0.0; 14 cells improved, 7 worsened, and 72 were unchanged (
Supplementary Figure S16 and Supplementary Tables S59–S63). Thus, the headline κ would have been numerically similar without the LLM seed. Finally, adjudication fairness was assessed among cells with at least one disagreement between the two annotators (KPC and NHC); no systematic bias toward either annotator was detected after Holm correction (Supplementary Table S57). The pre-annotation effect per organ/field is documented in detail in Supplementary Figure S4.
3.8. Multi-Run Reliability
Across the 30 independent inference runs of the gpt-oss-20b pipeline, the run-to-run reliability of the upstream classification stages was very high. For eligibility classification, ICC(3, k = 30) was 0.9959 [0.9955, 0.9963], with single-run ICC(2,1) = 0.889 and a per-case flip rate of 3.6%. For organ classification, ICC(3,k) was 0.9963 [0.9959, 0.9967], with ICC(2,1) = 0.900 and a flip rate of 1.8%. These values indicate near-perfect reliability of the 30-run aggregate, while the single-run ICC and flip-rate values quantify the residual seed-to-seed variability.
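The reliability statistics can be illustrated on a small cases-by-runs matrix of correctness flags. The sketch below implements the standard two-way consistency form ICC(3,k) = (MS_rows − MS_error)/MS_rows. The flip-rate definition used here (the share of cases whose runs are not unanimous) is only one plausible reading and is an assumption of the sketch, as are the function names and the toy matrix.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]  # rows = cases, columns = runs


def flip_rate(matrix: Matrix) -> float:
    """Share of cases whose runs are not unanimous (assumed definition)."""
    return sum(len(set(row)) > 1 for row in matrix) / len(matrix)


def icc3k(matrix: Matrix) -> float:
    """ICC(3,k): two-way mixed, consistency, average of k runs."""
    n, k = len(matrix), len(matrix[0])
    if n < 2 or k < 2:
        raise ValueError("need at least two cases and two runs")
    grand = sum(sum(row) for row in matrix) / (n * k)
    row_means: List[float] = [sum(row) / k for row in matrix]
    col_means: List[float] = [sum(row[j] for row in matrix) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in matrix for x in row)
    ms_rows = ss_rows / (n - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    if ms_rows == 0:
        raise ValueError("no between-case variance; ICC is undefined")
    return (ms_rows - ms_error) / ms_rows


# Toy matrix: four cases, three runs; only the third case is not unanimous.
toy_matrix = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [1, 1, 1]]
toy_flip = flip_rate(toy_matrix)
toy_icc = icc3k(toy_matrix)
```

The aggregate ICC stays high in the toy matrix even though a quarter of the cases flip, which mirrors why the aggregate and single-run statistics are reported side by side.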
The per-field reliability was computed on the same 71-field cascade-evaluation set used in the component-ablation analysis. This scope corresponds to the 192-cell
Section 3.3 paper scope expanded by the separately reported biomarker and prostate-margin sub-fields described in
Section 2.2. Thus,
Section 3.3,
Section 3.6 and
Section 3.8 use the same underlying inference outputs but differ in reporting denominator and aggregation target:
Section 3.3 reports per-organ macro means after case-field majority aggregation,
Section 3.6 reports component-ablation summaries, and
Section 3.8 reports seed-to-seed reliability statistics.
For list-typed extractions, the per-case F1 standard deviation was organ-dependent. The noisiest cells were prostate lymph-node groups (SD 0.40), cervix lymph-node groups (0.36), esophagus lymph-node groups (0.35), and lung margins (0.16), reflecting the same station-name canonicalization and anatomic-complexity gradients described in
Section 3.4. The full per-organ per-field reliability results are provided in
Supplementary Table S35 and Supplementary Figures S14 and S15.
Accuracy intervals and multi-run reliability therefore answer distinct questions: Wilson intervals estimate uncertainty in majority-collapsed report-level or case-field-level accuracy, whereas ICC, Cronbach’s α, flip rates, and per-case run SD quantify seed-to-seed stability without treating the 30 runs as 30 independent clinical cohorts.
3.9. Multi-Run TCGA Cascade Evaluation (Full Schema, n = 242 Reports)
We evaluated the proposed local-LLM pipeline on the full per-organ schema across 242 TCGA reports spanning breast, colorectal, esophagus, liver, stomach, and thyroid. The external cohort was processed under a 22-seed multi-run protocol using identical prompts and decoding parameters, with only the random seed varying across runs. The difference between the 30-run internal protocol and the 22-run external protocol was due to compute-budget constraints rather than a methodological difference. The TCGA evaluation scope included approximately 124 per-organ scalar field cells, corresponding to roughly 51 unique scalar field names across the six organs, and is therefore not directly comparable to the full ten-organ internal scope described in
Section 2.2.
After report-level majority aggregation, eligibility triage reached 99.59% [95% Wilson CI: 97.70, 99.93], and organ classification reached 98.76% [96.40, 99.58], consistent with the internal cohort and indicating that the pipeline transported cleanly at the cascade-gating stages. At the per-field extraction stage, after case-field-level majority aggregation, the per-organ macro mean was 77.48% (
Supplementary Table S36; Supplementary Figure S6), ranging from 84.77% [82.05, 87.13] in esophagus to 72.92% [70.33, 75.35] in colorectal. Pooled repeated-inference summaries are retained only as descriptive companions. Per-organ accuracy is reported in
Supplementary Table S37.
The gap between the internal per-organ macro mean of 92.03% and the external TCGA mean of 77.48% appears to be driven largely by TCGA source-documentation properties rather than by model capacity alone. Older TCGA breast reports often omit hormone-receptor workups; mismatch-repair immunohistochemistry is not consistently performed or reported across TCGA contributing institutions; AJCC stage cannot be reliably assigned without an edition statement, which most TCGA reports do not carry; and pathologists use different conventions for coding the absence of distant metastasis, such as pM0 versus Mx. Thirteen of the 51 evaluated fields were therefore tagged in
Supplementary Table S38 and Supplementary Figure S7 with specific documentation caveats explaining their low effective accuracy in TCGA.
Excluding these 13 caveat-tagged fields raised the case-field majority accuracy to 88.02% [86.99, 88.97] in a within-v2 sensitivity analysis. Both readings are reported because they answer different questions: 77.48% represents the full-schema external-validation headline, whereas 88.02% estimates the performance after documentation-side caveats are set aside. The four-system comparison in
Section 3.10 shows that the rule-based extractor, BERT encoder, local language model, and commercial API model all showed reduced performance on the same caveat-tagged fields, supporting the interpretation that the residual external-validation gap is largely a corpus-documentation property rather than a failure of any single extraction architecture.
3.10. Baseline Comparison on the External TCGA Cohort
Four systems were evaluated head-to-head on TCGA-Reports (n = 242 across breast, colorectal, esophagus, liver, stomach, and thyroid) under the same Stage A/B/C cascade defined in Section 2.4. Stage C scalar-field effective accuracies were local LLM, 77.02% [76.76, 77.27]; API LLM, 75.22% [74.74, 75.70]; rule-based, 65.16%; and BERT-merged, 40.06% (Figure 8; Supplementary Table S39). On variable-length list fields, the API LLM performed better than the local LLM, as expected (margins F1 0.831 vs. 0.783; lymph node F1 0.686 vs. 0.587). Detailed per-stage and per-organ benchmark results for the baseline comparison are reported in Supplementary Tables S40–S44. The local pipeline is operationally comparable to the closed-frontier API on scalar accuracy while keeping the cohort on-premises; the structural ceiling on rule-based and BERT-merged systems (which cannot emit nested variable-length structures by construction) explains why neither is a viable substitute for schema-rich extraction.
4. Discussion
This study demonstrates that a clinically governed ontology, paired with schema-bound local-LLM execution, can convert free-text surgical pathology reports into registry-grade structured data at scale. The pipeline achieved high case-field-level majority accuracy across ten malignancies and the 192-cell paper scope while remaining feasible on a single workstation-class GPU, supporting practical deployment for document triage, organ classification, and fine-grained abstraction.
Although LLMs now show strong performance on clinical extraction tasks, the field has faced both computational and representational barriers [
24,
36]. Our CAP-aligned ontology addresses the representational barrier by defining a reusable clinical target that can outlast rapid changes in model architecture, while the on-premises gpt-oss-20b deployment addresses the computational and privacy barriers. Separating the schema logic from model weights also allowed the same clinical abstraction layer to be executed across different inference backends.
The four-system cascade comparison on the external TCGA cohort (
Section 3.10) clarifies where the proposed local-LLM pipeline sits in the methodological landscape. On Stage-C scalar field extraction, the local pipeline reached 77.47%, compared with 71.97% for the commercial API reference, 62.40% for the rule-based extractor, and 29.44% for the BERT-merged encoder. These results support the value of schema-constrained language-model extraction over narrower rule-based or encoder-only approaches, while not establishing a categorical local-versus-commercial model ordering because the API and local configurations were not matched for context length or model family. On variable-length list fields, the commercial API retained an advantage, consistent with the need for further work on nested extraction. The BERT-merged system additionally illustrates an architectural limitation of encoder-only baselines: they do not natively emit nested variable-length structures required for schema-rich registry workflows.
A consistent and clinically meaningful failure mode involved the distinction between anatomic and pathologic stage groups under AJCC 8 prognostic staging. On the same 75 paired breast reports (Section 3.3.1), after case-field-level majority aggregation, anatomic_stage_group reached 98.67% [92.83, 99.76], whereas pathologic_stage_group reached 80.00% [69.59, 87.49], an 18.67 percentage-point gap. This gap is not primarily a stochastic model-disambiguation failure; rather, it decomposes into structural dimensions including post-treatment y-descriptor cases, HER2 2+ equivocal cases requiring FISH to finalize prognostic stage, and residual narrative ambiguity in non-y reports. The appropriate mitigations are therefore structural: AJCC 8 stage-table injection and a multi-valued pathologic_stage_group representation for indeterminate cases. Future research directions on this topic are detailed in Supplementary Item S8.
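The stage-group comparison above can be checked with a short calculation. The following is a minimal sketch, assuming the bracketed values are 95% Wilson score intervals (this section does not name the interval method) and that the underlying counts are 74/75 and 60/75, which are inferred here from the reported percentages rather than taken from the source. The function name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Return the (lower, upper) Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Assumed counts: 74/75 = 98.67% and 60/75 = 80.00% (inferred, not stated in the text).
anatomic = wilson_interval(74, 75)
pathologic = wilson_interval(60, 75)
gap_pp = (74 - 60) / 75 * 100  # gap in percentage points

for name, (lo, hi) in [("anatomic", anatomic), ("pathologic", pathologic)]:
    print(f"{name}_stage_group: [{lo * 100:.2f}, {hi * 100:.2f}]")
print(f"gap: {gap_pp:.2f} percentage points")
```

Under these assumptions the computed bounds agree with the reported intervals to within rounding. The two intervals do not overlap, which is consistent with treating the gap as a structural effect rather than sampling noise.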
The current cancer-data schema models a single primary tumor per report, a deliberate choice aligned with the registry target of one extraction record per primary, but one that creates a structural blind spot for synchronous multi-primary disease. The Stage-B triage analysis in Section Multi-Primary and Out-of-Scope Triage quantified this directly: among 1138 repeated-inference ledger entries involving gold-standard “others” dispositions, 449 (39.5%) were correctly triaged to “others”, 631 (55.4%) were misrouted into a single in-scope organ, and 58 (5.1%) were over-flagged. This motivates a flag-for-review workflow in the current revision and a future schema-cardinality extension to support multiple linked cancer-entity records when more than one primary tumor is present. Future research directions on this topic are detailed in Supplementary Item S9.
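One way to picture the schema-cardinality extension is sketched below. This is a hypothetical illustration, not the study's schema: the class names `CancerEntity` and `ReportRecord`, their fields, and the review rule are all assumptions made for the example. The idea is that a report holds a list of linked primaries rather than exactly one, and that any record not containing exactly one primary is flagged for manual review.

```python
from dataclasses import dataclass, field


@dataclass
class CancerEntity:
    """One primary tumor (hypothetical shape; field names are illustrative)."""
    organ: str
    fields: dict = field(default_factory=dict)


@dataclass
class ReportRecord:
    """One pathology report linked to zero or more primaries instead of exactly one."""
    report_id: str
    primaries: list[CancerEntity] = field(default_factory=list)

    @property
    def needs_review(self) -> bool:
        # Assumed rule: anything other than exactly one primary goes to a reviewer.
        return len(self.primaries) != 1


single = ReportRecord("r1", [CancerEntity("breast")])
multi = ReportRecord("r2", [CancerEntity("breast"), CancerEntity("lung")])
empty = ReportRecord("r3")

for rec in (single, multi, empty):
    print(rec.report_id, len(rec.primaries), rec.needs_review)
```

Keeping `primaries` as a list means existing single-primary records remain valid as one-element lists, so the extension would not require reshaping previously extracted data.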
The bottom-decile audit after case-field majority aggregation supports a six-mechanism attribution rather than a single undifferentiated error category. Some residuals are likely closable by glossary or rule updates, such as the heterogeneous reporting of surgical technique, tumor necrosis, cancer clock, and cancer quadrant. Others reflect schema-shape problems, including over-specified biomarker sub-records for HER2 and Ki-67 or AJCC convention issues such as pM0 versus Mx handling. Still others, such as liver procedure mapping and lymph-node station canonicalization, point to anatomic ontology gaps. Importantly, no bottom-decile pattern was driven by parse-error or schema-validation failure, supporting the reliability of the structured-output layer itself. Future research directions on this topic are detailed in Supplementary Item S10.
Hereditary and tumor-suppressor markers, including BRCA1, BRCA2, and TP53, are clinically important but were not present at meaningful rates in the surgical-resection narratives in either the CMUH cohort or the TCGA breast surgical-pathology subset used for external validation. These markers are typically documented in separate molecular-pathology workups under distinct reporting workflows. Adding them as schema fields without supporting source-document content would create structural false negatives. The appropriate next step is therefore to integrate molecular workups as a separately linked source layer with explicit provenance, rather than forcing these variables into the surgical-pathology extraction schema.
A key future direction is extending this framework into a multimodal system. The schema-first JSON design can accommodate multimodal extension without breaking existing fields: whole-slide image features can be added as a linked image-derived layer, and genomic data can be added as a sibling molecular layer with their own source identifiers, extractor versions, and timestamps. Pathology-tuned vision models, including recent vision-language systems [37, 38, 39], could therefore be incorporated as additional extractors without rewriting the clinical ontology. In this design, the CAP-aligned clinical core remains the durable anchor, while new modalities attach as provenance-preserving layers.
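The layering idea can be sketched in a few lines. The example below is an assumption-laden illustration rather than the study's implementation: the helper `attach_layer`, the key names (`clinical_core`, `layers`, `source_id`, `extractor_version`, `timestamp`), and every value shown are hypothetical placeholders. It shows sibling layers being attached with their own provenance while the clinical core is left unchanged.

```python
import copy
import json
from datetime import datetime, timezone


def attach_layer(record: dict, name: str, payload: dict,
                 source_id: str, extractor_version: str) -> dict:
    """Return a copy of `record` with one provenance-tagged layer added."""
    if name in record.get("layers", {}):
        raise ValueError(f"layer {name!r} is already attached")
    out = copy.deepcopy(record)  # never mutate the caller's record
    out.setdefault("layers", {})[name] = {
        "source_id": source_id,
        "extractor_version": extractor_version,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "data": payload,
    }
    return out


# Placeholder values only; none of these come from the study.
core = {"clinical_core": {"organ": "breast", "pathologic_stage_group": "placeholder"}}
with_image = attach_layer(core, "image_derived", {"feature": "placeholder"},
                          "wsi-0001", "vision-extractor-0.1")
with_both = attach_layer(with_image, "molecular", {"TP53": "placeholder"},
                         "mol-0001", "mol-extractor-0.1")

print(json.dumps(with_both, indent=2))
```

Because each layer carries its own source identifier and extractor version, a consumer that only understands `clinical_core` can ignore `layers` entirely, which is the sense in which new modalities would not break existing fields.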
The limitations of this study include the single-institution internal development cohort, the current single-primary schema cardinality, source-documentation sensitivity in the TCGA cohort, and unresolved model-side semantic conflation for a small subset of fields. The next validation extension should therefore include a federated multi-institutional cohort, explicit multi-primary representation, and linked molecular-pathology and image-derived data sources.