Edge-Hosted LLM-Assisted NICU Discharge Summary Generation: Field-Level Evaluation Using a Clinician-Defined Rubric
Abstract
1. Introduction
2. Materials and Methods
2.1. Study Design and Setting
2.2. Clinical Informatics Infrastructure
2.3. Automated Discharge Summary Generation: MORPHEUS
2.4. Prompt Architecture
2.5. Evaluation Framework
2.6. Blinded Comparative Evaluation
2.7. Error Taxonomy
2.8. Iterative Prompt Refinement
2.9. Pipeline-Stage Ablation Analysis
3. Results
3.1. Study Cohort
3.2. Pipeline Performance
3.3. Rubric-Based Evaluation
3.4. Pipeline-Stage Ablation Analysis
4. Discussion
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| AI | Artificial Intelligence |
| ARCHITECT | Bedside data acquisition system name used in the study |
| BP | Blood Pressure |
| CC BY | Creative Commons Attribution |
| CI | Confidence Interval |
| CNS | Central Nervous System |
| DOL | Day of Life |
| EMR | Electronic Medical Record |
| FiO2 | Fraction of Inspired Oxygen |
| HR | Heart Rate |
| ICU | Intensive Care Unit |
| I/O | Intake and Output |
| IoT | Internet of Things |
| IQR | Interquartile Range |
| LLM | Large Language Model |
| MORPHEUS | Automated discharge summary generation pipeline name used in the study |
| MRN | Medical Registration Number |
| NEO | Embedded audio–video module name used in the platform |
| NICU | Neonatal Intensive Care Unit |
| NVMe | Non-Volatile Memory Express |
| pCO2 | Partial Pressure of Carbon Dioxide |
| pH | Potential of Hydrogen |
| ROP | Retinopathy of Prematurity |
| RR | Respiratory Rate |
| SD | Standard Deviation |
| SpO2 | Peripheral Capillary Oxygen Saturation |
| SSD | Solid-State Drive |
| TB | Terabyte |
References
- Baumann, L.A.; Baker, J.; Elshaug, A.G. The impact of electronic health record systems on clinical documentation times: A systematic review. Health Policy 2018, 122, 827–836.
- Genes, N.; Sills, J.; Heaton, H.A.; Shy, B.D.; Scofi, J. Addressing note bloat: Solutions for effective clinical documentation. JACEP Open 2025, 6, 100031.
- Rule, A.; Bedrick, S.; Chiang, M.F.; Hribar, M.R. Length and redundancy of outpatient progress notes across a decade at an academic medical center. JAMA Netw. Open 2021, 4, e2115334.
- Gaffney, A.; Woolhandler, S.; Cai, C.; Bor, D.; Himmelstein, J.; McCormick, D.; Himmelstein, D.U. Medical documentation burden among US office-based physicians in 2019: A national study. JAMA Intern. Med. 2022, 182, 564–566.
- Hauschildt, K.E.; Hechtman, R.K.; Prescott, H.C.; Iwashyna, T.J. Hospital discharge summaries are insufficient following ICU stays: A qualitative study. Crit. Care Explor. 2022, 4, e0715.
- Caruso, L.B.; Thwin, S.S.; Brandeis, G.H. Following up on clinical recommendations in transitions from hospital to nursing home. J. Aging Res. 2014, 2014, 873043.
- Were, M.C.; Li, X.; Kesterson, J.; Cadwallader, J.; Asirwa, C.; Khan, B.A.; Rosenman, M.B. Adequacy of hospital discharge summaries in documenting tests with pending results and outpatient follow-up providers. J. Gen. Intern. Med. 2009, 24, 1002–1006.
- Ng, I.K.S.; Tung, D.; Seet, T.; Yow, K.S.; Chan, K.L.E.; Teo, D.B.; Chua, C.E. How to write a good discharge summary: A primer for junior physicians. Postgrad. Med. J. 2025, 101, 764–772.
- Feblowitz, J.C.; Wright, A.; Singh, H.; Samal, L.; Sittig, D.F. Summarization of clinical information: A conceptual model. J. Biomed. Inform. 2011, 44, 688–699.
- Murad, M.H.; Vaa Stelling, B.E.; West, C.P.; Hasan, B.; Simha, S.; Saadi, S.; Firwana, M.; Viola, K.E.; Prokop, L.J.; Nayfeh, T.; et al. Measuring documentation burden in healthcare. J. Gen. Intern. Med. 2024, 39, 2837–2848.
- Lenert, L.A.; Sakaguchi, F.H.; Weir, C.R. Rethinking the discharge summary: A focus on handoff communication. Acad. Med. 2014, 89, 393–398.
- Hu, D.; Zhang, S.; Liu, Q.; Zhu, X.; Liu, B. Large language models in summarizing radiology report impressions for lung cancer in Chinese: Evaluation study. J. Med. Internet Res. 2025, 27, e65547.
- Woo, B.F.Y.; Cato, K.; Cho, H.; You, S.B.; Song, J. The use of large language models in clinical documentation: A scoping review. Int. J. Nurs. Stud. 2026, 176, 105322.
- Butt, F.; Varghese, N.; Elhadidi, A.; Abdulrahman, S.; Ben Ayad, A. Standardized neonatal ICU progress note template and feedback system: A clinical documentation improvement initiative. Cureus 2024, 16, e69971.
- Williams, C.Y.K.; Subramanian, C.R.; Ali, S.S.; Apolinario, M.; Askin, E.; Barish, P.; Cheng, M.; Deardorff, W.J.; Donthi, N.; Ganeshan, S.; et al. Physician- and large language model-generated hospital discharge summaries: A blinded comparative quality and safety study. JAMA Intern. Med. 2025, 185, 818–825.
- Lyu, M.; Peng, C.; Paredes, D.; Chen, Z.; Chen, A.; Bian, J.; Wu, Y. UF-HOBI at “Discharge Me!”: A Hybrid Solution for Discharge Summary Generation through Prompt-Based Tuning of GatorTronGPT Models. In Proceedings of the 23rd Workshop on Biomedical Natural Language Processing, Bangkok, Thailand, 16 August 2024.
- Klang, E.; Gill, J.; Sharma, A.; Leibner, E.; Sabounchi, M.; Freeman, R.; Kohli-Seth, R.; Kovatch, P.; Charney, A.W.; Stump, L.; et al. Summarize-then-Prompt: A novel prompt engineering strategy for generating high-quality discharge summaries. Appl. Clin. Inform. 2025, 16, 1325–1331.
- Mudumbai, S.C.; Chung, P.; Chen, J.Q.; Litake, O.; Regala, S.; Madhok, J.; Krause, M.; Pearl, R.G.; Boateng Evans, A.; Rodney, G. Evaluating large language model performance in generating clinically relevant intensive care unit discharge summaries. A&A Pract. 2025, 19, e02057.
- Hains, L.; Kleinig, O.; Murugappa, A.; Gluck, S.; Marks, J.; Gilbert, T.; Bacchi, S. Large language model discharge summary preparation using real-world electronic medical record data shows promise. Intern. Med. J. 2025, 55, 1188–1192.
- Mehri, T.; Nadalini, T.; Hoekman, A.H.; van der Laan, T.P.; Kagialari, K.; Wagner, R.K.; Doornberg, J.N.; Schoonbeek, R.C.; Bootsma-Robroeks, C.M.; Aalderink, M.; et al. Assessing the quality of AI-generated and physician-written discharge summaries: Evaluation of an EHR-integrated tool in a Dutch academic hospital. EBioMedicine 2026, 127, 106247.
- Rust, P.; Frings, J.; Meister, S.; Fehring, L. Evaluation of a large language model to simplify discharge summaries and provide cardiological lifestyle recommendations. Commun. Med. 2025, 5, 208.
- Singh, H.; Kaur, R.; Gangadharan, A.; Pandey, A.K.; Manur, A.; Sun, Y.; Saluja, S.; Gupta, S.; Palma, J.P.; Kumar, P. Neo-bedside monitoring device for integrated neonatal intensive care unit (iNICU). IEEE Access 2019, 7, 7803–7813.
- Singh, H.; Yadav, G.; Mallaiah, R.; Joshi, P.; Joshi, V.; Kaur, R.; Bansal, S.; Brahmachari, S. iNICU—Integrated neonatal care unit: Capturing neonatal journey in an intelligent data way. J. Med. Syst. 2017, 41, 132.
- Sun, Y.; Kaur, R.; Gupta, S.; Paul, R.; Das, R.; Cho, S.J.; Anand, S.; Boutilier, J.J.; Saria, S.; Palma, J.; et al. Development and validation of high definition phenotype-based mortality prediction in critical care units. JAMIA Open 2021, 4, ooab004.
- Singh, H.; Cho, S.J.; Gupta, S.; Kaur, R.; Sunidhi, S.; Saluja, S.; Pandey, A.K.; Bennett, M.V.; Lee, H.C.; Das, R.; et al. Designing a bed-side system for predicting length of stay in a neonatal intensive care unit. Sci. Rep. 2021, 11, 3342.
- NVIDIA. Jetson AGX Orin Specifications. Available online: https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/ (accessed on 15 May 2026).
- SanDisk. WD_BLACK SN850X NVMe SSD Product Specifications. Available online: https://www.sandisk.com/products/ssd/internal-ssd/wd-black-sn850x-nvme-ssd?sku=WDS100T2X0E (accessed on 15 May 2026).
- NVIDIA. JetPack 6.2 Release Notes and SDK Documentation. NVIDIA Developer. Available online: https://developer.nvidia.com/embedded/jetpack-sdk-62 (accessed on 15 May 2026).
- OpenAI. GPT-4o Mini: Advancing Cost-Efficient Intelligence. 2024. Available online: https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/ (accessed on 15 May 2026).
- Van Veen, D.; Van Uden, C.; Blankemeier, L.; Delbrouck, J.-B.; Aali, A.; Bluethgen, C.; Pareek, A.; Polacin, M.; Reis, E.P.; Seehofnerová, A.; et al. Clinical text summarization: Adapting large language models can outperform human experts. Res. Sq. 2023.
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
- Boateng, G.O.; Neilands, T.B.; Frongillo, E.A.; Melgar-Quiñonez, H.R.; Young, S.L. Best practices for developing and validating scales for health, social, and behavioral research: A primer. Front. Public Health 2018, 6, 149.
- Burke, H.B.; Sessums, L.L.; Hoang, A.; Becher, D.; Fontelo, P.; Liu, F.; Stephens, M.; Pangaro, L.N.; O’Malley, P.; Baxi, N.S.; et al. QNOTE: An instrument for measuring the quality of EHR clinical notes. J. Am. Med. Inform. Assoc. 2014, 21, 910–916.
- Kripalani, S. Care Transitions. PSNet, Agency for Healthcare Research and Quality. Available online: https://psnet.ahrq.gov/perspective/care-transitions (accessed on 15 May 2026).
- Luo, Z.; Xie, Q.; Ananiadou, S. Factual consistency evaluation of summarization in the era of large language models. Expert Syst. Appl. 2024, 254, 124456.
- Asgari, E.; Montaña-Brown, N.; Dubois, M.; Khalil, S.; Balloch, J.; Au Yeung, J.; Pimenta, D. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. npj Digit. Med. 2025, 8, 274.
- Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.; Chen, D.; Dai, W.; et al. Survey of hallucination in natural language generation. ACM Comput. Surv. 2024, 57, 248.
- Tang, L.; Goyal, T.; Fabbri, A.R.; Laban, P.; Xu, J.; Yavuz, S.; Kryściński, W.; Rousseau, J.F.; Durrett, G. Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; pp. 11626–11644. Available online: https://aclanthology.org/2023.acl-long.650.pdf (accessed on 15 May 2026).
- Madaan, A.; Tandon, N.; Gupta, P.; Hallinan, S.; Gao, L.; Wiegreffe, S.; Alon, U.; Dziri, N.; Prabhumoye, S.; Yang, Y.; et al. SELF-REFINE: Iterative refinement with self-feedback. arXiv 2023, arXiv:2303.17651.
- Ramnath, K.; Zhou, K.; Guan, S.; Mishra, S.S.; Qi, X.; Shen, Z.; Wang, S.; Woo, S.; Jeoung, S.; Wang, Y.; et al. A systematic survey of automatic prompt optimization techniques. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China, 4–9 November 2025; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 33078–33110. Available online: https://aclanthology.org/2025.emnlp-main.1681.pdf (accessed on 15 May 2026).
- Zhu, Z.; Zhou, H.; Feng, Z.; Li, T.; Deryl, C.J.J.; Onn, M.L.; Ng, G.W.; Mao, K. Rethinking prompt optimizers: From prompt merits to optimization. arXiv 2025, arXiv:2505.09930.
- Freedman, S.; Golberstein, E.; Huang, T.-Y.; Satin, D.; Smith, L.B. Docs with their eyes on the clock? The effect of time pressures on primary care productivity. J. Health Econ. 2021, 77, 102442.
- Van Veen, D.; Van Uden, C.; Blankemeier, L.; Delbrouck, J.-B.; Aali, A.; Bluethgen, C.; Pareek, A.; Polacin, M.; Reis, E.P.; Seehofnerová, A.; et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 2024, 30, 1134–1142.
- Kleinig, O.; Hains, L.; Murugappa, A.; Gluck, S.; Marks, J.; Gilbert, T.; Bacchi, S. Time for a rethink? Discharge summary completion is often delayed and associated with increased readmission. Intern. Med. J. 2025, 55, 1575–1577.
- Kripalani, S.; Jackson, A.T.; Schnipper, J.L.; Coleman, E.A. Promoting effective transitions of care at hospital discharge: A review of key issues for hospitalists. J. Hosp. Med. 2007, 2, 314–323.
- Patel, S.Y.; Palma, J.P.; Hoffman, J.M.; Lehmann, C.U. Neonatal informatics: Past, present and future. J. Perinatol. 2024, 44, 773–776.
- Ellsworth, M.A.; Lang, T.R.; Pickering, B.W.; Herasevich, V. Clinical data needs in the neonatal intensive care unit electronic medical record. BMC Med. Inform. Decis. Mak. 2014, 14, 92.







| Study | Primary Task | Setting | Model/Pipeline | Study Approach | Main Findings and Relevance to Present Study |
|---|---|---|---|---|---|
| Woo et al., 2025/2026 [13] | Adjacent task: scoping review | Cross-setting | Review of 41 studies | Mapped LLM applications in clinical documentation; many studies focused on note generation, discharge summaries, and encounter documentation; common issues included hallucinations, complex-case performance, privacy, and clinician trust | Provided broader contextual support for LLM-assisted clinical documentation research by summarizing reported benefits in efficiency, readability, and standardization across multiple healthcare settings. The review also highlighted persistent concerns regarding hallucinations, factual reliability, clinician trust, privacy, and degraded performance in clinically complex cases, reinforcing the importance of grounded evaluation and safety-focused summarization workflows. |
| Butt et al., 2024 [14] | Adjacent task: NICU documentation standardization | NICU | Standardized progress-note template and feedback system | Improved NICU documentation structure/compliance through template redesign and feedback | Reinforced that neonatal documentation is structurally unique and benefits from standardized organization and feedback-driven workflows. However, the work focused on template compliance and documentation consistency rather than automated synthesis of prolonged neonatal hospital courses across heterogeneous EMR sources. |
| Williams et al., 2025 [15] | Discharge summary generation | Adult general inpatient; UCSF hospital discharge summaries | LLM-generated discharge narratives compared with physician-authored narratives | Blinded comparative quality and safety study of 100 discharge summaries; LLM and physician summaries were rated comparably overall, with LLM outputs more concise/coherent but less comprehensive | Demonstrated feasibility of blinded clinician evaluation of LLM-generated discharge summaries in real clinical workflows, with strengths in coherence and conciseness. However, residual gaps in comprehensiveness emphasized the challenge of preserving clinically important details during automated summarization. |
| Lyu et al. (UF-HOBI), 2024 [16] | Discharge summary generation | Shared-task benchmark, not NICU-specific | Two-stage hybrid pipeline: NER extraction + prompt-tuned GatorTronGPT | Generated two discharge sections (“Brief Hospital Course” and “Discharge Instructions”); ranked 5th with overall score 0.284 | Relevant for demonstrating hybrid structured-generation approaches and handling dispersed records. However, the study evaluated selected discharge components rather than complete discharge synthesis and did not assess clinician-grounded safety, longitudinal neonatal complexity, or field-level documentation completeness. |
| Klang et al., 2025 [17] | Discharge summary generation | Adult inpatient discharge summaries | Prompt-engineering strategy: Summarize-then-Prompt | Evaluated whether summarizing individual notes before final prompting improves discharge summary generation | Supports the importance of prompt architecture and iterative summarization strategies for long clinical records. However, the work did not address the role of prompt orchestration and staged synthesis approaches for handling temporally distributed documentation. |
| Mehri et al., 2026 [20] | Automated discharge summary generation and evaluation | Dutch academic hospital; multi-specialty inpatient setting | EHR-integrated GPT-4o discharge summary generation pipeline | Compared physician-written and AI-generated discharge summaries across multiple specialties using blinded clinician evaluation | Demonstrated that EHR-integrated LLM-generated discharge summaries can achieve quality comparable to physician-written summaries in real-world workflows, although completeness gaps and need for specialty-specific refinement remained. Supports the importance of clinically grounded evaluation frameworks and structured prompting approaches. |
| Rust et al., 2025 [21] | Adjacent task: simplification/patient-facing adaptation | Adult cardiology discharge summaries | GPT-4o, full-text vs. segment-wise prompting | Simplified existing discharge summaries and generated lifestyle recommendations; improved readability and was rated largely correct/complete/harmless by experts | Demonstrates value of section-wise prompting and discharge-document adaptation workflows. Supports the importance of modular prompting strategies for transforming complex clinical narratives into targeted outputs for different audiences. |
| Mudumbai et al., 2025 [18] | ICU discharge summary generation/evaluation | ICU setting | LLM evaluation for clinically relevant ICU discharge summaries | Evaluated LLM performance on ICU discharge summary generation, emphasizing the challenge of summarizing complex ICU courses | Particularly relevant because ICU workflows are closer to NICU complexity than general inpatient settings. However, adult ICU workflows differ substantially from neonatal care, and available reports did not establish NICU-specific templates, longitudinal neonatal constraints, or clinician-grounded field-level evaluation. |
| Hains et al., 2025 [19] | Discharge summary preparation/generation | Real-world EMR data; adult hospital setting | LLM-based discharge summary preparation from EMR data | Demonstrated promise using real-world EMR-derived records for discharge summary preparation | Operationally relevant because it used authentic EMR-derived discharge workflows. However, published evaluation details on section-level safety, omission analysis, and longitudinal complexity remained limited. |

| Section | Fields |
|---|---|
| Initial Assessment | Presenting problem and relevant antenatal/perinatal context; Delivery details and APGAR scores if available; Immediate postnatal events; Admission anthropometry (weight, length, OFC); Initial vitals (HR, RR, SpO2, BP, CRT); Early systemic examination findings and initial differential impression |
| Feeding and Nutrition | Feeding initiation; Feeding mode; PN/EN timeline; Advancement pattern; Max tolerated feed; Growth velocity (g/kg/day) |
| Medications | Drug name; Route; Dose correctness; Duration |
| Infections | Suspected/confirmed; Sepsis workup; Antibiotics timeline; Culture result; Duration of therapy; Source of suspected infection (maternal, device, community, hospital) |
| CNS | Tone and reflex exam; Seizure documentation; Neuroimaging; Neurological impression; Tone evolution direction (improving/stable/worsening) |
| Vitals | HR range; RR range; SpO2; BP; Temperature; CRT |
| Respiratory Distress Summary | Onset timing; Respiratory mode(s) used; FiO2 range; Escalation/weaning timeline; Surfactant use; Respiratory severity scale (Silverman/Downes) |
| Jaundice | Onset timing; Phototherapy type; Phototherapy timeline; Bilirubin values; Threshold interpretation; Exchange transfusion status |
| Apnea | Onset timing; Frequency/severity; Intervention documented; Resolution timing; Trigger identifier (central/obstructive/secondary cause) |
| Shock | Hypoperfusion signs; Shock intervention; Resolution timing; Type (hypovolemic/septic/cardiogenic) |
| ROP | Screening timing; Zone; Stage; Plus disease; Intervention and follow-up; PMA (Post-Menstrual Age) at screening |
| Procedures | Procedure type; Indication; Date/timeline; Outcome/status |
| Investigations | Hematology; Metabolic; Microbiology; Imaging; Abnormal values |

| Dimension | Score 1.0 | Score 0.5 | Score 0.0 |
|---|---|---|---|
| Clinical Accuracy | Factually correct, clinically plausible, matches chart values. | Minor vagueness, non-specific phrasing, rounding or omission of units, but clinically acceptable. | Factually incorrect, physiologically impossible, contradicts chart/evidence, or misinterprets findings. |
| Completeness/Missing Data Detection | All required sub-elements documented with meaningful detail and measurable values when applicable. | ≥50% of elements present OR partially descriptive without quantification. | Largely missing; placeholders, generic descriptors, or value-less text. |
| Actionability and Continuity of Care | Provides a clear next step, monitoring instruction, follow-up requirement, threshold or escalation trigger. | Gives some guidance but lacks specificity, interval, or responsible provider/location. | Descriptive without implied clinical decision or continuity relevance. |
| Coherence and Timeline Validation | Chronicle flows logically, dates/DOL consistent across the entire summary, no reverse-time events, and resolution follows intervention. | Timeline implied but not explicit OR mild ambiguity without contradiction. | Contradictory sequences, impossible chronology, or circular logic. |
| Non-Hallucination Scoring | Fully traceable to primary record, timestamp, nursing/doctor notes, orders, reports, or validated clinical logic. | Ambiguous origin OR seems inferred rather than documented, but possibly true. | Invented, guessed, reverse-engineered, or clearly absent from the chart. |
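To illustrate how the three-level rubric could be turned into per-dimension scores, the following minimal Python sketch averages field-level scores within each dimension. The simple mean, the dimension keys, the treatment of missing dimensions as not applicable, and the function name `dimension_means` are all assumptions made for illustration; the aggregation actually used in the study is not stated in this section.

```python
from statistics import mean

# The three score levels defined by the rubric.
ALLOWED_SCORES = (0.0, 0.5, 1.0)

# Hypothetical keys for the five rubric dimensions.
DIMENSIONS = (
    "accuracy",
    "completeness",
    "actionability",
    "coherence",
    "non_hallucination",
)


def dimension_means(field_scores):
    """Average rubric scores per dimension across scored fields.

    field_scores: list of dicts, one per field, mapping a dimension
    key to a score. A dimension absent from a dict is treated as
    not applicable for that field and skipped (an assumption).
    """
    collected = {dim: [] for dim in DIMENSIONS}
    for scores in field_scores:
        for dim, value in scores.items():
            if dim not in collected:
                raise KeyError(f"unknown dimension: {dim}")
            if value not in ALLOWED_SCORES:
                raise ValueError(f"score must be 0, 0.5 or 1, got {value}")
            collected[dim].append(value)
    # Report only dimensions that received at least one score.
    return {dim: mean(vals) for dim, vals in collected.items() if vals}


# Made-up scores for two fields of one summary (not study data).
example_fields = [
    {"accuracy": 1.0, "completeness": 0.5, "non_hallucination": 1.0},
    {"accuracy": 0.5, "completeness": 1.0, "non_hallucination": 1.0},
]
example_result = dimension_means(example_fields)
```

Restricting scores to the three rubric levels catches data-entry errors early, before they distort a dimension mean.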

| Characteristics, by gestational age group | 23–25 weeks (n = 3) | 26–28 weeks (n = 12) | 29–31 weeks (n = 40) | 32–34 weeks (n = 65) | ≥35 weeks (n = 281) |
|---|---|---|---|---|---|
| Gestation * | 23.9 (0.0) | 26.2 (0.9) | 29.9 (0.8) | 32.9 (0.9) | 37.2 (1.6) |
| Birth weight (grams) * | 630.0 (26.5) | 820.4 (126.8) | 1138.7 (306.8) | 1749.0 (426.2) | 2691.9 (555.7) |
| Gender (Male) | 2 (66.7%) | 9 (75.0%) | 30 (75.0%) | 41 (63.1%) | 175 (62.3%) |
| Conception type (IVF) | 2 (66.7%) | 1 (8.3%) | 5 (12.5%) | 15 (23%) | 21 (7.4%) |
| Length of Stay (LOS) # | 17.7 (15.5) | 59.5 (63.5) | 35.3 (50.75) | 14.2 (15) | 7.7 (9) |
| Antenatal Steroids | 0 (0.0%) | 8 (66.7%) | 27 (67.5%) | 38 (58.4%) | 36 (12.8%) |
| Mode of delivery (LSCS) | 2 (66.7%) | 7 (58.3%) | 32 (80.0%) | 55 (84.6%) | 214 (76.5%) |
| Outborn | 2 (66.7%) | 2 (16.7%) | 17 (42.5%) | 17 (26.15%) | 82 (29.18%) |
| Baby type (Multiple) | 2 (66.7%) | 2 (16.7%) | 7 (17.5%) | 16 (24.6%) | 26 (9.25%) |
| APGAR #—One minute | 6.0 (0.0) | 5.2 (1.5) | 5.8 (1.0) | 7.4 (1.0) | 7.7 (2.0) |
| APGAR #—Five minutes | 7.0 (0.0) | 7.2 (1.75) | 7.8 (2.0) | 8.8 (2.0) | 9.0 (1.0) |
| Birth head circumference | 21.5 (1.1) | 23.9 (1.1) | 27.5 (2.5) | 29.9 (2.9) | 33.1 (2.1) |
| Jaundice needing phototherapy | NA | 2 (16.7%) | 1 (2.5%) | 11 (16.9%) | 104 (37.3%) |
| Sepsis | 3 (100.0%) | 4 (30.8%) | 11 (30.6%) | 17 (27.4%) | 50 (17.9%) |
| Respiratory Distress Syndrome | 3 (100.0%) | 11 (84.6%) | 35 (97.2%) | 52 (83.9%) | 154 (55.2%) |

| Metric | Stage 1 | Stage 1.5 |
|---|---|---|
| Time per patient (seconds) | 62.73 | 169.5 |
| LLM calls (per patient/total in stage) | 12/4950 | 7/2642 |

| Section | Patients with this section | Stage 1: mean time (seconds) | Stage 1: mean number of characters | Stage 1.5: number of sections refined | Stage 1.5: average time per section (seconds) | Stage 1.5: mean number of characters |
|---|---|---|---|---|---|---|
| Vitals | 401 | 2.61 | 22,117 | 19 | 43.31 | 71,570 |
| Medications | 401 | 7.06 | 16,688 | 251 | 4.36 | 75,196 |
| Nursing notes | 401 | NA | 43,161 | 0 | NA | NA |
| Feeding and Nutrition | 401 | 5.02 | 11,651 | 0 | NA | NA |
| Procedures | 277 | 5.61 | 3670 | 2 | 21.1 | 79,721 |
| Initial Assessment (birth, antenatal, status at admission) | 400 | 1.57/2.44/5.79 | 1273 | 0 | NA | NA |
| Demographics | 401 | 10.07 | 543 | 0 | NA | NA |
| Investigation summary | 372 | 18.63 | 220,171 | 26 | 0 | 14 |
| Respiratory distress summary | 100 | 1.78 | 21,994 | 121 | 116.21 | 57,757 |
| Assessments (CNS, shock, apnea, infection, jaundice, ROP) | 0, 34, 6, 78, 190, 0 | 0, 3.5, 1.5, 3.6, 1.21, 0 | 52,858, 53,060, 52,873, 52,858, 54,791, 0 | 106, 148, 203, 144, 28, 17 | 39.88, 44.07, 34.55, 12.31, 24.25, 30.79 | 90,892, 86,952, 81,050, 101,697, 70,291 |
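The per-stage totals above can be related to the per-patient figures with simple arithmetic. The sketch below assumes a cohort of 401 patients (the largest per-section patient count in the table) and assumes that Stage 1 and Stage 1.5 run one after the other, so their per-patient times add. Both assumptions and all variable names are illustrative, not taken from the study's code.

```python
# Assumed cohort size: the largest per-section patient count in the table.
N_PATIENTS = 401

# Per-stage figures copied from the table.
stage_stats = {
    "Stage 1": {"seconds_per_patient": 62.73, "total_llm_calls": 4950},
    "Stage 1.5": {"seconds_per_patient": 169.5, "total_llm_calls": 2642},
}


def calls_per_patient(total_calls, n_patients=N_PATIENTS):
    """Mean number of LLM calls per patient for one stage."""
    return total_calls / n_patients


# Mean LLM calls per patient, rounded to two decimals.
mean_calls = {
    stage: round(calls_per_patient(stats["total_llm_calls"]), 2)
    for stage, stats in stage_stats.items()
}

# Combined time per patient, assuming the two stages run sequentially.
combined_seconds = round(
    sum(stats["seconds_per_patient"] for stats in stage_stats.values()), 2
)

print(mean_calls, combined_seconds)
```

Under these assumptions the computed means round to the 12 and 7 calls per patient reported in the table.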
| Dimension | Total Fields Compared ^ | Human | LLM | Reward/Penalty/Neutral |
|---|---|---|---|---|
| Clinical Accuracy | 3142 | 0.75 | 0.95 | 842/141/2159 |
| Completeness | 3129 | 0.67 | 0.92 | 1157/177/1795 |
| Actionability | 2614 | 0.72 | 0.94 | 801/122/1691 |
| Coherence | 2616 | 0.74 | 0.95 | 721/105/1790 |
| Non-hallucination | 2619 | 0.77 | 0.96 | 610/79/1930 |
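A quick consistency check on the table above: for each dimension, the Reward, Penalty, and Neutral counts should add up to the total number of fields compared. The Python sketch below performs that check and converts the counts into percentage shares. The dictionary layout and the function name `outcome_shares` are illustrative assumptions, and the sketch takes no position on how the three outcome labels were assigned.

```python
# Counts copied from the table: total fields compared, then
# reward / penalty / neutral outcomes per dimension.
rubric_rows = {
    "Clinical Accuracy": {"total": 3142, "reward": 842, "penalty": 141, "neutral": 2159},
    "Completeness": {"total": 3129, "reward": 1157, "penalty": 177, "neutral": 1795},
    "Actionability": {"total": 2614, "reward": 801, "penalty": 122, "neutral": 1691},
    "Coherence": {"total": 2616, "reward": 721, "penalty": 105, "neutral": 1790},
    "Non-hallucination": {"total": 2619, "reward": 610, "penalty": 79, "neutral": 1930},
}

OUTCOMES = ("reward", "penalty", "neutral")


def outcome_shares(row):
    """Return each outcome as a percentage of the fields compared.

    Raises ValueError if the three counts do not sum to the total.
    """
    counted = sum(row[name] for name in OUTCOMES)
    if counted != row["total"]:
        raise ValueError(f"counts sum to {counted}, expected {row['total']}")
    return {name: round(100 * row[name] / row["total"], 1) for name in OUTCOMES}


shares = {dimension: outcome_shares(row) for dimension, row in rubric_rows.items()}

for dimension, pct in shares.items():
    print(dimension, pct)
```

All five rows pass the check, so the outcome counts and field totals in the table are internally consistent.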

| Error Type | LLM Total | Human Total | LLM Rate/Patient | Human Rate/Patient | LLM Rate/100 Fields | Human Rate/100 Fields | Relative Reduction (%) | p-Value |
|---|---|---|---|---|---|---|---|---|
| Omission | 1010 | 2703 | 2.519 | 6.74 | 3.498 | 9.362 | 62.63 | <0.001 |
| Unsupported assertion | 31 | 54 | 0.077 | 0.135 | 0.107 | 0.187 | 42.59 | 0.017 |
| Contradiction | 43 | 112 | 0.107 | 0.279 | 0.149 | 0.388 | 61.61 | <0.001 |
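The per-patient rates and relative reductions in the table above follow directly from the error totals. The sketch below reproduces them, assuming a denominator of 401 patients (the sum of the gestational-age group sizes in the cohort table) and defining relative reduction as the drop from the human total to the LLM total, divided by the human total. The function names are hypothetical.

```python
# Assumed cohort size: 3 + 12 + 40 + 65 + 281 from the cohort table.
N_PATIENTS = 401

# Error totals copied from the table.
error_totals = {
    "Omission": {"llm": 1010, "human": 2703},
    "Unsupported assertion": {"llm": 31, "human": 54},
    "Contradiction": {"llm": 43, "human": 112},
}


def rate_per_patient(total, n_patients=N_PATIENTS):
    """Mean number of errors per patient."""
    return total / n_patients


def relative_reduction_pct(llm_total, human_total):
    """Percentage reduction of the LLM total relative to the human total."""
    return 100 * (human_total - llm_total) / human_total


summary = {
    name: {
        "llm_rate": round(rate_per_patient(totals["llm"]), 3),
        "human_rate": round(rate_per_patient(totals["human"]), 3),
        "reduction_pct": round(
            relative_reduction_pct(totals["llm"], totals["human"]), 2
        ),
    }
    for name, totals in error_totals.items()
}

for name, values in summary.items():
    print(name, values)
```

With these definitions the computed values match the table's LLM rates per patient and relative reductions to the reported precision.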

| Section | Omission (LLM) | Omission (Human) | Unsupported (LLM) | Unsupported (Human) | Contradiction (LLM) | Contradiction (Human) | Total Errors (LLM) | Total Errors (Human) |
|---|---|---|---|---|---|---|---|---|
| Vitals | 119 | 1032 | 3 | 9 | 26 | 32 | 148 | 1073 |
| Medications | 409 | 489 | 3 | 19 | 10 | 34 | 422 | 542 |
| Investigation Summary | 332 | 350 | 9 | 12 | 1 | 10 | 342 | 372 |
| Procedures | 20 | 342 | 3 | 4 | 2 | 13 | 25 | 359 |
| Jaundice | 32 | 107 | 2 | 2 | 2 | 2 | 36 | 111 |
| Section | AI Mean ± SD | Clinician Mean ± SD | Mean Difference | 95% CI | p-Value | Relative Improvement (%) |
|---|---|---|---|---|---|---|
| Apnea | 0.007 ± 0.086 | 0.057 ± 0.307 | 0.05 | 0.019 to 0.080 | 0.002 | 86.96 |
| CNS | 0.045 ± 0.230 | 0.145 ± 0.359 | 0.1 | 0.061 to 0.138 | <0.001 | 68.97 |
| Feeding and Nutrition | 0.007 ± 0.086 | 0.157 ± 0.503 | 0.15 | 0.101 to 0.198 | <0.001 | 95.24 |
| Infections | 0.032 ± 0.191 | 0.065 ± 0.293 | 0.032 | 0.002 to 0.063 | 0.036 | 50.0 |
| Initial Assessment | 0.022 ± 0.179 | 0.162 ± 0.460 | 0.14 | 0.092 to 0.187 | <0.001 | 86.15 |
| Investigation Summary | 0.853 ± 1.103 | 0.928 ± 1.193 | 0.075 | −0.059 to 0.208 | 0.417 | 8.06 |
| Jaundice | 0.090 ± 0.286 | 0.277 ± 0.534 | 0.187 | 0.130 to 0.244 | <0.001 | 67.57 |
| Medications | 1.052 ± 1.338 | 1.352 ± 1.341 | 0.299 | 0.147 to 0.452 | <0.001 | 22.14 |
| Procedures | 0.062 ± 0.330 | 0.895 ± 1.403 | 0.833 | 0.697 to 0.969 | <0.001 | 93.04 |
| ROP | 0.082 ± 0.275 | 0.100 ± 0.308 | 0.017 | −0.017 to 0.052 | 0.317 | 17.5 |
| Respiratory Distress Summary | 0.017 ± 0.131 | 0.192 ± 0.530 | 0.175 | 0.125 to 0.224 | <0.001 | 90.91 |
| Shock | 0.062 ± 0.242 | 0.150 ± 0.378 | 0.087 | 0.046 to 0.129 | <0.001 | 58.33 |
| Vitals | 0.369 ± 0.666 | 2.676 ± 2.229 | 2.307 | 2.094 to 2.520 | <0.001 | 86.21 |
| Metric | Iteration 1 | Iteration 15 | Absolute Change | % Change | p-Value |
|---|---|---|---|---|---|
| Accuracy | 0.937 | 0.943 | 0.006 | 0.6 | 0.124 |
| Completeness | 0.913 | 0.933 | 0.020 | 2.0 | 0.262 |
| Coherence | 0.941 | 0.961 | 0.020 | 2.0 | 0.520 |
| Actionability | 0.935 | 0.957 | 0.022 | 2.2 | 0.892 |
| Non-hallucination | 0.952 | 0.974 | 0.022 | 2.2 | 0.670 |
| Omission rate/patient | 2.484 | 1.807 | −0.677 | −27 | 0.028 |
| Unsupported assertions/patient | 0.095 | 0.072 | −0.023 | −24 | 0.317 |
| Contradictions/patient | 0.095 | 0.095 | 0 | 0 | 0.8842 |
| Dimension | LLM Mean (SD) | Human Mean (SD) | Mean Difference | 95% CI | p-Value | Cohen’s d |
|---|---|---|---|---|---|---|
| Accuracy | 0.927 (0.106) | 0.748 (0.231) | 0.178 | 0.154–0.201 | <0.01 | 0.742 |
| Completeness | 0.911 (0.124) | 0.671 (0.230) | 0.24 | 0.202–0.251 | <0.01 | 0.90 |
| Coherence | 0.935 (0.110) | 0.741 (0.264) | 0.19 | 0.163–0.217 | <0.01 | 0.7 |
| Actionability | 0.931 (0.134) | 0.724 (0.265) | 0.202 | 0.174–0.229 | <0.01 | 0.734 |
| Non-hallucination | 0.951 (0.137) | 0.784 (0.259) | 0.161 | 0.135–0.188 | <0.01 | 0.61 |
| Metric | Human Summaries | Stage 0: Generic Single-Stage Prompt | Stage 1: Section-Wise Generation | Stage 1.5: Targeted Refinement | Stage 2: Final MORPHEUS Summary |
|---|---|---|---|---|---|
| Accuracy | 0.748 | 0.858 | 0.953 | 0.961 | 0.956 |
| Completeness | 0.671 | 0.794 | 0.935 | 0.959 | 0.939 |
| Coherence | 0.741 | 0.842 | 0.965 | 0.961 | 0.966 |
| Actionability | 0.724 | 0.840 | 0.962 | 0.959 | 0.962 |
| Non-hallucination | 0.784 | 0.875 | 0.970 | 0.988 | 0.974 |
| Omissions/patient | 7.64 | 5.12 | 3.00 | 2.76 | 3.16 |
| Unsupported/patient | 0.16 | 0.00 | 0.12 | 0.00 | 0.08 |
| Contradictions/patient | 0.36 | 0.16 | 0.36 | 0.08 | 0.08 |
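To read the ablation table stage by stage, it can help to compute the change in each rubric score between consecutive pipeline stages. The Python sketch below does this for the five rubric dimensions, using the values from the table. The variable and function names are illustrative, and the human baseline column is left out because it is not a pipeline stage.

```python
# Pipeline stages in the order they appear in the ablation table.
STAGES = ("Stage 0", "Stage 1", "Stage 1.5", "Stage 2")

# Rubric scores per stage, copied from the table.
ablation = {
    "Accuracy": (0.858, 0.953, 0.961, 0.956),
    "Completeness": (0.794, 0.935, 0.959, 0.939),
    "Coherence": (0.842, 0.965, 0.961, 0.966),
    "Actionability": (0.840, 0.962, 0.959, 0.962),
    "Non-hallucination": (0.875, 0.970, 0.988, 0.974),
}


def stage_deltas(values):
    """Change in score between each pair of consecutive stages."""
    return [round(later - earlier, 3) for earlier, later in zip(values, values[1:])]


deltas = {metric: stage_deltas(values) for metric, values in ablation.items()}

# For each metric, the stage whose introduction produced the largest gain.
largest_step = {
    metric: STAGES[1 + max(range(len(d)), key=lambda i: d[i])]
    for metric, d in deltas.items()
}

for metric in ablation:
    print(metric, deltas[metric], largest_step[metric])
```

In these numbers the largest single gain for every rubric dimension comes from moving from the generic single-stage prompt (Stage 0) to section-wise generation (Stage 1); later stages change each score by less than 0.03 in either direction.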
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Singh, H.; Kaur, R.; Saluja, S.; Cho, S.J.; Sun, Y.; McAdams, R.M. Edge-Hosted LLM-Assisted NICU Discharge Summary Generation: Field-Level Evaluation Using a Clinician-Defined Rubric. Healthcare 2026, 14, 1457. https://doi.org/10.3390/healthcare14111457

