Article

Generative Simulation and Summarization of Neonatal Patient Data

Department of Systems and Computer Engineering, Carleton University, Ottawa, ON K1S5B6, Canada
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Information 2026, 17(3), 261; https://doi.org/10.3390/info17030261
Submission received: 14 December 2025 / Revised: 20 February 2026 / Accepted: 3 March 2026 / Published: 5 March 2026

Abstract

In the Neonatal Intensive Care Unit (NICU), clinicians must balance the demands of constant patient monitoring with the need for precise documentation and clear communication with colleagues and families. To address the clinical burden of documenting patient care and health status, this paper presents two complementary AI-based systems. First, a GAN-driven NICU Patient Simulator is developed to generate realistic neonatal vital sign data and discrete clinical intervention events, typical of care in the NICU. While useful for a variety of research goals, this simulator provides a safe and controllable data source essential for the development and validation of the second system: the LLM-powered Neonatal Patient Status Summarizer (NPSS). The NPSS fuses the output of multiple machine learning systems, each extracting specific aspects of patient care and health, together with vital sign data from a patient monitor. Leveraging Retrieval-Augmented Generation (RAG) to incorporate neonatal-specific reference data, the NPSS enables several key use cases, including generating parent-friendly updates, summarizing patient status for clinician handovers, and automatically populating patient records for charting. Simulator validation demonstrates the high fidelity of the simulated data relative to available infant data in PhysioNet. The NPSS is evaluated using an automated LLM-as-judge framework across repeated test scenarios. To mitigate self-preference bias, evaluations were conducted using three distinct LLM judges (OpenAI o3-mini, Llama-3, and Mistral). Across judges, the NPSS achieved consistently high relevance scores (0.95–0.99) and strong groundedness scores (0.80–0.91), indicating that generated summaries remain on-topic and faithful to the underlying simulator data. Once validated, the NPSS is expected to reduce charting workload, improve shift handover efficiency, and streamline parental updates, addressing key clinical bottlenecks in NICU data workflows.


1. Introduction

The Neonatal Intensive Care Unit (NICU) is a high-stress, high-stakes environment requiring continuous monitoring; however, manual clinical documentation remains time-consuming, error-prone, and discontinuous. A growing suite of machine learning (ML) algorithms is being developed using video and other non-contact sensing modalities. These algorithms can estimate patient vital signs [1,2], segment the clinical scene [3], and detect patient movement [4], clinical interventions [5], and environmental/contextual factors (e.g., patient coverage, lighting level) [6]. While these tools provide a wealth of data, a new challenge arises: how can these multimodal data be fused, filtered, and distilled into summaries that are meaningful for human interpretation to ease the clinical documentation burden?
Early attempts to automate clinical summaries relied on report templates, which offered structure but lacked the natural language and flexibility required for diverse clinical use cases [6]. The recent emergence of Large Language Models (LLMs) presents a significant opportunity to generate more natural and context-aware text summaries that can be tailored to different audiences. Recent surveys have reviewed the application of LLMs for biomedical text summarization, charting their evolution and potential uses across the healthcare domain [7,8,9].
We here propose the Neonatal Patient Status Summarizer (NPSS), as conceptualized in Figure 1. Such a system would ingest data from myriad sources (e.g., patient monitors, ML algorithms) and generate tailored textual outputs for several critical use cases, such as automatically populating Electronic Health Record (EHR) charts, providing concise shift handover summaries for clinical teams, and delivering empathetic, understandable updates to parents.
However, developing and validating such a specialized LLM-based system for the NICU presents two fundamental challenges. First, training these models requires vast amounts of high-quality, domain-specific data, but access to sensitive neonatal patient data is rightly restricted. Second, for summaries to be clinically useful, the system must go beyond simply reporting data; it needs to interpret those data in the context of established clinical norms and in a way that is tailored for the specific user.
To address these challenges, we introduce an AI framework that differs from existing work, which has largely focused on summarizing static, text-only documents such as clinical notes or scientific literature [7,8,10], or capturing verbal interactions between clinicians and patients [11]. The novelty of our approach is fourfold. First, our work is specifically tailored to the unique environment of neonatal intensive care. Second, it features a multi-endpoint summarizer that generates distinct, audience-specific summaries for clinical handovers, automated charting, and parent-friendly updates. Third, where many emerging “AI Scribe” systems summarize spoken or written clinician-patient communications [11], the NPSS fuses data from multiple modalities and sources beyond text. Finally, our framework directly tackles the data scarcity issue by coupling the NPSS with a high-fidelity NICU Patient Simulator. By first generating realistic synthetic data, we are able to develop and validate a system capable of interpreting and summarizing dynamic patient status.
The first component of our framework is a NICU Patient Simulator that generates realistic vital sign data using Generative Adversarial Networks (GANs) [12,13], together with discrete intervention events. Most existing simulators either do not simulate interventions at all, or permit only operator-initiated simulated actions (e.g., neonatal resuscitation manikins) [14,15,16]. In contrast, our simulator employs a hybrid approach, using a GAN framework for continuous vital sign data and an independent Poisson-based multivariate point process to model discrete intervention events. We implement this event generator with a covariance structure that explicitly models how care events correlate with one another, with time of day, and with overall patient health, to better reflect realistic co-occurrence patterns in NICU practice [6]. This hybrid design enables the simulator to capture both the temporal dynamics of vital signs and the interdependency of discrete clinical actions, producing richer, practice-consistent training and evaluation data.
The second system is the NPSS itself, a suite of summarization tools built on a Retrieval-Augmented Generation (RAG) pipeline. The RAG architecture is essential because it dynamically grounds the LLM in factual, domain-specific knowledge. For any given prompt, it automatically retrieves the most relevant information from an external knowledge base, such as reference documents detailing healthy neonatal vital sign ranges [17], before generating a summary. This enables the NPSS to provide crucial context (e.g., flagging a vital sign as “elevated”) without hard-coding range information or risking ungrounded LLM “hallucinations”. Furthermore, by using specific prompt engineering for each target use case, this single pipeline is adapted to serve several key use cases: generating empathetic updates for parents, creating concise handovers for clinicians, and producing structured data for automated charting.
The primary contributions of this work are as follows:
  • Development of a GAN-driven NICU patient simulator capable of producing realistic neonatal respiration rate (RR) and heart rate (HR) data, optionally with arrhythmia. The simulator also generates realistic clinical and routine care events, accounting for correlations between events. Together with user-entered patient post-conceptual age, sex, and weight, these data are exported as a JSON data structure suitable for ingestion by the NPSS.
  • Development of a modular LLM-based summarization pipeline (NPSS) that uses RAG for clinical accuracy and contextual relevance across multiple use cases.
  • Validation of the summarization tools using synthetic patient care and health status data generated by the simulator. Groundedness and relevance are measured using an LLM-as-a-judge approach, across three different judge LLMs.
  • Demonstration of practical NPSS use cases, including nurse-to-nurse handovers, automated charting, and real-time parental communication. User-specific language tone and technical detail are observed across each use case.
Ultimately, this work aims to alleviate the clinical documentation burden, improve communication, and enhance parental engagement in the NICU.

1.1. Background and Related Work

Modern NICUs are data-intensive environments, representing a unique intersection of high-frequency physiological monitoring, stringent documentation standards [18], and emotionally charged communication with parents. This vast data volume, coupled with the need for real-time decision-making, places a substantial cognitive and administrative load on clinical staff. Studies have shown that NICU nurses can spend upwards of half of their time on documentation-related duties, reducing time for direct patient care [11]. Current EHR systems provide centralized access to patient records but often lack intelligent summarization, prioritization, or context-aware filtering. As a result, interpreting patient status over time remains a cognitively demanding task, especially during handovers or emergency interventions [19,20].
To manage this data overload, various Clinical Decision Support Systems (CDSS) have been developed. In pediatric and neonatal intensive care, these systems often manifest as dashboards designed to improve situation awareness and resource management. For instance, researchers at Sainte Justine’s Children’s Hospital have explored dashboards for visualizing patient status across a pediatric intensive care unit to support resource management during pandemics [21] and systems to structure data for better decision-making in patient care [22]. Our work differs in its primary objective: not to create a dashboard for visual analysis, but to produce natural language text summaries of patient status and care, driven by data from ML subsystems and patient monitors, suitable for diverse communication endpoints, including clinician handovers, automated charting, and parental updates.
Simulated patient data are essential for medical education, clinical testing, and AI system validation. Many existing neonatal simulators focus on procedural and resuscitation training through physical manikins. For example, the NeoNatalie Live manikin is used for ventilation-focused training, providing data-driven feedback on clinical skills [14], while other high-fidelity manikins allow teams to practice hands-on interventions for various neonatal conditions [15,16]. Beyond these physical models, several digital simulators are also available, such as ResusMonitor for real-time vital sign display [23], and Body Interact and Shadow Health patient simulators for scenario-based clinical reasoning [24,25]. Our approach differs fundamentally in its goal: to create a generative NICU patient simulator capable of producing realistic, multi-day streams of vital sign and intervention data, specifically for the purpose of developing and testing the NPSS.
To create the rich time-series vital sign data needed for our summarizer, we turned to generative models [26,27]. A Generative Adversarial Network (GAN) consists of two neural networks (a generator and a discriminator) that are trained simultaneously in an adversarial process [12]. The generator creates synthetic data samples, while the discriminator’s objective is to distinguish these synthetic samples from real data. Through these competing training objectives, the generator learns to produce data that are indistinguishable from the real data on which it was trained [12]. Originally developed for image generation, GANs have since been adapted for multivariate time series generation [13], and specifically for (adult) vital sign generation [27].
We are not aware of any publicly available neonatal dataset that provides continuous, long-term vital sign information along with diagnoses and annotated clinical intervention events. Existing datasets tend to provide only short recordings of patient vital signs and, importantly, often explicitly avoid any time period during which critical care interventions or routine care are being administered, since this would complicate vital sign estimation systems. In fact, this lack of complex patient data motivated our creation of the neonatal patient simulator: we wish to create input data for developing and evaluating the NPSS, along with gold standard annotations of what actually transpired during the 8-h period to be summarized.
Recent advances in LLMs have enabled the generation of realistic clinical text, offering potential for automating healthcare documentation [11]. However, ensuring factual accuracy remains a significant hurdle. LLMs can “hallucinate,” producing plausible yet incorrect information, a risk that is unacceptable in safety-critical applications, such as neonatal care [28].
To address this, many systems now use RAG, which enhances LLMs by grounding generation in external, domain-specific knowledge sources. RAG was formalized by Lewis et al. as an approach that combines pretrained generative models with dense retrieval mechanisms for more factual and contextually grounded outputs [29]. In healthcare-specific contexts, Neha et al. identify EHR summarization as a prevalent application, highlighting RAG’s potential to leverage curated knowledge to reduce hallucination risks while minimizing omissions and clinical inaccuracies in discharge notes and patient records [30].
RAG can retrieve pre-encoded, relevant reference knowledge from the vector store, and embed that knowledge into the system prompt before any text is generated. In our work, we adopt a similar architecture to ensure that summaries referencing neonatal vital signs are grounded in clinically validated norms (see Section 2).

Research Gaps Motivating This Study

The current state of the art presents a clear opportunity and a distinct gap. On one hand, AI Scribe initiatives are emerging to automate the summarization of spoken clinical interactions, with several reviews highlighting progress in speech-to-text documentation and ambient clinical intelligence [11]. AI Scribes are seeing rapid adoption, with one 2025 study reporting 15,791 h of documentation time saved across 2.5 million doctor-patient encounters through the adoption of AI Scribe technology, while both doctors and patients reported more positive interactions [31]. In parallel, a suite of machine learning algorithms can now predict numerous aspects of patient status (vital signs, interventions, patient segmentation, environmental variables, etc.) from sensor data. On the other hand, a framework for summarizing these diverse, multimodal data streams into coherent text summaries suitable for different audiences (parents, nurses, EHRs) is notably absent. Furthermore, no extant patient simulator is capable of generating the realistic, temporally rich neonatal vital sign and discrete event data required to rigorously develop and validate such a summarization system. This work addresses these specific gaps by creating a modular framework where a high-fidelity GAN-based simulator produces the necessary data to develop and validate a multi-purpose, RAG-enhanced LLM patient status summarization pipeline.

2. Materials and Methods

This section outlines the methodologies employed in the design and development of the two core components of this research: the NICU Patient Simulator and the NPSS. These systems were developed as distinct modules, with the simulator providing a robust platform for generating realistic neonatal data, which subsequently serves as input for the development and validation of the LLM-based NPSS.

2.1. NICU Patient Simulator Development

The NICU Patient Simulator was engineered to generate synthetic, temporally coherent neonatal patient data, simulating an 8-h clinical shift with data points at 5-min intervals. The development focused on realistic vital sign generation and discrete intervention event modeling.

2.1.1. GAN Architecture and Training

The simulator uses separate, dedicated GANs to generate realistic time series of (i) healthy RR, (ii) healthy HR, and (iii) arrhythmia HR episodes (bradycardia and tachycardia). This modular approach to HR simulation was chosen because the modeling requirements for normal vital signs are fundamentally different from those for pathological cardiac events. The baseline GAN produces a continuous, stochastic baseline signal exhibiting typical mean and variance for the entire simulated period. The pathological GAN simulates episodic periods of bradycardia and tachycardia behavior, with abrupt deviations characteristic of these events (e.g., HR dropping to 50–90 bpm for bradycardia, rising to 140–210 bpm for tachycardia [32]). When a pathological condition is selected, the specific GAN generator for that disease state is activated to directly generate the corresponding pathological vital sign time series.
The baseline GAN generator uses a simple feedforward network with an input layer of 10 nodes (the random noise vector), followed by a hidden layer of 64 nodes, and a second hidden layer of 32 nodes, before outputting a single vital sign value (1 node). The full architecture is: Linear(10, 64) → ReLU → BatchNorm → Linear(64, 32) → ReLU → Linear(32, 1) → Sigmoid.
The pathological GAN (used for Bradycardia and Tachycardia) is both deeper and wider. It begins with an input layer of 10 nodes and uses successive hidden layers of 128 nodes, 256 nodes, and 96 nodes, respectively. The full architecture is: Linear(10, 128) → ReLU → Linear(128, 256) → ReLU → Linear(256, 96) → ReLU → Linear(96, 1) → Sigmoid. The increased capacity of this network is required to model the complex, non-periodic patterns of arrhythmia.
The discriminator architecture for both GANs is an MLP structured to receive a single vital sign sample and output a probability score [33]. The model takes a single value (size 1) as input, followed by hidden layers of 32 nodes and 16 nodes, before outputting a single scalar value (size 1) representing the probability of the input being a real sample. The architecture is: Linear(1, 32) → ReLU → Linear(32, 16) → ReLU → Linear(16, 1) → Sigmoid.
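To make the difference in capacity between the three networks concrete, the following minimal sketch counts trainable parameters from the layer widths listed above. It assumes every Linear layer carries a bias term and that BatchNorm contributes a learnable scale and shift per feature; neither detail is stated in the text, and the function names are hypothetical.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer (bias assumed)."""
    return n_in * n_out + n_out


def mlp_params(widths: list[int]) -> int:
    """Total parameters of a chain of Linear layers with the given widths."""
    return sum(linear_params(a, b) for a, b in zip(widths, widths[1:]))


# Layer widths taken from the architecture descriptions in the text.
# BatchNorm after the first hidden layer: assumed 2 parameters per unit.
baseline_generator = mlp_params([10, 64, 32, 1]) + 2 * 64
pathological_generator = mlp_params([10, 128, 256, 96, 1])
discriminator = mlp_params([1, 32, 16, 1])

print("baseline generator:    ", baseline_generator)
print("pathological generator:", pathological_generator)
print("discriminator:         ", discriminator)
```

Under these assumptions the pathological generator has roughly twenty times as many parameters as the baseline generator, which is consistent with the text's description of it as deeper and wider.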
For both the baseline and pathological models, the generator receives a 10-dimensional random noise vector and outputs a single vital sign value (e.g., one HR sample), which is then denormalized. As the architecture is a non-recurrent MLP, generating a time series (e.g., 5 min of data) requires running the generator many independent times to produce one sample every second. Each output sample is therefore statistically independent of the previous one, meaning the model captures the marginal distribution of the vital signs but not the temporal dynamics (i.e., the smooth progression of the time series). To enhance clinical realism, all generated time-series vital sign data underwent light smoothing using a Gaussian filter ($\sigma = 2$) to eliminate physiologically implausible, high-frequency fluctuations.
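The effect of smoothing independently drawn samples can be illustrated with a small standard-library sketch. The uniform random draws below are only a stand-in for generator outputs, and the kernel truncation (4σ) and edge handling (repeating the end values) are assumptions, since the text does not name the filter implementation.

```python
import math
import random


def gaussian_kernel(sigma: float, radius: int) -> list[float]:
    """Normalized, truncated Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(series: list[float], sigma: float = 2.0) -> list[float]:
    """Gaussian-smooth a 1-D series, repeating edge values at the borders."""
    radius = int(4 * sigma + 0.5)  # assumed truncation at 4 sigma
    kernel = gaussian_kernel(sigma, radius)
    n = len(series)
    out = []
    for t in range(n):
        acc = 0.0
        for offset, w in zip(range(-radius, radius + 1), kernel):
            idx = min(max(t + offset, 0), n - 1)  # clamp index to the series
            acc += w * series[idx]
        out.append(acc)
    return out


def mean_abs_step(xs: list[float]) -> float:
    """Average sample-to-sample jump, a crude roughness measure."""
    return sum(abs(b - a) for a, b in zip(xs, xs[1:])) / (len(xs) - 1)


rng = random.Random(0)
# Stand-in for 5 min of independent 1 Hz generator outputs (not real data).
raw = [rng.uniform(110.0, 150.0) for _ in range(300)]
smoothed = smooth(raw, sigma=2.0)

print("roughness before:", round(mean_abs_step(raw), 2))
print("roughness after: ", round(mean_abs_step(smoothed), 2))
```

Because each smoothed value is a weighted average of its neighbours, the output stays inside the range of the raw samples while the second-to-second jumps shrink.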

2.1.2. Data Sources and Preprocessing for GAN Training

The MLP-GANs [33] used for “healthy” vital sign synthesis were trained using data from the “Preterm Infant Cardio-Respiratory Signals Database” on PhysioNet [34]. Due to the scarcity of neonatal arrhythmia data, training data for the “arrhythmia” GANs were instead drawn from adult bradycardia and tachycardia data in the “Model for Simulating ECG and PPG Signals with Arrhythmia Episodes” dataset [35]. It is important to note that training the arrhythmia GANs on adapted adult data represents a methodological limitation, as the underlying physiological patterns of arrhythmia may differ between adults and neonates [36].
To adapt the adult arrhythmia data for neonatal physiology, a normalization and scaling method was employed. The adult beats-per-minute (BPM) sequences were first normalized to a [0, 1] range using min-max normalization. Following GAN training, the synthetic data were denormalized to physiologically relevant neonatal ranges, as follows. For baseline HR and RR, data were normalized and then mean-shifted to a healthy range (110–150 BPM [32] and 30–100 breaths/min, respectively). For the bradycardia model, the generated HR data were scaled to a target range of 50–90 BPM [37], while for the tachycardia model, the data were scaled to 140–210 BPM [38]. This process ensured that the temporal patterns learned from the adult arrhythmia data were preserved while adapting the absolute HR values to reflect neonatal cardiac events.
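The normalize-then-rescale step for the arrhythmia models can be sketched as follows. The sketch assumes a plain linear mapping from [0, 1] onto the target range, which is one reading of "scaled to a target range"; the adult BPM values are illustrative placeholders rather than values from the cited dataset, and the function names are hypothetical.

```python
def min_max_normalize(values: list[float]) -> list[float]:
    """Map values linearly onto [0, 1] using their own minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant sequence")
    return [(v - lo) / (hi - lo) for v in values]


def rescale(unit_values: list[float], target_lo: float, target_hi: float) -> list[float]:
    """Map values in [0, 1] linearly onto [target_lo, target_hi]."""
    return [target_lo + u * (target_hi - target_lo) for u in unit_values]


# Target neonatal ranges quoted in the text (BPM).
NEONATAL_RANGES = {"bradycardia": (50.0, 90.0), "tachycardia": (140.0, 210.0)}

adult_bpm = [38.0, 42.0, 45.0, 40.0, 48.0]  # illustrative placeholder sequence
unit = min_max_normalize(adult_bpm)
neonatal = rescale(unit, *NEONATAL_RANGES["bradycardia"])

print([round(v, 1) for v in neonatal])
```

A linear map of this kind preserves the ordering and relative spacing of the samples, which is the sense in which the learned pattern survives while the absolute values move into the neonatal range.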
All models were trained using Binary Cross-Entropy (BCE) loss and the Adam optimizer, with a learning rate of 0.0002 and a batch size of 16. The healthy HR/RR and Tachycardia/Bradycardia generators were trained for 5000 epochs.

2.1.3. Intervention Modeling

The simulator models both scheduled and unscheduled interventions to reflect the complexities of NICU care. Scheduled interventions are those that are pre-planned and occur at set times, such as medication administration, feeding, and routine patient assessments. Unscheduled interventions, in contrast, are reactive events that are initiated by a change in the patient’s condition, such as administering oxygen in response to a desaturation event.
The occurrence of unscheduled interventions is governed by a multivariate Poisson process [39], where the probability of observing $k$ events, $P(k)$, is determined by the rate of occurrence, $\lambda$. A separate average rate ($\lambda_i$) is dynamically computed for each distinct type of unscheduled intervention. The probability that a specific intervention has occurred at least once by a given time $t$ follows from its rate: $P(t) = 1 - e^{-\lambda t}$. This was implemented using numpy’s random.poisson(rate), where rate represents the average number of events in a given interval.
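As a quick check of the relationship between a rate and the probability of at least one event, the sketch below compares the closed-form value $1 - e^{-\lambda t}$ with a small simulation. It uses exponential inter-arrival times from the standard library rather than numpy's Poisson sampler named in the text; both describe the same homogeneous Poisson process. The rate of 0.25 events per hour is a hypothetical value chosen only for illustration.

```python
import math
import random


def prob_event_by(rate: float, t: float) -> float:
    """Probability of at least one event in [0, t] for a Poisson process."""
    return 1.0 - math.exp(-rate * t)


def simulate_event_times(rate: float, horizon: float, rng: random.Random) -> list[float]:
    """Event times in [0, horizon) drawn via exponential inter-arrival gaps."""
    times = []
    t = rng.expovariate(rate)
    while t < horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times


rng = random.Random(42)
rate_per_hour = 0.25   # hypothetical rate for one intervention type
shift_hours = 8.0      # shift length used throughout the paper
trials = 5000

hits = sum(
    1 for _ in range(trials)
    if simulate_event_times(rate_per_hour, shift_hours, rng)
)
empirical = hits / trials
analytic = prob_event_by(rate_per_hour, shift_hours)

print("analytic :", round(analytic, 4))
print("empirical:", round(empirical, 4))
```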
These rates are dynamically adjusted based on the time of day to reflect clinical workflows, such as reducing non-essential interventions at night. Interventions are modeled as instantaneous, discrete events marked by a single time stamp, meaning interventions do not have explicit durations.
Some clinical interventions increase the likelihood of other interventions, such as the clinical practice of “bundle care,” where multiple interventions are performed during a single patient interaction to minimize sleep disruptions. The need to model dependencies (covariance) between multiple count-based events derived from a Poisson framework is a common problem in statistical modeling [40]. To capture such relationships, the simulator models the likelihood of co-occurring events using a covariance matrix (Figure 2) that defines the probability of one intervention type occurring, given another. Mathematically, the covariance matrix provides a set of scaling factors used to compute the instantaneous rate $\lambda_i(t)$ for any intervention $i$, making the process conditional on all other interventions active at time $t$. The adjusted rate is calculated as:
$$\lambda_i(t) = \lambda_{\mathrm{base},i} \times F_{\mathrm{Time}}(t) \times \max\left(0,\; 1 + \sum_{j \neq i} \mathrm{Cov}_{ij} \cdot I_j(t)\right)$$
where $\lambda_{\mathrm{base},i}$ is the base rate, $F_{\mathrm{Time}}(t)$ is the time-of-day scaling factor, $\mathrm{Cov}_{ij}$ is the covariance value between intervention $i$ and $j$, and $I_j(t)$ is an indicator function (1 if intervention $j$ is active, 0 otherwise) [39]. This approach allows the model to capture event dependencies: if a positive correlation ($\mathrm{Cov}_{ij} > 0$) exists between Intervention A and active Intervention B, the base rate $\lambda_A$ is temporarily increased. Conversely, negative correlations ($\mathrm{Cov}_{ij} < 0$) are used to model inhibited concurrent events. These covariance values were established based on clinical practice literature to reflect common care routines [41]. For example, a high correlation exists between “lighting adjustment” and “family visitation” to reflect that lights are often turned up when family is present, whereas a negative correlation was used for invasive interventions (e.g., imaging) to reflect their unlikely occurrence during a family visit. This system is designed to be configurable to match the specific policies and procedures of a given NICU.
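The adjusted-rate formula can be exercised with a short sketch. All numeric values below (base rates, covariance entries, time-of-day factor) are hypothetical placeholders, not the values in Figure 2, and the dictionary layout is only one possible way to hold the matrix.

```python
def adjusted_rate(
    i: str,
    base_rates: dict[str, float],
    time_factor: float,
    cov: dict[str, dict[str, float]],
    active: set[str],
) -> float:
    """lambda_i(t) = base_i * F_time(t) * max(0, 1 + sum_{j != i} Cov_ij * I_j(t))."""
    modulation = 1.0 + sum(cov[i].get(j, 0.0) for j in active if j != i)
    return base_rates[i] * time_factor * max(0.0, modulation)


# Hypothetical base rates (events per hour) and covariance entries.
base_rates = {"lighting_adjustment": 0.2, "family_visitation": 0.1, "imaging": 0.1}
cov = {
    "lighting_adjustment": {"family_visitation": 0.8},
    "family_visitation": {"lighting_adjustment": 0.8, "imaging": -1.5},
    "imaging": {"family_visitation": -1.5},
}

active_now = {"family_visitation"}  # I_j(t) = 1 only for this intervention
day_factor = 1.0                    # hypothetical daytime scaling

lighting = adjusted_rate("lighting_adjustment", base_rates, day_factor, cov, active_now)
imaging = adjusted_rate("imaging", base_rates, day_factor, cov, active_now)

print("lighting adjustment rate:", round(lighting, 3))
print("imaging rate:            ", round(imaging, 3))
```

With a family visit active, the positive entry raises the lighting-adjustment rate above its base value, while the strongly negative entry drives the imaging rate to zero through the `max(0, ...)` clamp.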
Ultimately, the simulator outputs a structured JSON file for each simulated 8-h patient shift. Each JSON entry corresponds to a 5-min interval, containing a timestamp, vital signs (HR, RR, temperature; at 1 Hz), user-entered clinical data (post-conceptual age, sex, weight), and any interventions occurring during that 5-min interval.
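One possible shape for this output is sketched below. Every field name is hypothetical, since the text does not give the schema; the patient attributes are illustrative, and the vital sign arrays are left as `None` placeholders where generator output would go.

```python
import json
from datetime import datetime, timedelta

SHIFT_HOURS = 8
INTERVAL_MINUTES = 5
SAMPLES_PER_INTERVAL = INTERVAL_MINUTES * 60  # 1 Hz sampling


def build_shift(start: datetime, patient: dict, interventions: dict[int, list[str]]) -> list[dict]:
    """Assemble one entry per 5-min interval of an 8-h shift."""
    n_intervals = SHIFT_HOURS * 60 // INTERVAL_MINUTES
    entries = []
    for i in range(n_intervals):
        entries.append({
            "timestamp": (start + timedelta(minutes=i * INTERVAL_MINUTES)).isoformat(),
            "patient": patient,
            "vitals": {
                # Placeholders: real values would come from the generators.
                "hr_bpm": [None] * SAMPLES_PER_INTERVAL,
                "rr_brpm": [None] * SAMPLES_PER_INTERVAL,
                "temperature_c": [None] * SAMPLES_PER_INTERVAL,
            },
            "interventions": interventions.get(i, []),
        })
    return entries


# Illustrative user-entered attributes and one example event.
patient = {"post_conceptual_age_weeks": 34, "sex": "F", "weight_g": 2100}
shift = build_shift(datetime(2026, 1, 1, 7, 0), patient, {12: ["feeding"]})
payload = json.dumps(shift)

print(len(shift), "entries,", len(payload), "characters of JSON")
```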

2.2. Text Summarization Pipeline

The NPSS was developed to process patient data into summaries tailored for nurses and parents. In an actual deployment, the input data would come from a combination of patient monitoring sensors and ML-based predictions. For the purpose of developing and validating the NPSS, data were sourced from our Patient Simulator. The system leverages LLMs within a RAG framework, developed using LangChain [42], to enhance factual grounding and mitigate hallucinations.
The fundamental RAG pipeline, depicted in Figure 3, orchestrates the generation of context-aware summaries using several modules from LangChain. The process begins when a user input prompt is received. This input triggers the Memory-Augmented Conversational RAG chain, which manages the entire workflow. The chain first engages the Chat History Manager to pull the relevant conversation history from the Chat History Database. Concurrently, a Context Aggregator initiates the retrieval of information from two distinct vectorized sources: one containing the patient’s clinical data and the other containing external reference documents defining normative vital signs.
The retrieved patient data, the vital sign context, and the chat history are then consolidated in the Combine and Fill System Prompt step. This step dynamically populates a structured prompt template with all necessary information. The complete prompt is then passed to the LLM QA Execution stage, where the language model (OpenAI’s o3-mini) processes the context-rich input and generates a response. This response is finalized and, in a critical feedback loop, saved to the Chat History Database to maintain memory for subsequent interactions. Finally, the answer is returned to the user. This architecture ensures that each response is grounded in both the specific patient’s data and broader clinical knowledge while maintaining conversational coherence.
Patient data, structured as JSON files, are ingested using LangChain’s JSONLoader. The data are then segmented into semantically coherent chunks using a RecursiveCharacterTextSplitter, with an empirically determined chunk size of 5000 characters.

2.2.1. Embedding & Retrieval

Textual chunks are converted into vector embeddings using OpenAI’s text-embedding-ada-002 model and stored in an InMemoryVectorStore. To address the distinct structural differences between the source data streams (i.e., JSON patient data vs. textual reference data), we implemented a dual-retrieval strategy with hyperparameters tailored to each. Retrieval was performed using cosine similarity without an additional reranking step.
The first retrieval pathway, managed by a History-Aware Retriever, handles patient data. For this data source, maintaining temporal context was critical for identifying trends over the 8-h shift. Consequently, we utilized a large chunk size of 5000 characters with a 200-character overlap (using LangChain’s RecursiveCharacterTextSplitter [43]) and retrieved the top-5 ($k = 5$) most relevant chunks. This ensures the LLM has access to a comprehensive, temporally coherent view of the patient’s recent history.
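The role of chunk size and overlap can be seen in a simplified fixed-width chunker. This is not LangChain's RecursiveCharacterTextSplitter, which additionally tries to split on separators; it is a minimal stand-in that only reproduces the size and overlap arithmetic, and the function name is hypothetical.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters sharing `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


# Synthetic 12,000-character string standing in for serialized patient JSON.
document = "".join(str(i % 10) for i in range(12000))
patient_chunks = chunk_text(document, chunk_size=5000, overlap=200)

print([len(c) for c in patient_chunks])
```

The 200 shared characters mean that a record falling on a chunk boundary appears in both neighbouring chunks, so a trend that straddles the boundary is less likely to be cut in half.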
The second retrieval pathway fetches relevant excerpts from reference material [17]. This domain required high-precision lookup of specific facts, such as normative vital sign ranges. For this index, we employed a much smaller chunk size of 250 characters with no overlap and a retrieval limit of $k = 1$. This granular chunking strategy minimizes noise, ensuring that the retrieved context is strictly limited to the specific clinical definition or threshold requested by the query, thereby reducing the risk of hallucinations derived from irrelevant adjacent text. Retrieved information from both sources is then supplied to the LLM.
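Cosine-similarity retrieval with a different $k$ per index can be sketched in a few lines. The three-dimensional vectors and chunk labels below are toy placeholders standing in for real embeddings and text, and the helper names are hypothetical; no reranking is applied, matching the description above.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], store: list[tuple[str, list[float]]], k: int) -> list[str]:
    """Return the labels of the k stored chunks most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query, item[1]), reverse=True)
    return [label for label, _ in ranked[:k]]


# Toy reference index: placeholder labels and made-up vectors.
reference_store = [
    ("reference chunk: heart rate", [0.9, 0.1, 0.0]),
    ("reference chunk: respiration rate", [0.1, 0.9, 0.0]),
    ("reference chunk: temperature", [0.0, 0.1, 0.9]),
]
# Toy patient index with six chunks, also made up.
patient_store = [(f"patient chunk {i}", [1.0, i / 10.0, 0.1]) for i in range(6)]

query_vector = [0.8, 0.2, 0.1]  # stand-in for an embedded user question
reference_hits = top_k(query_vector, reference_store, k=1)  # high-precision lookup
patient_hits = top_k(query_vector, patient_store, k=5)      # broader temporal context

print(reference_hits)
print(patient_hits)
```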
While the pipeline is designed to accommodate various pre-trained models (including open-source alternatives), o3-mini was used for this implementation, primarily due to its strong ability to follow instructions provided in a system prompt. The retrieval hyperparameters for the NPSS are detailed in Table 1.

2.2.2. Prompt Engineering & Summarization Modules

Custom system prompts were engineered to guide the LLM’s behavior for each specific use case, resulting in four distinct modules tailored to different clinical needs. For parent-facing updates, the Parent Update Tool generates empathetic and simplified patient updates. Its system prompt explicitly instructs the LLM to “provide a compassionate update … clear, reassuring, and easily understandable by non-medical individuals”. In this way, the model is guided to avoid clinical jargon. This tool produces draft messages that would be vetted by a clinician prior to releasing the response to the parent.
In contrast, for the modules intended for clinical staff, including the Nurse Shift Summarizer and the Interactive Nurse Chatbot tool, the system prompt is designed for technical precision, instructing the model to “Provide the incoming nurse with all pertinent information … in a concise, technical summary”. The LLM was directed to use clinical terminology, such as “bradycardia” or “tachypnea”, when patient vitals fall outside the retrieved normative ranges. The Interactive Nurse Chatbot tool leverages conversational history to allow for iterative querying on specific events or trends.
Finally, the Auto-Charter module is designed for documentation of patient care and status in an EHR; its prompt uniquely instructs the LLM to generate a structured JSON report, enforcing adherence to clinical terminology and precise categorization of abnormalities. The prompt specifies a detailed sample JSON schema with fields like “Associated_symptoms,” “Interventions,” and “Progress_notes,” and commands the model to produce “no commentary, diagnosis, or extra text—only the JSON object”. This structured output is suitable for automatically populating an entry in an EHR system, following human verification.
This role-specific prompting ensures that, for parent updates, the model emphasizes empathy and plain language, while for clinical handovers and charting, it prioritizes concise technical language and structured reporting.
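The idea of one pipeline with role-specific system prompts can be sketched as a simple lookup plus template fill. The template layout, module keys, and function name are hypothetical. The first two instruction strings reuse the fragments quoted above, with their elisions left as they appear; the third is a paraphrase, not the authors' wording.

```python
# Instruction fragments per module. The first two are the fragments quoted in
# the text (elisions preserved); the third is a paraphrase for illustration.
ROLE_INSTRUCTIONS = {
    "parent_update": (
        "provide a compassionate update … clear, reassuring, and easily "
        "understandable by non-medical individuals"
    ),
    "nurse_shift_summary": (
        "Provide the incoming nurse with all pertinent information … "
        "in a concise, technical summary"
    ),
    "auto_charter": "Return only a JSON object that follows the sample schema.",
}

# Hypothetical layout for the filled system prompt.
TEMPLATE = (
    "{instruction}\n\n"
    "Patient data:\n{patient_context}\n\n"
    "Reference ranges:\n{reference_context}\n\n"
    "Conversation so far:\n{chat_history}"
)


def build_system_prompt(
    module: str,
    patient_context: str,
    reference_context: str,
    chat_history: str = "",
) -> str:
    """Fill the shared template with the instruction for the chosen module."""
    if module not in ROLE_INSTRUCTIONS:
        raise KeyError(f"unknown module: {module}")
    return TEMPLATE.format(
        instruction=ROLE_INSTRUCTIONS[module],
        patient_context=patient_context,
        reference_context=reference_context,
        chat_history=chat_history,
    )


prompt = build_system_prompt(
    "parent_update",
    patient_context='{"interventions": ["feeding"]}',  # placeholder retrieved chunk
    reference_context="(retrieved reference excerpt)",  # placeholder
)
print(prompt)
```

Keeping the retrieval and template shared while swapping only the instruction is what lets a single pipeline serve several audiences.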

2.3. Experimental Validation

A comprehensive validation protocol was established to assess the functionality, reliability, and clinical alignment of both the NICU Patient Simulator and the NPSS. The validation for each system focused on distinct criteria appropriate to its intended function, with the simulator’s realism being vital for its role in downstream summarizer testing, and the summarizers being evaluated for accuracy, coherence, and domain appropriateness.

2.3.1. Simulator Evaluation Methodology

The NICU Patient Simulator’s outputs were evaluated for physiological realism and temporal coherence against real-world neonatal data sourced from the previously mentioned PhysioNet databases [34,35]. The primary objective was to ensure that the synthetic data, representing an 8-h NICU shift, was clinically realistic and suitable for subsequent use in validating the summarization tools. Validation was performed across several categories of generated vital signs: normal HR, normal RR, tachycardia episodes, and bradycardia episodes.
Distributional Similarity Metrics
To quantify the similarity between the distributions of synthetic and real vital sign data, two primary statistical tests were employed. These metrics were chosen to provide a comprehensive comparison of both the shape and distance between the data distributions. First, the Kolmogorov-Smirnov (KS) Test, a non-parametric test, was used to compare the empirical cumulative distribution functions (ECDFs) of the synthetic HR and RR data against reference Physionet datasets. The KS test assesses whether two samples are likely drawn from the same underlying distribution; a p-value above the chosen significance level (0.05 in this study) indicates a lack of statistically significant difference [44]. Additionally, the Wasserstein Distance, also known as the Earth Mover’s Distance, was used to quantify the minimum “cost” required to transform one probability distribution into another. A lower Wasserstein distance signifies a closer alignment between the synthetic and reference distributions, providing an intuitive measure of similarity [45].
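Both metrics can be written directly in terms of the two ECDFs. The sketch below uses only the Python standard library and is meant to illustrate the definitions, not to reproduce the authors' code; the function names are assumptions. It computes the two-sample KS statistic (the largest vertical gap between the ECDFs) and the one-dimensional Wasserstein distance (the area between them), but not the KS p-value.

```python
from bisect import bisect_right


def _ecdf(sorted_sample, x):
    """Fraction of observations less than or equal to x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)


def ks_statistic(real, synthetic):
    """Two-sample KS statistic: the largest vertical gap between the ECDFs."""
    a, b = sorted(real), sorted(synthetic)
    return max(abs(_ecdf(a, x) - _ecdf(b, x)) for x in a + b)


def wasserstein_1(real, synthetic):
    """1-D Wasserstein (Earth Mover's) distance: area between the ECDFs."""
    a, b = sorted(real), sorted(synthetic)
    grid = sorted(set(a) | set(b))
    return sum(
        abs(_ecdf(a, left) - _ecdf(b, left)) * (right - left)
        for left, right in zip(grid, grid[1:])
    )
```

In practice, `scipy.stats.ks_2samp` and `scipy.stats.wasserstein_distance` provide the same quantities, with the former also returning the p-value; the text does not state which library was used in the study.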
Summary Statistics
Standard descriptive statistics were calculated for both synthetic and reference datasets to compare their central tendency, dispersion, and shape. These included the mean and standard deviation (STD) to ensure that synthetic vital signs fell within expected physiological ranges and exhibited comparable variability to real data. Furthermore, skewness and kurtosis were used to assess the asymmetry and the “tailedness” (propensity for extreme values) of the data distributions, respectively, providing insights into the shape of the generated data compared to references.
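The four descriptive statistics follow from the first four moments of the data. The sketch below is an illustration under stated assumptions: it uses population (not sample-corrected) moments and Pearson (non-excess) kurtosis, for which a normal distribution gives 3. Which estimator conventions the study used is not stated in the text, and `describe` is a hypothetical helper name.

```python
import math


def describe(values):
    """Mean, standard deviation, skewness and kurtosis from population moments."""
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "std": math.sqrt(m2),
        "skewness": m3 / m2 ** 1.5,
        # Pearson (non-excess) kurtosis: 3 for a normal distribution.
        "kurtosis": m4 / m2 ** 2,
    }
```

Applying the same function to the synthetic and the reference series gives directly comparable values for each statistic.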
Arrhythmia Simulation
The simulator’s ability to replicate cardiac arrhythmia, namely tachycardia and bradycardia, was assessed by examining these episodic events from the generated data. For these episodes, the following characteristics were compared against reference data: the total count of episodes, the mean duration of episodes in timesteps (calculated as the average across all extracted episodes), and the mean maximum BPM (for tachycardia) or mean minimum BPM (for bradycardia) (also averaged across all extracted episodes). Distributional similarity for episode duration and max/min BPM was also assessed using the KS test.
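The episode-level characteristics described above can be obtained by splitting a heart-rate series into contiguous runs beyond a threshold. The following sketch shows one way to do this; the threshold values in the usage note are illustrative only and are not the thresholds used in the study, and `extract_episodes` and `episode_summary` are hypothetical names.

```python
def extract_episodes(hr, threshold, above=True):
    """Split a heart-rate series into maximal runs beyond `threshold`.

    above=True finds tachycardia-like runs (HR > threshold);
    above=False finds bradycardia-like runs (HR < threshold).
    """
    episodes, current = [], []
    for bpm in hr:
        beyond = bpm > threshold if above else bpm < threshold
        if beyond:
            current.append(bpm)
        elif current:
            episodes.append(current)
            current = []
    if current:  # the series ended in the middle of an episode
        episodes.append(current)
    return episodes


def episode_summary(episodes, above=True):
    """Count, mean duration (timesteps) and mean extreme BPM across episodes."""
    if not episodes:
        return {"count": 0, "mean_duration": 0.0, "mean_extreme_bpm": None}
    extreme = max if above else min
    return {
        "count": len(episodes),
        "mean_duration": sum(len(e) for e in episodes) / len(episodes),
        "mean_extreme_bpm": sum(extreme(e) for e in episodes) / len(episodes),
    }
```

For example, `episode_summary(extract_episodes(hr, 180))` summarizes runs above an illustrative 180 BPM cut-off, and passing `above=False` to both functions gives the corresponding bradycardia summary with the mean minimum BPM.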

2.3.2. NPSS Validation Framework

The NPSS was evaluated using a structured, automated multi-judge validation framework designed to assess clinical relevance and factual faithfulness of generated summaries. Because all NPSS tools (i.e., the Parent Update Tool, Nurse Shift Summarizer, Interactive Nurse Chatbot, and Auto-Charter) share the same underlying RAG architecture, validation was conducted using the Interactive Nurse Chatbot interface as a representative implementation.
A total of 60 test runs were performed, comprising 10 repetitions of each of six prompt-based experiment types: (1) generating a full 8-h handover summary; (2) listing all bradycardia episodes; (3) listing all tachycardia episodes; (4) identifying respiratory abnormalities; (5) constructing an intervention timeline; and (6) evaluating vital sign trends against established neonatal reference ranges [17].
Each experiment type was validated against simulator-generated ground truth data to ensure objective assessment.
Automated evaluation was implemented using the LangSmith platform. Rather than employing traditional string-matching evaluators (e.g., QAEvaluator), which are unsuitable for free-form clinical summarization tasks, the evaluation prioritized semantic alignment and factual consistency.
Two custom evaluators were defined under an LLM-as-judge paradigm [46,47], wherein an auxiliary LLM independently scores the NPSS output according to predefined criteria. Relevance quantifies the degree to which the generated response directly addresses the clinical prompt and remains focused on the requested task. Groundedness quantifies the extent to which statements in the generated summary are factually supported by the retrieved source context. For each test run, all RAG-retrieved context (both simulated patient data and external neonatal reference material [17]) was programmatically aggregated and provided to the judge model. Groundedness therefore serves as a structured assessment of factual faithfulness and hallucination risk.
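The structure of a groundedness evaluator of this kind can be sketched without committing to any particular platform. In the Python sketch below, the judge is any callable that maps a prompt string to a reply string; the rubric wording, the function names, and the numeric-reply convention are all assumptions made for illustration and do not reproduce the study's LangSmith evaluators.

```python
import re
from typing import Callable

GROUNDEDNESS_RUBRIC = (
    "Rate from 0 to 1 how well every statement in the RESPONSE is supported "
    "by the CONTEXT. Reply with the number only."
)  # illustrative wording, not the study's rubric


def build_judge_prompt(context_chunks, response):
    """Aggregate all retrieved context and pair it with the generated text."""
    context = "\n---\n".join(context_chunks)
    return f"{GROUNDEDNESS_RUBRIC}\n\nCONTEXT:\n{context}\n\nRESPONSE:\n{response}"


def score_groundedness(judge: Callable[[str], str], context_chunks, response):
    """Ask a judge model for a score and clamp it to the range [0, 1]."""
    reply = judge(build_judge_prompt(context_chunks, response))
    match = re.search(r"\d*\.?\d+", reply)
    if match is None:
        raise ValueError(f"judge reply contains no number: {reply!r}")
    return min(1.0, max(0.0, float(match.group())))
```

Keeping the judge as a plain callable makes it straightforward to swap one judge model for another while holding the prompt and the aggregated context fixed.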

3. Results

This section presents the quantitative findings from the validation of the NICU Patient Simulator and the NPSS, following the methodology outlined in Section 2.3 (Experimental Validation).

3.1. NICU Patient Simulator Results

The simulator’s performance was evaluated based on its ability to generate realistic vital signs for normal neonatal physiology and specific pathological conditions. The results are derived from an 8-h simulation for each physiological state (normal, tachycardia, bradycardia, etc.).

3.1.1. Synthetic Vital Sign Realism

The synthetic vital sign data for normal HR and RR were compared against reference neonatal data [34] using distributional similarity metrics and summary statistics.
As shown in Table 2, the synthetic HR data closely matched the reference data. The mean HR was 128.4 BPM (real) versus 127.0 BPM (synthetic), with standard deviations of 12.7 and 13.5, respectively. The Kolmogorov-Smirnov (KS) test yielded a p-value of 0.320, and the Wasserstein distance was 2.30, indicating no statistically significant difference between the distributions and a close alignment. Skewness (0.15 vs. 0.12) and kurtosis (3.10 vs. 2.85) were also comparable. Ideally, we would expect a KS statistic near zero, a KS p-value approaching 1, and a Wasserstein distance close to zero, which would indicate that the real and synthetic distributions are nearly indistinguishable. While our observed p-value is not close to 1, it is well above the conventional threshold of 0.05, meaning we cannot reject the null hypothesis that the real and synthetic data are derived from the same distribution. In practical terms, this implies that the small differences observed may be explained by random variation rather than a systematic difference in the underlying distributional shape. This close statistical alignment is visually confirmed in Figure 4, which plots the synthetic and real time-series data directly.
Similarly, as shown in Table 3, synthetic RR data demonstrated high fidelity. This is visually confirmed in Figure 5, which plots the synthetic and real time-series data directly.

3.1.2. Simulation of Arrhythmia

The simulator’s capability to generate realistic arrhythmia was evaluated using both tachycardia and bradycardia episodes. Validation compared the frequency, duration, and range of episodes between real and synthetic data. Over the 8-h observation period, the synthetic model generated 59 tachycardia episodes compared to 57 in the reference data (7.38 vs. 7.13 episodes/h) (Table 4). The mean tachycardia episode duration was 15.85 time steps (synthetic) versus 15.77 (real), while the mean maximum BPM was 213.15 (synthetic) versus 213.59 (real). KS tests for episode duration (p = 0.7850) and maximum BPM (p = 0.6822) indicated strong alignment between real and synthetic distributions.
For bradycardia, the model generated 45 episodes compared to 42 in the reference data (5.63 vs. 5.25 episodes/h) (Table 5). The mean episode duration was 10.6 time steps (synthetic) versus 10.2 (real), and the mean minimum BPM was 77.9 versus 78.4. KS tests for episode duration (p = 0.7125) and minimum BPM (p = 0.6534) further confirmed no significant statistical difference between distributions.

3.2. NPSS Results

The NPSS was evaluated for its ability to generate accurate, audience-specific summaries from synthetic NICU patient data. Using the Interactive Nurse Chatbot interface, simulated patient profiles were generated for each test case, producing structured JSON files containing vital signs and clinical events across an 8-h shift. The NPSS was prompted to summarize these cases, and outputs were compared against simulator-derived ground truth data.
Across all test runs, the NPSS achieved consistently high Relevance scores (0.95–0.99), indicating that the generated summaries reliably captured the clinically salient aspects of each scenario and remained focused on the user’s informational needs. Groundedness scores ranged from 0.80 to 0.91, reflecting strong factual alignment between the generated summaries and the underlying simulator data. Given the complexity of summarizing clinical information into free-form natural language, these scores indicate a high level of faithfulness and a low propensity for unsupported content. Collectively, these results indicate that the NPSS produces summaries that are both on-topic and largely faithful to simulator ground truth, warranting further examination of judge-specific scoring behavior. The distribution of groundedness scores across judges is illustrated in Figure 6, highlighting both central tendency and variability in scoring behavior.
Visual inspection of Figure 6 reveals distinct scoring behaviors across judges. Llama-3 exhibits the narrowest interquartile range and the smallest standard error (Table 6), indicating highly consistent scoring across runs. Mistral demonstrates greater dispersion, as reflected by both a wider interquartile range and a larger standard error (±0.017). The o3-mini judge displays the highest median groundedness score overall but also the largest variability (±0.024), including several lower outlier values. This pattern suggests a generally higher self-assessment tendency accompanied by occasional stricter evaluations in specific cases.
Furthermore, to assess the reliability of these generative metrics and mitigate potential self-preference bias, we compared the evaluation scores across three distinct LLM judges: the generator model itself (o3-mini) and two independent models (Llama-3 and Mistral). As summarized in Table 6, we observed a statistically significant effect of judge model choice on groundedness scores. Specifically, the o3-mini judge assigned higher groundedness scores (0.91 ± 0.024) to its own generated summaries compared to the external judges (0.80 and 0.82, respectively). This indicates that groundedness evaluations are sensitive to judge model selection, whereas relevance remains comparatively robust across architectures.
Relevance scores exhibited no statistically significant difference across judges, reinforcing the stability of topical alignment assessment across model families.
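Per-judge values of the form "mean ± standard error" can be computed from the individual run scores as sketched below. This assumes the reported uncertainty is the standard error of the mean (sample standard deviation divided by the square root of the number of runs). The judge labels and scores are placeholders for illustration only and are not the study's data.

```python
import math
import statistics


def mean_and_standard_error(scores):
    """Mean and standard error of the mean (sample std / sqrt(n))."""
    return statistics.mean(scores), statistics.stdev(scores) / math.sqrt(len(scores))


# Placeholder scores for illustration only; these are NOT the study's data.
scores_by_judge = {
    "judge_a": [0.9, 1.0, 0.8, 0.9],
    "judge_b": [0.8, 0.8, 0.8, 0.8],
}
for judge, scores in scores_by_judge.items():
    mean, se = mean_and_standard_error(scores)
    print(f"{judge}: {mean:.2f} ± {se:.3f}")
```

A judge that scores every run identically has a standard error of zero, which is the limiting case of the narrow spread described for the most consistent judge.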
A key capability of the NPSS is its use of prompt engineering to tailor outputs for different audiences. Figure 7 shows two examples of generated text from the same simulated patient, demonstrating the contrast between a technical summary for a nurse and an empathetic update for a parent. Of course, any generated text would have to be vetted by clinical staff before being released to a parent or included in an EHR.

3.3. Ablation Study: Impact of RAG

To quantify the incremental value of the RAG architecture and address the necessity of external clinical knowledge, we conducted an ablation study comparing the full NPSS against a non-RAG baseline. In this configuration, the LLM was provided with a simulated patient’s raw vital sign data (JSON) but was denied access to the external reference material. The simulated case represented a preterm neonate with bradycardia episodes.
We performed 30 test runs focused on clinical validity: 20 runs specifically targeting the identification of bradycardia episodes and 10 runs generating general patient summaries for the nurse summary use case. The non-RAG baseline demonstrated a critical failure in clinical reasoning. Without access to the neonatal-specific reference standards [17], the model hallucinated incorrect vital sign thresholds, frequently citing a generic heart rate threshold of 100 bpm, and consequently failed to identify any of the true bradycardia episodes.
In regard to the generated summaries, the non-RAG baseline accurately described “fluctuations” in the HR and RR data but failed to interpret them as pathological events. This confirms that, while the LLM possesses general medical knowledge, the retrieval of specific, authoritative context (such as the breakdown of normal vital signs by age group provided in [17]) is essential for accurate clinical decision support in the NICU.

3.4. Compute Resources Used in This Study

The simulator GAN models were trained using an NVIDIA 3090 GPU server. Since inference is far less computationally demanding than training, the trained simulator can generate new patient vital sign (HR, RR) and discrete clinical intervention event data on very basic hardware; in this study, however, the same 3090 GPU server was used to generate all simulated patient data. The NPSS leverages OpenAI LLMs (GPT-4 was investigated, but ultimately o3-mini was used since it better adhered to instructions). These OpenAI models are run through an API (i.e., the model itself runs on an OpenAI cloud server). The o3-mini LLM took approximately 5–10 s to generate text summaries for each use case (parent update, nurse handover, etc.) when given eight hours of simulated patient vital sign and intervention data; it took approximately the same time to judge the generated text. Local LLMs (Llama-3 and Mistral) were investigated as alternative LLM-as-a-judge models. These models ran locally on a server with an NVIDIA 5080 GPU with 16 GB VRAM and took only a few seconds to judge a summary generated by o3-mini.

4. Discussion

This study successfully developed and validated two complementary AI-driven systems aimed at addressing critical challenges in NICU data management and communication: a GAN-based NICU Patient Simulator and a suite of RAG-based LLM Patient Status Summarization tools. The results indicate strong potential for these systems to enhance clinical workflows, improve data realism for research and training, and facilitate better communication with parents.
The NICU Patient Simulator demonstrated a fair degree of realism in generating synthetic neonatal vital signs, including normal physiological patterns and specific pathological episodes. The efficacy of the GANs in capturing complex physiological dynamics (i.e., periods of arrhythmia) is supported by statistical validation. For instance, the models for normal heart rate and bradycardia episodes showed strong alignment with reference data, supported by KS-test p-values of 0.320 and 0.7125, respectively. The Wasserstein distance for normal heart rate was 2.30, a value indicating close distributional similarity in the context of physiological signals that vary over a wide range. The model for tachycardia also showed strong alignment with reference data (KS test p-value of 0.7850), confirming the simulator’s ability to replicate the complex, non-linear patterns of this condition. However, it is noted that the simulator does not currently account for ongoing and evolving diagnoses nor does vital sign data respond to simulated clinical interventions.
Ensuring stable and physiologically plausible GAN outputs required careful preprocessing of source data, including scaling of adult arrhythmia data, and iterative tuning of GAN architectures. While Gaussian smoothing mitigated some artifacts, the vital sign data generated by the GAN model within the neonatal patient simulator may not exhibit temporal coherence beyond aggregate statistics (e.g., mean, variance, skew, kurtosis, and other distribution-level metrics). For the arrhythmia cases, we have shown that the frequency and duration of simulated episodes closely matched the reference data, indicating that the simulated vital sign data exhibit temporal realism (to a point). While the simulator was sufficient for the purpose of developing the NPSS, greater temporal coherence may be required for other downstream uses of the neonatal patient simulator. In that case, temporal metrics, such as RMSE of derivatives, should be measured to ensure event transition realism. Furthermore, the current simulator focuses on a core set of vital signs; expanding this to include additional parameters, such as SpO2 and blood pressure, will increase its comprehensiveness. Moving to a multivariate GAN would better capture complex relationships between individual vital signs.
The RAG-based LLM summarization pipeline showed promising results in generating accurate and relevant clinical summaries. The high Groundedness (0.91) and Relevance (0.99) scores for the Interactive Nurse Chatbot (Table 6) indicate that the system can reliably extract and present information pertinent to user queries, adhering closely to the source data [48]. The strong Recall score (0.91) further suggests that critical clinical details are unlikely to be omitted.
The development of distinct summarizer modules (Parent Update Tool, Nurse Shift Summarizer, Interactive Nurse Chatbot, Auto-Charter) tailored to specific end-users is a significant strength. This customization, primarily achieved through meticulous prompt engineering and output structuring, allows the system to address diverse communication needs—from empathetic, simplified updates for parents to concise, technical shift summaries for nurse handovers, and structured data for EHRs. The modules were designed to achieve high usability and appropriateness of tone, an objective that will be formally assessed in future work.
Despite the strong quantitative performance observed across automated evaluation metrics, the summarization system is not without limitations. Although groundedness scores were consistently high (0.80–0.91 across judges), the variation observed between LLM evaluators highlights the inherent subjectivity of generative model assessment. In particular, differences in groundedness scoring across judges suggest that factual faithfulness evaluations may be sensitive to the choice of evaluation model. While relevance scores remained uniformly high (0.95–0.99), indicating reliable topical alignment with clinical prompts, groundedness remains the more stringent and variable metric.
Importantly, the evaluation framework itself relies on an LLM-as-judge paradigm, which, although widely adopted [46,47], is not equivalent to formal clinical validation. While groundedness scoring serves as a structured proxy for hallucination risk and factual consistency, it does not eliminate the need for future validation against real-world clinical documentation workflows. Further research into enhanced retrieval strategies, tighter grounding constraints, and potential post-generation verification mechanisms may help further reduce the risk of unsupported statements in high-stakes medical contexts [49].

4.1. Clinical Implications & Contributions

The decoupled nature of the simulator and summarizer offers distinct advantages. The neonatal patient simulator provides a valuable tool for validating not only our summarization pipeline but potentially other AI-driven healthcare applications, reducing reliance on sensitive real patient data for initial development. Conversely, the NPSS can and should be validated on actual patient data and clinician-generated gold-standard summary text before any clinical deployment.
This work contributes to the growing body of research on generative AI in healthcare. While prior work demonstrated proof-of-concept summarization from static, single-point-in-time data [6], our NICU Patient Simulator generates continuous HR and RR data streams over entire shifts. This enables temporally-aware and role-specific text summarization [50,51]. Automated charting could reduce documentation errors, while streamlined handover summaries could enhance patient safety by ensuring consistent and accurate information transfer. The Parent Update Tool addresses an important, often overlooked, aspect of NICU care: providing timely, understandable, and empathetic communication to families during a stressful period.

4.2. Limitations of the Study

A formal evaluation of the NPSS with end-users, assessing factors such as accuracy, clarity, and tone, has not yet been conducted; this is a critical next step. The simulator, while validated against real data, represents a proof-of-concept model, currently only simulating arrhythmia and not other disease states. Furthermore, it does not include diagnoses that are expected to evolve over the course of a patient’s stay in the NICU, nor does it respond to clinical interventions. The simulator is currently trained on short-period data from Physionet; transitioning to long-term neonatal patient monitoring data will lead to increased realism as such data become available. The NPSS was developed and validated using synthetic data; transitioning to real-world EHR data will introduce challenges such as handling missing data and adapting to site-specific documentation practices [50]. Finally, the ethical implications of deploying LLMs for clinical documentation and communication require ongoing consideration, particularly regarding data privacy, potential for bias, and accountability for errors [52,53].

4.3. Future Directions

Building upon the current systems, several avenues for future work are identified to further enhance their capabilities and clinical applicability.
Future iterations of the simulator will focus on increasing its physiological comprehensiveness and dynamic responsiveness. A multivariate GAN will integrate additional vital signs, such as oxygen saturation (SpO2), blood pressure, and glucose levels, to provide a more holistic representation of neonatal health status. The use of Recurrent Conditional GANs may better capture dependencies between different vital signs over time [27], while larger multivariate and neonatal-specific training data, as in [54], will lead to more realistic vital sign generators. The lack of patient datasets containing vital signs with concurrent annotations of interventions and diagnoses represents a challenge to developing truly realistic neonatal patient simulators. Intervention modeling will also be refined by incorporating patient-specific contexts like gestational age and prior medical history, allowing for increased realism. The duration of discrete care events can also be modeled with a simple extension to the simulator. The reference material from [17] supplied to the NPSS gives only basic HR ranges for relatively coarse infant age groups. However, HR is known to vary with gestational age, postnatal age, and sleep/wake/agitation state. This motivates future extension of the simulator to include both post-conceptual and post-natal age, and also to simulate the sleep/wake/agitation state of the infant.
Currently, the simulation of vital signs and discrete clinical interventions is not linked. Future work will evolve the generator into a dynamic, closed-loop system where vital signs respond causally to simulated interventions. This represents a significant architectural shift from the current GAN-based approach and would likely require moving to a framework based on reinforcement learning or recurrent state-space models. Such a system could more accurately model the complex feedback loops of patient deterioration and recovery [55]. Finally, improvements to the user interface, including real-time plotting and live preview capabilities, will be pursued for continuous refinement and improved usability.
Future work on the NPSS will focus on strengthening factual robustness and deployment safety within high-stakes NICU environments. Although automated evaluation demonstrated consistently high relevance (0.95–0.99) and strong groundedness (0.80–0.91 across three judges), the observed variability in groundedness scoring highlights the need for additional system-level safeguards beyond prompt engineering and retrieval refinement.
Several automated mitigation strategies are planned. First, schema validation mechanisms will be incorporated to ensure that all structured outputs (e.g., event counts, intervention timelines, vital sign summaries) strictly conform to expected clinical data formats and simulator-derived ground truth patient data. Second, contradiction detection modules will be integrated to compare generated statements against the underlying structured data and flag discrepancies prior to output presentation. Third, citation-to-data linking will be implemented, requiring generated clinical statements to be explicitly traceable to specific segments of retrieved patient data or reference material, thereby increasing transparency and auditability.
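As an indication of what the second safeguard might look like, the sketch below compares values in a structured NPSS output against simulator-derived ground truth and reports every disagreement. It is a hypothetical illustration of planned work, not an existing component: the function name and the field names used in the example are assumptions.

```python
def find_contradictions(generated: dict, ground_truth: dict) -> list:
    """Return a message for each field whose generated value disagrees
    with the simulator-derived value (field names are hypothetical)."""
    issues = []
    for field, expected in ground_truth.items():
        if field not in generated:
            issues.append(f"{field}: missing from generated output")
        elif generated[field] != expected:
            issues.append(
                f"{field}: generated {generated[field]!r}, expected {expected!r}"
            )
    return issues
```

For instance, `find_contradictions({"bradycardia_episodes": 3}, {"bradycardia_episodes": 5})` returns a single message flagging the mismatched count, which could be surfaced to the reviewer before the summary is presented.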
In addition, tighter RAG constraints and retrieval precision tuning will be explored to further improve factual grounding consistency. The existing NICU Patient Simulator and automated multi-judge evaluation pipeline provide a controlled testbed for systematically evaluating such enhancements. This infrastructure can also be used to benchmark future LLM releases within the neonatal clinical summarization domain.
Ultimately, while human oversight remains essential in clinical contexts, future iterations of the NPSS will prioritize layered automated guardrails to reduce reliance on post hoc review and increase reliability for real-world deployment. True clinical deployment of the NPSS will also contend with evolving regulations in this space, similar to those faced by emerging AI Scribe technologies.
This study has used an LLM-as-a-judge approach to evaluate the quality of generated patient summaries, given the input patient data and the retrieved reference materials. LLMs have been shown to be highly effective judges exhibiting strong correlation with human ratings; however, they can suffer from several forms of bias, especially when the same model is used for both generator and judge [56]. The next critical step for the NPSS is to conduct a formal user study, where clinical or parent end-users assess generated text for accuracy, tone, clarity, and completeness. When crafting the system prompt for each use case (parent update, nurse shift summarizer, etc.), we consulted with both clinicians and parents of children in the NICU to ensure that an appropriate tone and level of detail were achieved. However, a formal evaluation by actual end-users is required to confirm the quality of the generated text. In addition to text accuracy and tone, the benefit in terms of workload should also be quantified, as in ref. [31].
In addition to validating the NPSS using simulated patient data, an additional validation step will involve using de-identified data from actual NICU patients, along with clinician-generated text summaries. Such a study will permit direct comparison between LLM-generated and clinician-generated summaries, following best practices for evaluation of LLM-generated text [57].
A significant long-term goal is the integration of these systems, particularly the Auto-Charter module, with existing EHR infrastructures. This will require use of healthcare interoperability standards such as HL7/FHIR, while also navigating the significant practical hurdles of hospital IT governance, data security protocols, and clinical workflow integration [58]. Lastly, the current system relies on OpenAI’s models; transitioning to local, open-source LLMs could address cost and data privacy concerns, though this presents its own challenges regarding model performance and resource requirements [59].
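To give a sense of what FHIR-based interoperability involves, the sketch below builds a FHIR R4 Observation resource for a single heart-rate reading as a plain Python dictionary. It is an assumption-laden illustration, not part of the described system: the function name, the patient reference, and the choice of the "preliminary" status (to mark an entry that still awaits human verification) are introduced here, while the LOINC code 8867-4 and the UCUM unit follow the standard FHIR vital-signs conventions.

```python
import json


def heart_rate_observation(patient_ref, bpm, timestamp):
    """Build a FHIR R4 Observation for one heart-rate reading as a dict."""
    return {
        "resourceType": "Observation",
        # "preliminary" marks an entry that has not yet been verified.
        "status": "preliminary",
        "category": [{"coding": [{
            "system": "http://terminology.hl7.org/CodeSystem/observation-category",
            "code": "vital-signs",
        }]}],
        "code": {"coding": [{
            "system": "http://loinc.org",
            "code": "8867-4",
            "display": "Heart rate",
        }]},
        "subject": {"reference": patient_ref},
        "effectiveDateTime": timestamp,
        "valueQuantity": {
            "value": bpm,
            "unit": "beats/minute",
            "system": "http://unitsofmeasure.org",
            "code": "/min",
        },
    }


# Hypothetical patient reference and reading, for illustration only.
print(json.dumps(heart_rate_observation("Patient/example", 128, "2026-03-02T08:00:00Z"), indent=2))
```

A real integration would additionally need to satisfy the receiving EHR's profiles, authentication, and governance requirements noted above.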

5. Conclusions

This project successfully developed and validated a novel framework for generating descriptive text from neonatal patient vital sign and clinical event data. With further development and validation, this system may address critical needs in NICU documentation, clinical handovers, and parental communication. The first component of this framework, a GAN-driven NICU Patient Simulator, was created to address the challenge of data scarcity: there is a lack of publicly available neonatal patient vital sign data with associated diagnoses and annotated interventions. By integrating statistical methods, the simulator demonstrated considerable fidelity in generating realistic neonatal vital signs and dynamic intervention patterns, providing a robust and safe platform for developing and testing AI-driven healthcare applications. The practical benefit of the neonatal patient simulator is to enable the development of the NPSS and other patient status and care summarization systems. Developing and validating such systems requires large quantities of data for which we know the true answer (patient age, sex, presence of arrhythmia, which interventions took place during the shift, etc.). The simulator provides a mechanism to generate such data.
Source code for the NICU Patient Simulator and the NPSS system prompts for each use case are available at https://github.com/JesseLevine727/NICU_PatientSimulator (accessed on 2 March 2026).
The second component, the NPSS, was designed to ingest these complex vital sign time-series and discrete-event data to generate descriptive text tailored for multiple endpoints. Using a RAG framework to ground its output in the simulated patient data and reliable reference data, the NPSS exhibited strong performance in producing accurate, relevant, and contextually appropriate text for clinicians, parents, and EHRs.
While this integrated approach is promising, challenges related to GAN training stability and the potential for LLM hallucinations were encountered and partially mitigated, highlighting areas for continued research. Crucially, the practical application of such a system requires a human-in-the-loop approach to ensure the veracity and clinical appropriateness of all generated text before it is used for charting or communicated to parents. These safety considerations, alongside the paramount ethical issues surrounding generative AI in sensitive medical settings, must guide future development.
The clinical significance of the NPSS lies in its potential, once fully validated, to alleviate the documentation burden experienced by clinical staff. The ability to digest, summarize, and synthesize several hours of patient status and care events into text for various audiences would permit clinical staff to spend more time on care and less time on documentation. The use of AI Scribes for summarizing spoken exchanges between doctors and patients is growing rapidly because both clinicians and patients benefit. The NPSS is a next step that focuses on data generated by patient monitors and machine learning algorithms (e.g., video-based intervention detection, vital sign estimation, patient segmentation, etc.) rather than spoken exchanges. Continued development and rigorous real-world validation testing will be vital to translating these promising research findings into effective clinical practice.

Author Contributions

Conceptualization, J.R.G.; methodology, J.L., G.R. and J.R.G.; software, J.L. and G.R.; validation, J.L., G.R. and J.R.G.; resources, J.R.G.; writing (original draft preparation), J.L. and G.R.; writing (review and editing), J.R.G.; supervision, J.R.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Grant number RGPIN-2021-04184.

Institutional Review Board Statement

Ethical review and approval were waived for this study because all data were either simulated or sourced from a public repository (PhysioNet).

Data Availability Statement

The original data presented in the study are openly available in PhysioNet [34,35]. These data were accessed in March 2025.

Acknowledgments

This research builds upon proof-of-concept research by Toyin Adams.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Liu, Z.; Huang, B.; Lin, C.L.; Wu, C.L.; Zhao, C.; Chao, W.C.; Wu, Y.C.; Zheng, Y.; Wang, Z. Contactless Respiratory Rate Monitoring For ICU Patients Based On Unsupervised Learning. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada; IEEE: Piscataway, NJ, USA, 2023; pp. 6005–6014.
  2. Zeng, Y.; Yu, D.; Song, X.; Wang, Q.; Pan, L.; Lu, H.; Wang, W. Camera-based cardiorespiratory monitoring of preterm infants in NICU. IEEE Trans. Instrum. Meas. 2024, 73, 1–13.
  3. Dosso, Y.S.; Kyrollos, D.; Greenwood, K.J.; Harrold, J.; Green, J.R. NICUface: Robust neonatal face detection in complex NICU scenes. IEEE Access 2022, 10, 62893–62909.
  4. Dosso, Y.S.; Aziz, S.; Nizami, S.; Greenwood, K.; Harrold, J.; Green, J.R. Video-based neonatal motion detection. In Proceedings of the 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC); IEEE: Piscataway, NJ, USA, 2020; pp. 6135–6138.
  5. Hajj-Ali, Z.; Dosso, Y.S.; Greenwood, K.; Harrold, J.; Green, J.R. Depth-Based Intervention Detection in the Neonatal Intensive Care Unit Using Vision Transformers. Sensors 2024, 24, 7753.
  6. Souley Dosso, Y.; Greenwood, K.; Harrold, J.; Green, J.R. RGB-D Scene Analysis in the NICU. Comput. Biol. Med. 2021, 138.
  7. Al Nazi, Z.; Peng, W. Large Language Models in Healthcare and Medical Domain: A Review. Informatics 2024, 11, 57.
  8. Huang, Z.; Chen, X.; Wang, Y.; Huang, J.; Zhao, X. A survey on biomedical automatic text summarization with large language models. Inf. Process. Manag. 2025, 62, 104216.
  9. Xia, T.C.; Bertini, F.; Montesi, D. Large Language Models Evaluation for PubMed Extractive Summarisation. ACM Trans. Comput. Healthc. 2026, 7, 1–23.
  10. Nerella, S.; Bandyopadhyay, S.; Zhang, J.; Contreras, M.; Siegel, S.; Bumin, A.; Silva, B.; Sena, J.; Shickel, B.; Bihorac, A.; et al. Transformers and large language models in healthcare: A review. Artif. Intell. Med. 2024, 154, 102900.
  11. Van Buchem, M.M.; Boosman, H.; Bauer, M.P.; Kant, I.M.; Cammel, S.A.; Steyerberg, E.W. The digital scribe in clinical practice: A scoping review and research agenda. npj Digit. Med. 2021, 4, 57.
  12. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144.
  13. Yoon, J.; Jarrett, D.; van der Schaar, M. Time-series Generative Adversarial Networks. In Proceedings of the Advances in Neural Information Processing Systems; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Nice, France, 2019; Volume 32.
  14. Chang, C.; Perlman, J.; Abramson, E. Use of a Novel Manikin for Neonatal Resuscitation Ventilation Training. Children 2022, 9, 364.
  15. Yousef, N.; Moreau, R.; Soghier, L. Simulation in neonatal care: Towards a change in traditional training? Eur. J. Pediatr. 2022, 181, 1429–1436.
  16. Yang, S.Y. Simulation Training Needs of Nurses for Nursing High-Risk Premature Infants: A Cross-Sectional Study. Healthcare 2022, 10, 2197.
  17. Chiocca, E.M. Normal Vital Signs in Infants, Children, and Adolescents. In Advanced Pediatric Assessment, 2nd ed.; Springer Publishing Company: New York, NY, USA, 2016; Appendix A: Normal Vital Signs in Infants, Children, and Adolescents.
  18. Avila-Alvarez, A.; Davis, P.G.; Kamlin, C.O.F.; Thio, M. Documentation during neonatal resuscitation: A systematic review. Arch. Dis. Child.-Fetal Neonatal Ed. 2021, 106, 376–380.
  19. Gesner, E.; Dykes, P.C.; Zhang, L.; Gazarian, P. Documentation burden in nursing and its role in clinician burnout syndrome. Appl. Clin. Inform. 2022, 13, 983–990.
  20. Cohen, G.R.; Friedman, C.P.; Ryan, A.M.; Richardson, C.R.; Adler-Milstein, J. Variation in physicians’ electronic health record documentation and potential patient harm from that variation. J. Gen. Intern. Med. 2019, 34, 2355–2367.
  21. Boudreault, L.; Hebert-Lavoie, M.; Ung, K.; Mahmoudhi, C.; Vu, Q.P.; Jouvet, P.; Doyon-Poulin, P. Situation Awareness-Oriented Dashboard in ICUs in Support of Resource Management in Time of Pandemics. IEEE J. Transl. Eng. Health Med. 2023, 11, 151–160.
  22. Yakob, N.; Laliberté, S.; Doyon-Poulin, P.; Jouvet, P.; Noumeir, R. Data Representation Structure to Support Clinical Decision-Making in the Pediatric Intensive Care Unit: Interview Study and Preliminary Decision Support Interface Design. JMIR Form. Res. 2024, 8, e49497.
  23. ResusSim. ResusMonitor: Online Patient Monitor Simulator. 2024. Available online: https://resusmonitor.com/ (accessed on 6 September 2025).
  24. Padilha, J.M.; Machado, P.P.; Ribeiro, A.; Ramos, J.; Costa, P. Clinical Virtual Simulation in Nursing Education: Randomized Controlled Trial. J. Med. Internet Res. 2019, 21, e11529.
  25. Elsevier. Shadow Health: Digital Clinical Experiences. 2021. Available online: https://www.shadowhealth.com/ (accessed on 6 September 2025).
  26. Festag, S.; Denzler, J.; Spreckelsen, C. Generative adversarial networks for biomedical time series forecasting and imputation. J. Biomed. Inform. 2022, 129, 104058.
  27. Esteban, C.; Hyland, S.L.; Rätsch, G. Real-valued (Medical) Time Series Generation with Recurrent Conditional GANs. arXiv 2017, arXiv:1706.02633.
  28. Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. ACM Trans. Inf. Syst. 2025, 43, 1–55.
  29. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Kuksa, P.; Minervini, P.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Nice, France, 2020; Volume 33, pp. 9459–9474.
  30. Neha, F.; Bhati, D.; Shukla, D.K. Retrieval-Augmented Generation (RAG) in Healthcare: A Comprehensive Review. AI 2025, 6, 226.
  31. Tierney, A.A.; Gayre, G.; Hoberman, B.; Mattern, B.; Ballesca, M.; Wilson Hannay, S.B.; Castilla, K.; Lau, C.S.; Kipnis, P.; Liu, V.; et al. Ambient Artificial Intelligence Scribes: Learnings after 1 Year and over 2.5 Million Uses. NEJM Catal. 2025, 6, CAT–25.
  32. Ernstmeyer, K.; Christman, E. Nursing Skills. Open Resources for Nursing (Open RN) [Internet]. Table 1.3b, “Normal Heart Rate by Age”. 2021. Available online: https://www.ncbi.nlm.nih.gov/books/NBK593193/table/ch1survey.T.normal_heart_rate_by_age/ (accessed on 1 March 2025).
  33. Figueira, A.; Vaz, B. Survey on Synthetic Data Generation, Evaluation Methods and GANs. Mathematics 2022, 10, 2733.
  34. Gee, A.H.; Barbieri, R.; Paydarfar, D.; Indic, P. Predicting Bradycardia in Preterm Infants Using Point Process Analysis of Heart Rate. IEEE Trans. Biomed. Eng. 2017, 64, 2300–2308.
  35. Sološenko, A.; Petrėnas, A.; Paliakaitė, B.; Marozas, V.; Sörnmo, L. Model for Simulating ECG and PPG Signals with Arrhythmia Episodes (Version 1.3.0). RRID:SCR_007345. 2021. Available online: https://physionet.org/content/ecg-ppg-simulator-arrhythmia/1.3.0/ (accessed on 1 March 2025).
  36. Wang, Y.; Xu, H.; Kumar, R.; Tipparaju, S.M.; Wagner, M.B.; Joyner, R.W. Differences in transient outward current properties between neonatal and adult human atrial myocytes. J. Mol. Cell. Cardiol. 2003, 35, 1083–1092.
  37. Hasenstab-Kenney, K.A.; Bellodas Sanchez, J.; Prabhakar, V.; Lang, I.M.; Shaker, R.; Jadcherla, S.R. Mechanisms of bradycardia in premature infants: Aerodigestive-cardiac regulatory-rhythm interactions. Physiol. Rep. 2020, 8, e14495.
  38. Kothari, D.S.; Skinner, J.R. Neonatal tachycardias: An update. Arch. Dis. Child.-Fetal Neonatal Ed. 2006, 91, F136–F144.
  39. du Toit, S.H.; Browne, M.W. Structural Equation Modeling of Multivariate Time Series. Multivar. Behav. Res. 2007, 42, 67–101.
  40. Inouye, D.; Yang, E.; Allen, G.; Ravikumar, P. A Review of Multivariate Distributions for Count Data Derived from the Poisson Distribution. Wiley Interdiscip. Rev. Comput. Stat. 2017, 9, e1398.
  41. Héon, M.; Aita, M.; Lavallée, A.; De Clifford-Faugère, G.; Laporte, G.; Boisvert, A.; Feeley, N. Comprehensive Mapping of NICU Developmental Care Nursing Interventions and Related Sensitive Outcome Indicators: A Scoping Review Protocol. BMJ Open 2022, 12, e046807.
  42. LangChain. Build a Local RAG Application. 2024. Available online: https://docs.langchain.com/oss/python/langchain/rag (accessed on 15 August 2024).
  43. LangChain Contributors. Recursive Text Splitter: LangChain Python Documentation. 2024. Available online: https://docs.langchain.com/oss/python/integrations/splitters/recursive_text_splitter (accessed on 10 February 2026).
  44. Massey, F.J., Jr. The Kolmogorov-Smirnov test for goodness of fit. J. Am. Stat. Assoc. 1951, 46, 68–78.
  45. Spelta, A.; Raffinetti, E. Evaluating SAFE AI principles using Wasserstein distance: A comparative study of Machine Learning models. Statistics 2024, 58, 1283–1303.
  46. Zheng, L.; Chiang, W.L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.; et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems 36; Curran Associates, Inc.: Nice, France, 2023; pp. 46595–46623.
  47. Gu, J.; Jiang, X.; Shi, Z.; Tan, H.; Zhai, X.; Xu, C.; Li, W.; Shen, Y.; Ma, S.; Liu, H.; et al. A Survey on LLM-as-a-Judge. arXiv 2025, arXiv:2411.15594.
  48. Es, S.; James, J.; Anke, L.E.; Schockaert, S. Ragas: Automated evaluation of retrieval augmented generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 150–158.
  49. Asgari, E.; Montaña-Brown, N.; Dubois, M.; Khalil, S.; Balloch, J.; Pimenta, D. A Framework to Assess Clinical Safety and Hallucination Rates of LLMs for Medical Text Summarisation. medRxiv 2024.
  50. Loni, M.; Poursalim, F.; Asadi, M.; Gharehbaghi, A. A Review on Generative AI Models for Synthetic Medical Text, Time Series, and Longitudinal Data. arXiv 2024, arXiv:2411.12274.
  51. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, M.; Wang, H. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv 2024, arXiv:2312.10997.
  52. Geis, J.R.; Brady, A.P.; Wu, C.C.; Spencer, J.; Ranschaert, E.; Jaremko, J.L.; Langer, S.G.; Borondy Kitts, A.; Birch, J.; Shields, W.F.; et al. Ethics of Artificial Intelligence in Radiology: Summary of the Joint European and North American Multisociety Statement. Radiology 2019, 293, 436–440.
  53. Baig, M.M.; Hobson, C.; GholamHosseini, H.; Ullah, E.; Afifi, S. Generative AI in improving personalized patient care plans: Opportunities and barriers towards its wider adoption. Appl. Sci. 2024, 14, 10899.
  54. Niestroy, J.C.; Moorman, J.R.; Levinson, M.A.; Manir, S.A.; Clark, T.W.; Fairchild, K.D.; Lake, D.E. Discovery of signatures of fatal neonatal illness in vital signs using highly comparative time-series analysis. npj Digit. Med. 2022, 5, 6.
  55. Liu, S.; Ngiam, K.Y.; Feng, M. Deep Reinforcement Learning for Clinical Decision Support: A Brief Survey. arXiv 2019, arXiv:1907.09475.
  56. Chiang, C.H.; Lee, H.y. Can large language models be an alternative to human evaluations? arXiv 2023, arXiv:2305.01937.
  57. Rudd, E.M.; Andrews, C.; Tully, P. A Practical Guide for Evaluating LLMs and LLM-Reliant Systems. arXiv 2025, arXiv:2506.13023.
  58. Mandel, J.C.; Kreda, D.A.; Mandl, K.D.; Kohane, I.S.; Ramoni, R.B. SMART on FHIR: A standards-based, interoperable apps platform for electronic health records. J. Am. Med. Inform. Assoc. 2016, 23, 899–908.
  59. Pan, G.; Chodnekar, V.; Roy, A.; Wang, H. A Cost-Benefit Analysis of On-Premise Large Language Model Deployment: Breaking Even with Commercial LLM Services. arXiv 2025, arXiv:2509.18101.
Figure 1. High-level architecture of the proposed Neonatal Patient Status Summarizer (NPSS). A modular Retrieval-Augmented Generation (RAG) LLM system summarizes patient care and health status, given input data from ML subsystems (providing ML-Derived Clinical Inferences, such as patient detection, coverage, movement, clinical interventions) and vital signs. A GAN-driven Neonatal Patient Simulator generates simulated vital sign and discrete clinical care event data to develop and validate the NPSS across diverse use cases, including nurse shift handover summaries, automated charting, and parent communication.
Figure 2. Covariance matrix of co-occurring NICU interventions. This matrix models the likelihood of multiple care events occurring simultaneously to reflect realistic NICU practices, such as bundling interventions to reduce infant disturbances and avoiding invasive procedures during a family visit.
Figure 3. Text summarization pipeline using memory-augmented conversational RAG. The pipeline processes incoming queries through a context-aware chain that retrieves patient data, clinical reference information, and conversational history to generate grounded, audience-specific responses via the LLM. The workflow includes chat history retrieval, patient and reference document vector search, dynamic prompt construction, LLM response generation, and memory updating.
Figure 4. Comparison of synthetic versus real neonatal heart rate (HR) data. The plot demonstrates a close temporal and distributional alignment over the simulated period, supporting the model’s realism.
Figure 5. Comparison of synthetic versus real neonatal respiratory rate (RR) data. Similar to the HR data, the synthetic RR signal effectively captures the dynamic fluctuations of the real-world reference signal.
Figure 6. Distribution of groundedness scores assigned by three LLM judges. Boxes represent interquartile ranges (IQR), center lines indicate medians, and whiskers denote extrema.
Figure 7. Audience-specific generated summaries from NPSS for the same simulated patient case: a clinically stable, “normal” male neonate (24 days old, 3.7 kg) with no active diagnoses. Vital signs fluctuate within expected physiologic ranges over the 8-h shift, with intermittent elevations in respiratory rate above the upper reference limit. (A) Technical nurse shift summary; (B) empathetic parent-facing update.
Table 1. Retrieval hyperparameters for the dual-retriever architecture used in NPSS.
Parameter | Patient Data Retriever | Reference Material Retriever
Chunking Strategy | RecursiveCharacterTextSplitter | RecursiveCharacterTextSplitter
Chunk Size | 5000 characters | 250 characters
Chunk Overlap | 200 characters | 0 characters
Retrieval Method | Cosine Similarity | Cosine Similarity
Top-k | 5 | 1
Embedding Model | text-embedding-ada-002 | text-embedding-ada-002
Vector Store | InMemoryVectorStore | InMemoryVectorStore
Reranking | None | None
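As an illustration of how the Table 1 settings fit together, the following dependency-free Python sketch applies them to two separate retrievers. It is not the NPSS implementation: fixed-width character chunking stands in for RecursiveCharacterTextSplitter, a toy bag-of-words vector stands in for text-embedding-ada-002, and all class and function names (`RetrieverConfig`, `chunk`, `embed`, `cosine`, `retrieve`) are hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrieverConfig:
    chunk_size: int     # characters per chunk
    chunk_overlap: int  # characters shared by consecutive chunks
    top_k: int          # number of chunks returned per query


# Values taken from Table 1.
PATIENT_CFG = RetrieverConfig(chunk_size=5000, chunk_overlap=200, top_k=5)
REFERENCE_CFG = RetrieverConfig(chunk_size=250, chunk_overlap=0, top_k=1)


def chunk(text: str, cfg: RetrieverConfig) -> list[str]:
    """Fixed-width character chunking with overlap (simplified stand-in)."""
    step = cfg.chunk_size - cfg.chunk_overlap
    stop = max(len(text) - cfg.chunk_overlap, 1)
    pieces = (text[i:i + cfg.chunk_size] for i in range(0, stop, step))
    return [p for p in pieces if p]


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, document: str, cfg: RetrieverConfig) -> list[str]:
    """Return the top-k chunks of `document` ranked by cosine similarity."""
    q = embed(query)
    ranked = sorted(chunk(document, cfg),
                    key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:cfg.top_k]
```

Under these settings, a reference query returns a single short passage, whereas a patient-data query returns up to five large chunks.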
Table 2. Validation metrics for normal heart rate.
Metric | Real Data | Synthetic Data
Mean (BPM) | 128.4 | 127.0
Standard deviation (STD) | 12.7 | 13.5
Skewness | 0.12 | 0.15
Kurtosis | 2.85 | 3.10
KS statistic/p-value/WD | 0.085/0.320/2.30
Table 3. Validation metrics for respiration rate.
Metric | Real Data | Synthetic Data
Mean (BPM) | 127.26 | 126.86
Standard deviation (STD) | 10.81 | 12.29
Skewness | 0.23 | 0.50
Kurtosis | 0.09 | 1.05
KS statistic/p-value/WD | 0.168/0.118/3.761
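The KS statistic and Wasserstein distance (WD) reported in Tables 2 and 3 compare the empirical distributions of the real and synthetic signals. The sketch below shows one way to compute both for one-dimensional samples using only the Python standard library. It is illustrative rather than the authors' code, the function names are hypothetical, and it omits the p-value, which in practice could be obtained from a library routine such as `scipy.stats.ks_2samp`.

```python
from bisect import bisect_right


def ks_statistic(x: list[float], y: list[float]) -> float:
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    # Both empirical CDFs are step functions that change only at sample
    # points, so it is enough to compare them at every observed value.
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        d = max(d, abs(fx - fy))
    return d


def wasserstein_1d(x: list[float], y: list[float]) -> float:
    """1-D Wasserstein-1 distance: area between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    grid = sorted(set(xs + ys))
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        fx = bisect_right(xs, left) / len(xs)
        fy = bisect_right(ys, left) / len(ys)
        total += abs(fx - fy) * (right - left)
    return total
```

A small KS statistic with a large p-value, together with a small WD relative to the signal's standard deviation, is consistent with the synthetic and real samples being drawn from similar distributions.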
Table 4. Validation metrics for tachycardia episodes.
Metric | Real Data | Synthetic Data
Count of episodes (8 h) | 57 | 59
Frequency (episodes/h) | 7.13 | 7.38
Mean duration (timesteps) | 15.77 | 15.85
Mean maximum BPM | 213.59 | 213.15
KS statistic/p-value (duration) | 0.081/0.785
KS statistic/p-value (max BPM) | 0.093/0.682
Table 5. Validation metrics for bradycardia episodes.
Metric | Real Data | Synthetic Data
Count of episodes (8 h) | 42 | 45
Frequency (episodes/h) | 5.25 | 5.63
Mean duration (timesteps) | 10.20 | 10.60
Mean minimum BPM | 78.40 | 77.90
KS statistic/p-value (duration) | 0.094/0.713
KS statistic/p-value (min BPM) | 0.089/0.653
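The episode-level metrics in Tables 4 and 5 (count, frequency, mean duration, and mean extreme BPM) can be derived from a heart rate series by grouping consecutive out-of-range timesteps. The following sketch assumes a simple fixed threshold. The threshold values, the `Episode` class, and the function names are hypothetical and are not taken from the simulator.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    start: int      # index of the first timestep in the episode
    duration: int   # length in timesteps
    extreme: float  # max BPM (tachycardia) or min BPM (bradycardia)


def find_episodes(hr: list[float], threshold: float, above: bool) -> list[Episode]:
    """Group consecutive timesteps beyond `threshold` into episodes."""
    episodes: list[Episode] = []
    start = None
    # A trailing None acts as a sentinel that closes an episode still
    # open at the end of the series.
    for i, v in enumerate(list(hr) + [None]):
        inside = v is not None and (v > threshold if above else v < threshold)
        if inside and start is None:
            start = i
        elif not inside and start is not None:
            run = hr[start:i]
            episodes.append(Episode(start, i - start, max(run) if above else min(run)))
            start = None
    return episodes


def summarize(episodes: list[Episode], hours: float) -> dict[str, float]:
    """Aggregate episodes into the kinds of metrics listed in Tables 4 and 5."""
    n = len(episodes)
    return {
        "count": n,
        "per_hour": n / hours,
        "mean_duration": sum(e.duration for e in episodes) / n if n else 0.0,
        "mean_extreme": sum(e.extreme for e in episodes) / n if n else 0.0,
    }


# Toy series with an assumed (hypothetical) tachycardia threshold of 180 BPM.
toy_hr = [120, 190, 200, 130, 125, 210, 215, 220]
tachy = summarize(find_episodes(toy_hr, threshold=180, above=True), hours=8)
```

The frequency row in each table is the episode count divided by the 8 h window, for example 57 / 8 ≈ 7.13 episodes/h for the real tachycardia data.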
Table 6. Average evaluation scores (mean ± standard error) for different LLM judge models.
Judge Model | Groundedness | Relevance
o3-mini | 0.91 ± 0.024 | 0.99 ± 0.006
Llama3 | 0.80 ± 0.002 | 0.97 ± 0.009
Mistral | 0.82 ± 0.017 | 0.95 ± 0.020
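Table 6 reports each judge's scores as mean ± standard error. Assuming the standard error is the sample standard deviation divided by the square root of the number of evaluated summaries (the usual definition; the exact computation is not specified here), an entry could be produced as follows. The function names are hypothetical and the example scores are placeholders, not study data.

```python
import math
import statistics


def mean_and_se(scores: list[float]) -> tuple[float, float]:
    """Return the mean and the standard error of the mean for `scores`."""
    if len(scores) < 2:
        raise ValueError("at least two scores are needed for a standard error")
    mean = statistics.fmean(scores)
    se = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, se


def format_entry(scores: list[float]) -> str:
    """Format scores in the 'mean ± SE' style used in Table 6."""
    mean, se = mean_and_se(scores)
    return f"{mean:.2f} ± {se:.3f}"


# Placeholder per-summary scores from one hypothetical judge.
example = format_entry([0.8, 0.9, 1.0])
```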
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Levine, J.; Riarh, G.; Green, J.R. Generative Simulation and Summarization of Neonatal Patient Data. Information 2026, 17, 261. https://doi.org/10.3390/info17030261
