Article

Leave as Fast as You Can: Using Generative AI to Automate and Accelerate Hospital Discharge Reports

by Alex Trejo Omeñaca 1,2,*,†, Esteve Llargués Rocabruna 3,†, Jonny Sloan 1,†, Michelle Catta-Preta 1,†, Jan Ferrer i Picó 1,4,*,†, Julio Cesar Alfaro Alvarez 3,†, Toni Alonso Solis 3,†, Eloy Lloveras Gil 3,†, Xavier Serrano Vinaixa 3,†, Daniela Velasquez Villegas 3,†, Ramon Romeu Garcia 3,†, Carles Rubies Feijoo 3,†, Josep Maria Monguet i Fierro 1,4,† and Beatriu Bayes Genis 3,†
1 Innex Labs, Carrer de Tarragona 10, 08800 Vilanova i la Geltrú, Spain
2 Departament d’Enginyeria Gràfica i Disseny, Escola Politècnica Superior d’Enginyeria de Vilanova i la Geltrú (EPSEVG), Universitat Politècnica de Catalunya, Avinguda de Víctor Balaguer 1, 08800 Vilanova i la Geltrú, Spain
3 Hospital General de Granollers, Avinguda Francesc Ribas s/n, 08402 Granollers, Spain
4 Departament d’Enginyeria Gràfica i Disseny, Escola Tècnica Superior d’Enginyeria Industrial de Barcelona (ETSEIB), Universitat Politècnica de Catalunya, Avinguda Diagonal 647, 08028 Barcelona, Spain
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Computers 2025, 14(6), 210; https://doi.org/10.3390/computers14060210
Submission received: 17 March 2025 / Revised: 15 May 2025 / Accepted: 19 May 2025 / Published: 28 May 2025
(This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling)

Abstract

Hospital discharge reports (HDRs) are pivotal for continuity of care but consume substantial clinician time. Generative AI systems based on large language models (LLMs) could streamline this process, provided they deliver accurate, multilingual, and workflow-compatible outputs. We pursued a three-stage, design-science approach. Proof of concept: five state-of-the-art LLMs were benchmarked with multi-agent prompting to produce sample HDRs and define the optimal agent structure. Prototype: 60 HDRs spanning six specialties were generated and compared with clinician originals using ROUGE, with average scores comparable to those of specialized news summarization models in Spanish and, at somewhat lower values, in Catalan. A qualitative audit of 27 HDR pairs showed recurrent divergences in medication dose (56%) and social context (52%). Pilot deployment: the AI-HDR service was embedded in the hospital’s electronic health record, and 47 HDRs were autogenerated in real-world settings and reviewed by attending physicians. Missing information and factual errors were flagged in 53% and 47% of drafts, respectively, although physicians’ written assessments suggested these errors were rarely critical. An LLM-driven, agent-orchestrated pipeline can draft real-world HDRs, cutting administrative overhead and achieving clinician-acceptable quality, though not without errors that require human supervision. Future work should refine specialty-specific prompts to curb omissions, add temporal consistency checks to prevent outdated data from propagating, and validate time savings and clinical impact in multi-center trials.

1. Introduction

1.1. Hospital Discharge Reports, a Necessary Resource Black Hole

Hospital discharge reports (also known as discharge summaries, HDRs in this paper) are vital for ensuring continuity of care when patients transition from hospital to home or other care settings [1]. They serve as the primary communication tool conveying a patient’s hospital course, diagnoses, treatments, and follow-up plan to the next healthcare providers, or even as a basis for continuity of care in transnational collaboration [1]. Studies show that having a discharge summary available at the first post-hospital visit can significantly impact patient outcomes [2,3]. For example, when a discharge summary was available to the outpatient physician, one study observed a trend toward a 26% lower readmission risk [4].
The HDR is generated based on clinical course notes and emergency department reports. The clinical course compiles all information recorded by various physicians throughout the patient’s hospital stay, including progress notes, diagnostic test results, and clinical observations [1,4,5]. These data are contributed by multiple professionals, often in diverse styles and formats, resulting in lengthy and complex documents that can be challenging to manage. A multi-hospital study on heart failure patients exemplified this variability: even the highest-performing hospital’s discharge summaries were deemed insufficient in timeliness, transmission, and content, indicating room for improvement across the board [6]. Common deficiencies include missing key details (such as incomplete medication lists, absent test results, or vague follow-up instructions) and lack of clarity in the information provided [4]. The quality can depend on individual physician practices; without a standardized template, some may omit elements that others include. This inconsistency poses risks to patient safety and continuity [7], as the receiving provider cannot always count on a uniform or coherent set of data.
Beyond the quality challenges, preparing discharge summaries can be time-consuming, and inefficiencies in this process may lead to delays in patient discharge or communication gaps. The task of compiling all relevant information at discharge contributes to physician workload; one analysis noted that physicians spend roughly 2 h on documentation (like writing notes and summaries in the electronic health record) for every 1 h of direct patient care [8]. Furthermore, the administrative burden associated with manually preparing these documents can hinder workflow efficiency in clinical practice [9] and contribute to clinician burnout [10]. Automating parts of this process can alleviate administrative workload, allowing clinicians to focus more on patient care [10]. A significant reduction in documentation time can enhance physician satisfaction and improve interactions with patients [11].

1.2. The Potential of Leveraging AI for HDR Generation

Digital technologies have been employed for some time to automate and enhance the accuracy and efficiency of clinical documentation [12] and clinical support systems [13]. Recent advancements in AI, particularly the development of large language models (LLMs), have shown the ability to generate human-like text and have been explored for clinical documentation purposes [13]. These models can assist in summarizing patient interactions and ensuring that critical clinical information is accurately captured and easily communicated [14]. Moreover, healthcare professionals consider these models especially useful in supportive roles, such as drafting preliminary documents, aiding in decision-making [15], or making information more understandable [14,16].
Nonetheless, automation of HDR generation has historically been challenging due to the necessary steps in the generation process: extraction and abstraction [8]. While extraction focuses on gathering the existing data from the hospital stay and summarizing it, abstraction requires a more creative approach and the capacity to make sense of the data to produce a coherent document. Most artificial intelligence cases before the popularization of GenAI focused on extraction—and did so without very promising results—leaving professionals with the burden of adding the information into the HDR and producing the final document [8]. And even after the rise in GenAI popularity, until recently the data intake constraints of GenAI models acted as a bottleneck in the development of automated clinical data processing systems and services [17].
Yet today’s GenAI is particularly adept at processing large amounts of text according to predefined guidelines and producing convincing, well-formed prose, especially when given a very specific, uncreative task that involves a limited set of data, such as generating an HDR [8,18,19,20,21,22]. Given the possibility of a standardized HDR structure [1,23,24], their strong correlation with clinical progression, and the necessity of managing extensive textual data from medical records, modern GenAI with advanced reasoning capabilities is well suited to automating this essential yet labor-intensive task without risking quality.
Several studies have demonstrated the potential of GenAI in this domain. Janota and Janota [19] utilized GenAI to generate psychiatric discharge reports. A blind review conducted by professionals revealed that GPT-4 (one of OpenAI’s GenAI models) produced summaries that were more coherent and structured than those developed by clinicians. However, concerns regarding accuracy and reliability necessitated professional oversight. Similarly, Clough et al. [20] investigated the feasibility of AI-generated HDRs using ChatGPT (https://chatgpt.com/, accessed on 15 May 2025) (late 2023 to early 2024) on 25 synthetic cases. Their findings indicated that GPT-generated documentation was of quality equivalent to that produced by junior doctors [20]. Further improvements in HDR accuracy were achieved by Pal et al. through fine-tuned models targeting specific report sections, demonstrating the feasibility of automatically generating precise HDRs from nurses’ notes when restricting GenAI involvement to selected parts of the report [5].
Although most research on GenAI in HDR generation focuses on assessing output quality and its potential to partially replace human labor, the underlying motivation for such implementations is the promise of increased resource efficiency [20]. In a case study on orthopedic discharge documents, researchers found that ChatGPT generated reports in one-tenth of the time required by a human while maintaining equivalent quality and exhibiting fewer hallucinations [21]. This significant reduction in time spent on documentation suggests notable economic savings while allowing healthcare professionals to dedicate more time to direct patient interactions.
Nevertheless, existing research is predominantly centered on English-language healthcare services. In many regions, English is neither the primary language of healthcare professionals nor of patients, creating potential disadvantages unless equivalent systems are developed for local languages. Some progress has been made in this area; for instance, as early as 2019, Chinese researchers generated HDRs in Chinese using AI with satisfactory results [25], and more recently researchers showed that generic GenAI services such as ChatGPT also provide close-to-human results for Italian HDRs [22]. However, challenges persist for minority languages. A Peruvian case study indicated that while Spanish-language AI services for HDRs may become feasible, implementation in indigenous languages such as Quechua and Aymara remains unlikely [16].
Finally, none of the identified studies examined the systemic integration of GenAI into hospital software, as most research has focused on feasibility assessments based primarily on GPT within ChatGPT’s interface and using synthetic data. As a result, no existing solutions fully address the outlined challenges in a comprehensive manner within electronic health record (EHR) systems while including privacy protections for real patients. Modern multi-agent approaches, which facilitate collaboration between multiple AI models while structuring data systematically, offer promising capabilities for automating HDR generation. Such advancements suggest that the potential for a fully integrated and effective AI-driven HDR system is substantial.
In this paper, the authors aim to showcase an example of the preclinical testing of a real-life implementation of such solutions in which data protection, hallucination control, EHR integration, and multilingual support are considered. The case is at a real-world preclinical stage: full rollout and validation are not yet complete, the solution is still under development, and it will take a few more iterations and clinical validations before it is implemented system-wide and its benefits and handicaps can be systematically evaluated. Nonetheless, the tool is already designed as a ready-to-use component within the hospital’s EHR, hinting at how GenAI-generated HDRs could be integrated into the clinical workflow and at what iterative evaluation could look like, in contrast to previous research that simply proves AI’s ability to perform the task through generic tools and interfaces and leaves the prompting to the user.

2. Materials and Methods

2.1. Objective

The primary objective of this study is to enable an integrated GenAI solution to generate HDRs within a hospital’s infrastructure with the necessary safeguards and quality control mechanisms for deployment in an operative setting. This application should facilitate the writing process of HDRs by taking a digital clinical course and producing a properly formatted draft that only requires medical validation and minor editing.
This objective involves demonstrating that AI can generate a structured, preliminary, and high-quality hospital discharge summary based on clinicians’ records, which include the patient’s clinical course during hospitalization, results of complementary tests, and emergency department reports (which, as we outlined in the introduction, the literature demonstrates is possible) in a real-life setting and within the established clinical workflows. Such implementation shall also demonstrate that GenAI is suited to offer multilingual support in Catalan and Spanish. Additionally, this study assesses healthcare professionals’ satisfaction with the implementation of this technology.

2.2. Solution Development and Implementation

The widespread adoption of generic GenAI services has significantly lowered the barriers to AI products’ development, enabling rapid and cost-effective design, testing, and production [26,27]. A key factor in this transformation is the availability of base models as services, which shift development efforts from extensive research and technological innovation to a more streamlined process of customization, design, and iterative testing. This shift facilitates an agile design approach, allowing for the rapid iteration of proofs of concept and prototypes. Given the abductive nature of design processes [28], this methodology prioritizes experimental and exploratory approaches over structured research, aligning with grounded theory principles [29]. As a result, both the knowledge generated through this study and the AI-driven HDR service itself emerge as products of an iterative process in which experiments are quickly performed and afterwards evaluated to substantiate conclusions.
To evaluate the feasibility of this approach, the authors adopted an iterative, design-driven methodology [30], implementing the solution within a hospital’s information system for real-world testing. This environment closely resembles a living lab [31], in which collaboration among the private sector, public sector, and academia encourages innovation beyond conventional laboratory settings, and in which development, testing, and real-world experience are prioritized over technical trials.
Following the design cycle—comprising research, ideation, testing, and evaluation—the project evolved through multiple iterations. It progressed from unstructured concept testing to the development of a functional prototype and, ultimately, to a deployable service in the phases shown in Figure 1. This paper presents the process and findings from this iterative development, including the proof of concept, prototype implementation, and an initial limited deployment within the hospital’s information system through the prototype.

2.2.1. Proof of Concept

The initial proof-of-concept phase aimed to identify the optimal prompting strategy and the most suitable LLM for generating HDRs. This phase employed an iterative, exploratory approach, consistent with design-science research methodologies, which emphasize rapid prototyping and iterative refinement to address complex, real-world problems [33]. The process involved testing a range of prompting strategies, from pattern-structured prompting configurations [35] to multi-agent frameworks tailored for specific sub-tasks in least-to-most and maieutic prompting approaches [34,36]. This approach aligns with the principles of design thinking, which advocate for flexibility, non-linear workflows, and iterative ideation to foster innovation [30].
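To make these strategies concrete, the sketch below illustrates the kind of least-to-most, multi-agent decomposition explored in this phase: each agent drafts one HDR section from the clinical course, and a composer step assembles the final report. It assumes a generic chat-completion client passed in as `complete`; the section names and prompt wording are illustrative placeholders, not the study’s actual prompts.

```python
# Illustrative sketch of a least-to-most, multi-agent HDR pipeline.
# `complete` stands in for any chat-completion API call (prompt in, text out);
# prompts and section names are placeholders, not the system's real prompts.
from typing import Callable

SECTIONS = ["admission chronology", "diagnostics", "medication", "follow-up plan"]

def build_section_agent(section: str, complete: Callable[[str], str]) -> Callable[[str], str]:
    """Return an agent that drafts one HDR section from the clinical course."""
    def agent(clinical_course: str) -> str:
        prompt = (
            f"You are a clinical documentation assistant. Using ONLY the notes "
            f"below, draft the '{section}' section of a hospital discharge "
            f"report. If information is missing, write '[not documented]'.\n\n"
            f"{clinical_course}"
        )
        return complete(prompt)
    return agent

def generate_hdr(clinical_course: str, complete: Callable[[str], str]) -> str:
    # Least-to-most: solve each simpler sub-task first, then compose the report.
    drafts = {s: build_section_agent(s, complete)(clinical_course) for s in SECTIONS}
    composer_prompt = (
        "Assemble these section drafts into a single coherent discharge report, "
        "preserving their factual content:\n\n"
        + "\n\n".join(f"## {s}\n{d}" for s, d in drafts.items())
    )
    return complete(composer_prompt)
```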
The evaluation focused on generating three HDRs, each corresponding to a distinct medical specialty: cardiology, pediatrics, and general medicine. These specialties were selected based on input from participating clinicians, ensuring relevance and practical applicability. The selection of standard, commonly encountered discharge reports as test cases was informed by the need to establish a baseline for evaluating the system’s performance in routine clinical scenarios [37]. Each HDR generation trial utilized two primary inputs: an anonymized clinical course summary and a clinician-authored HDR, which served as a reference standard for comparison.
The primary objective of this phase was to generate AI-authored HDRs that comprehensively captured essential clinical information while minimizing inaccuracies and adhering to the established structural format defined by the hospital guidelines. Evaluation was conducted through a triangulated subjective assessment involving two developers and a designer [38,39], focusing on the efficiency and feasibility of the prompting strategies rather than exhaustive precision analysis for each trial. This approach was necessitated by the high volume of iterations, which precluded granular evaluation at this stage, and by the need to balance resource efficiency in development and knowledge production [33]. Subsequent phases (see Section 2.2.2) were planned to focus on system optimization, quality assessment, and formal validation.
Five LLMs were evaluated during this phase (as of autumn 2024): Gemini Pro 1.5 (Google, Mountain View, CA, USA), Claude 3.5 Sonnet (Anthropic, San Francisco, CA, USA), GPT-4o (OpenAI, San Francisco, CA, USA), Mistral (Mistral, Paris, France), and Llama 3.1 (Meta, Menlo Park, CA, USA). Model selection was integrated into the exploration of prompting strategies, guided by five key requirements derived from the clinical context and operational constraints:
  • Multilingual support: Given the bilingual nature of the clinical setting (which might include documents in Catalan and Spanish, often mixed in the same clinical course), the system was optimized for both languages. Language-specific adjustments were implemented to ensure effective comprehension and accurate report generation.
  • Perceived output quality: The variability in clinical courses and discharge reports necessitated a meticulous analysis to capture the full spectrum of linguistic and contextual nuances inherent in medical documentation. In this initial test, assessments were performed by the developers, as medical professionals would intervene later on to improve results.
  • Service stability: Continuous availability during testing was a critical operational requirement. Due to recurrent service interruptions, Claude was excluded from further consideration, underscoring the importance of reliability in deploying AI systems in clinical environments.
  • Price: A good balance between cost and effectiveness is key in LLM selection; if similar results are obtained at different costs, the cheaper model provides a competitive advantage, while a high-cost model that is perceived to perform more poorly than others shall be discarded. Price was calculated based on the average of multiple generations during the testing period.
  • Generation time: While AI can generate documentation much faster than humans, considering the time spent by multiple LLMs is relevant in choosing a good balance between price, speed, and quality. Time was calculated based on the average of multiple generations during the testing period.
These requirements enabled a more structured selection of LLMs: some performed poorly in Catalan, while others at times failed to fulfill requests and thus to offer a stable service. This early-stage experimentation established a foundation for refining prompting strategies, paving the way for a more structured and rigorous evaluation framework.

2.2.2. Prototyping

The second phase focused on developing a functional prototype capable of generating a higher volume of HDRs beyond single-trial outputs. This advancement aimed to facilitate the model’s repetitive use within a system optimized for backend deployment. The primary objective was not only to scale the model but also to validate the quality of AI-generated HDRs and implement iterative improvements as necessary.
During this phase, 60 HDRs were generated across six medical specialties based on data provided by the hospital: cardiology, pediatrics, general medicine, pneumology, surgery, and gynecology. Of these, 27 were manually validated through qualitative comparative analysis between AI-generated and clinician-authored HDRs. This manual assessment involved a content comparison for each of the HDR requirements. Additionally, all generated HDRs underwent evaluation using the ROUGE test where they were compared to HDRs written by clinicians.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a widely used text evaluation metric in automatic translation and summarizing that assesses the quality of generated text by comparing it to a human-written reference [40,41]. This similarity assessment is based on word overlap (ROUGE-1), bigram overlap (ROUGE-2), and longest common subsequence matching (ROUGE-L) [40]. The ROUGE evaluation criteria included the following:
  • Precision: The generated text is considered accurate if it contains only the original information without introducing extraneous content; that is, precision measures the proportion of the AI-generated text that is present in the reference document.
  • Recall: A high recall score indicates that the generated text captures nearly all relevant content from the reference; that is, recall measures the proportion of the reference text that is present in the AI-generated text.
  • F-score: This measure balances precision and recall, ensuring that the text is both comprehensive and of high quality.
ROUGE results were analyzed by averaging the scores across each specialty and language pair (the combination of languages found in the clinical course data). This comparison was critical for refining the model’s design based on the structural and informational characteristics of HDRs within each specialty. An averaged F1–F2 score is also provided as a balance between the ROUGE-1 and ROUGE-2 F-measures. A computational sketch of this scoring procedure is shown below.
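The sketch below illustrates this scoring procedure, assuming the `rouge_score` Python package; the HDR texts and grouping keys are placeholders. Note that the package’s optional stemmer targets English, so it is disabled here for Spanish and Catalan documents.

```python
# A minimal sketch of the ROUGE evaluation, assuming the `rouge_score`
# package (pip install rouge-score). Texts below are placeholders.
from collections import defaultdict
from statistics import mean
from rouge_score import rouge_scorer

# The stemmer is English-specific, so it stays off for Spanish/Catalan.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)

# (specialty, language pair, clinician-written reference, AI-generated draft)
hdr_pairs = [
    ("pneumology", "es-es", "reference HDR text ...", "AI-generated HDR text ..."),
    # ... remaining pairs
]

per_group = defaultdict(list)
for specialty, lang_pair, reference, candidate in hdr_pairs:
    scores = scorer.score(reference, candidate)  # reference first, candidate second
    per_group[(specialty, lang_pair)].append(scores)

# Average precision, recall, and F-measure per specialty/language-pair group.
for group, rows in per_group.items():
    for metric in ("rouge1", "rouge2", "rougeL"):
        p = mean(r[metric].precision for r in rows)
        rc = mean(r[metric].recall for r in rows)
        f = mean(r[metric].fmeasure for r in rows)
        print(f"{group} {metric}: P={p:.2f} R={rc:.2f} F={f:.2f}")
```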
The combined use of ROUGE metrics and partial manual evaluation helped mitigate the expected limitations of ROUGE, particularly its reduced sensitivity to synonym usage and variations in phrase structure [42]. These linguistic variations are common in clinical documentation, where differing styles might favor more narrative or more synthetic approaches to writing, including non-standardized abbreviations, making qualitative assessment a valuable complement to automated evaluation.

2.2.3. Limited Implementation in a Relevant Setting

The final phase of testing centered on deploying the AI-generated hospital discharge report (HDR) service in a real-world hospital environment. Designed for adaptability and scalability, the AI-HDR system accommodates the diverse technological infrastructures and organizational requirements of the Catalan healthcare network. Functioning as an external module, it seamlessly interfaces with hospital information systems. Upon clinician activation, the hospital software backend encrypts the patient’s clinical course and emergency report and transfers them securely to the AI-HDR tool. User interaction remains straightforward: from the ward’s bed management interface, the physician selects a patient and activates the AI-HDR function (“IAIA” in Catalan), triggering a report generation process that typically completes within about two minutes. The physician then reviews and, if necessary, edits the generated report before recording it in the existing hospital information system (Figure 2). Figure 3 shows an example of the user interface.
Following report generation, physicians evaluated the resulting HDR using binary answers to four parameters:
  • Factual errors;
  • Missing information;
  • Writing/style issues;
  • Correct (only minor edits required).
Additionally, clinicians were asked to describe the issues they encountered with the report generation and the time they spent editing it. Given the small sample, the four parameters were analyzed as simple percentages (a sketch of this aggregation is shown below), and greater focus was given to the written evaluations, as they provide more information for system improvement. Unstructured written data were clustered into emergent groups and, where possible, into groups matching the original four-parameter classification. In total, six groups of comments emerged, covering factual errors, missing information, style issues, too lengthy or too short sections, duplicated data, and outdated data. The number of HDRs with comments in each group is also reported as a percentage of the analyzed reports. The differences between the self-reported parameters and those found in the written comments are discussed in Section 4.2. This pilot phase spanned the whole of February and part of March 2025, during which 47 HDRs were generated through AI-HDR and assessed under these criteria. Researchers excluded time from the analysis, as only a few clinicians reported the time they spent editing HDRs.
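As a simple illustration of the four-parameter analysis, the sketch below aggregates the binary physician flags into percentages; the records are fabricated placeholders, not the study’s pilot data.

```python
# A minimal sketch of the four-parameter aggregation: count how many drafts
# received each binary flag and report simple percentages. Records are
# illustrative placeholders, not the actual pilot data.
from collections import Counter

PARAMS = ["factual_errors", "missing_info", "style_issues", "correct"]

evaluations = [  # one dict per reviewed HDR draft
    {"factual_errors": True, "missing_info": True, "style_issues": False, "correct": False},
    {"factual_errors": False, "missing_info": False, "style_issues": False, "correct": True},
    # ... remaining drafts
]

counts = Counter()
for ev in evaluations:
    counts.update(p for p in PARAMS if ev[p])

n = len(evaluations)
for p in PARAMS:
    print(f"{p}: {counts[p]}/{n} ({100 * counts[p] / n:.1f}%)")
```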

2.3. Ethical and Legal Framework

The implementation of the AI-generated HDR system in the public Catalan healthcare sector presents significant challenges in data protection, ethics, and security [43]. To ensure regulatory compliance, patient protection, and system integrity, key mechanisms have been established across four main areas [43]:
  • Regulatory Compliance: AI HDR generation will adhere to European (GDPR), national (LOPDGDD and Law 41/2002), and regional (Decree 105/2000 and the CatSalut Security Plan) regulations [43]. This includes measures such as data anonymization, processing agreements, and security protocols.
  • Technical Security: High-level protection mechanisms will be deployed, including advanced encryption (TLS 1.3 and AES-256), controlled processing environments, encryption of data at rest, and regular audits to ensure system robustness (the payload encryption step is sketched at the end of this subsection).
  • Ethical Governance: A multidisciplinary committee will oversee the deployment of AI-generated HDRs to ensure alignment with bioethical principles, transparency in informed consent, and the prevention of algorithmic bias [43].
  • Post-Market Surveillance: Continuous monitoring of system performance will be conducted through the collection of clinical data, incident management, and iterative reviews to enhance accuracy and adapt to evolving regulations [43].
This comprehensive approach ensures that AI-HDR is implemented with security, fairness, and transparency within the Catalan healthcare framework.
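As an illustration of the encryption requirement above, the following is a minimal sketch of protecting a clinical-course payload with AES-256-GCM before transfer, assuming the Python `cryptography` package. Key management and the hospital’s actual transport layer (TLS 1.3) are outside the scope of the sketch.

```python
# A minimal sketch of AES-256-GCM payload encryption, assuming the
# `cryptography` package (pip install cryptography). Key management and
# the hospital's real transport layer are deliberately out of scope.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_clinical_course(plaintext: str, key: bytes) -> tuple[bytes, bytes]:
    """Encrypt UTF-8 clinical text; returns (nonce, ciphertext-with-tag)."""
    if len(key) != 32:  # AES-256 requires a 256-bit key
        raise ValueError("key must be 32 bytes")
    nonce = os.urandom(12)  # 96-bit nonce, unique per message
    ciphertext = AESGCM(key).encrypt(nonce, plaintext.encode("utf-8"), None)
    return nonce, ciphertext

def decrypt_clinical_course(nonce: bytes, ciphertext: bytes, key: bytes) -> str:
    return AESGCM(key).decrypt(nonce, ciphertext, None).decode("utf-8")

# Usage: in production, the key would come from a key management service.
key = AESGCM.generate_key(bit_length=256)
nonce, blob = encrypt_clinical_course("Curs clínic del pacient ...", key)
assert decrypt_clinical_course(nonce, blob, key).startswith("Curs")
```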

3. Results

3.1. Proof of Concept

The initial proof of concept consisted of identifying which tools performed better against the project requirements and could, therefore, deliver quality HDRs when compared against the documentation written by clinicians. Table 1 shows the performance of each model for each selection criterion, ordered from most suitable to least. The models deemed to perform poorly during iterative testing were not further assessed for multilingual support and service stability, resulting in a lack of data.

3.2. Prototype

During the prototype validation phase, 60 HDRs were generated from six specialties combining multilingual inputs and outputs. The AI-generated outputs were evaluated against reports written by professionals through a ROUGE test. The results of the evaluation are presented in Table 2 for each of the specialties and in Table 3 for each linguistic combination. Language pairs are the combination of languages found in the clinical documentation. All HDRs were generated in the same language as the reference HDR authored by clinicians to allow accurate evaluation.
Scores are generally higher for R1 (unigrams) than for R2 (bigrams) and RL (longest common subsequence), which is expected behavior, as discussed in detail in Section 4.1. In terms of specialties, pneumology scored remarkably high, while gynecology and vascular scores were rather low, though within an acceptable range. Language pairs showed great disparities between combinations; yet beyond the Catalan–Spanish combination, single-language documentation and reports offered results within the expected ranges of specially trained summarization models in the respective language, as further discussed in Section 4.1. It is worth highlighting the generally lower precision scores, especially when compared to the high recall scores.
Given the limited analysis possible from ROUGE scores, 27 of the 60 HDRs were further qualitatively analyzed in a comparative approach (between the human- and AI-generated HDRs) following the contents of the HDR structure. Divergences were consistently found (Table 4), although HDR pairs were overall highly aligned. Of special relevance is the high number of discrepancies (generally, but not necessarily, errors) in medication and dosage and in social and baseline context, which favored human-made reports in which data unavailable in the clinical course appeared.

3.3. Limited Implementation in a Relevant Setting

The limited implementation in a relevant setting followed a progressive rollout of HDR generation in pneumology and surgery, with early testing in pediatrics. Professionals from the participating hospital were given access to the tool within the system and prompted with a survey for result evaluation (Table 5). Throughout February and March 2025, a total of 47 HDRs were generated. In the evaluation survey, more than half of the reports (53.2%) were flagged as having missing information, while slightly less than half had some sort of factual error. Only two reports (4.3%) were correct enough to include no factual errors, missing information, or style issues. Note that an HDR could be flagged with multiple issues, as they are not mutually exclusive (except for the reports marked correct). Note also that these HDRs were never compared to a clinician-written HDR, as professionals were using the AI tool to generate the HDR.
Results were slightly different for the qualitative analysis of the comments made by professionals during evaluation (Table 6). In comments, they reported a far lower number of factual errors (19.1%) and 46.8% of cases with missing information. On the other hand, they reported additional issues such as too lengthy or too short explanations of patients’ evolution (12.8%), outdated data presented as current (2.1%), and duplicated data throughout the HDR (6.4%).

4. Discussion

This study demonstrates the feasibility of applying Generative AI models to the automatic generation of HDRs, with success in both lab and real-world settings. While the results do not demonstrate that the service is ready to be deployed to replace clinicians’ tasks (which was never expected), they offer a sound basis for accelerating professionals’ workflows. We discuss some limits of the analysis methodologies and show that, although the results might appear rather negative, the tool is perceived as beneficial in professional workflows.

4.1. Analysis of ROUGE Evaluation and Its Limitations

The ROUGE family of metrics—unigram overlap (ROUGE-1), bigram overlap (ROUGE-2), and longest common subsequence (ROUGE-L)—was designed to quantify lexical similarity between a candidate summary and a human reference [42]. In our multilingual clinical setting, this yields a quick, reproducible proxy for content fidelity, yet it also exposes blind spots:
  • Stylistic heterogeneity in clinical notes: Clinical course notes oscillate between terse, checklist-like phrasing and richer narrative prose, often within the same service; this variability affects the AI-generated content, whereas humans are better trained at extracting the genuinely useful information and standardizing its format concisely. Moreover, each physician has their own preferences when it comes to HDR writing style, which affects the reference documents against which the AI-generated reports are compared. When the reference is highly narrative and the AI output is concise, recall drops and precision rises; the reverse occurs when the AI is more expansive than the reference. Our overall higher recall scores suggest that reference texts are more succinct while AI-generated texts are more detailed in their explanations. This analysis is in fact substantiated by the qualitative evaluation of 27 HDRs performed alongside the ROUGE tests. Generally, doctors’ HDRs were more telegraphic and factual, whereas the model tended to elaborate, especially around the daily evolution narrative of the patient, an expected behavior of general LLMs when provided with superfluous information, as can be the case in hospital clinical courses.
  • Semantic adequacy beyond surface overlap: As ROUGE compares surface tokens rather than factual information, harmless synonyms or equivalent expressions (abbreviations or differing descriptive wording) are penalized as false positives/negatives. Complementary semantic metrics such as BERTScore [44] or domain-adapted COMET [45] are recommended for future iterations (a usage sketch follows this list), and the multilingual FRESA framework has already demonstrated a better correlation with human judgment in Spanish and Catalan [46]. In our case, we opted for a complementary human qualitative review.
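For reference, the sketch below shows how such a semantic metric could be computed in a future iteration, assuming the `bert-score` package; the texts are placeholders, and the Catalan case would rely on the package’s multilingual fallback, which this sketch does not tune.

```python
# A minimal sketch of a semantic complement to ROUGE, assuming the
# `bert-score` package (pip install bert-score). Texts are placeholders.
from bert_score import score

candidates = ["Paciente que ingresa por disnea progresiva ..."]      # AI drafts
references = ["Ingreso por disnea de una semana de evolución ..."]   # clinician HDRs

# lang="es" selects a Spanish-capable model; Catalan would fall back to a
# multilingual model unless one is chosen explicitly via model_type.
P, R, F1 = score(candidates, references, lang="es", verbose=False)
print(f"BERTScore F1 = {F1.mean().item():.3f}")
```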
Regardless of the mentioned limitations, ROUGE scores suggest positive results when benchmarked against ROUGE evaluations of prose texts such as news. The PEGASUS model was developed with a corpus of text from CNN/Daily Mail and, in English news summarization, offers scores similar to our es-es (Spanish) results [47]. PEGASUS scored R1-F1 0.44, R2-F 0.22, and RL-F 0.41 [47], while our implementation scored R1-F1 0.43, R2-F 0.22, and RL-F 0.24. This comparison suggests that our performance is aligned with what can be expected from more specialized models. Even for the ca-ca pairs, although scores are considerably lower, they are aligned with what can be expected from specialized models in Catalan [48]. Ahuir et al. reported results of between 0.26 and 0.29 for R1-F1, 0.10 and 0.12 for R2-F, and 0.20 and 0.23 for RL-F [48]. To what degree the lower scores for Catalan are due to language structure or to a lack of training data for the models is unclear to the researchers.
On the other hand, when it comes to the different disciplines, the lower ROUGE scores for gynecology and vascular cardiology stem from depressed precision (i.e., the system adds terminology absent from the clinician’s HDR). This hints at discipline-specific narrative conventions: certain services favor tightly templated HDRs, so a general LLM with prompt structures designed for mixed-discipline notes may “over-generate”. Further analysis of the differing HDR structures in these disciplines would be required, as results are at the lower end of what is perceived as acceptable; a custom processing flow with prompts targeting their specific HDR structures should attenuate this mismatch.
In sum, while ROUGE metrics provide a valuable baseline for evaluating content fidelity, they are insufficient on their own. The results, although generally satisfactory, leave room for interpretive ambiguity and require triangulation with qualitative analyses to ensure the accuracy and contextual appropriateness of AI-generated clinical documentation.

4.2. Context of Qualitative Evaluations

The quantitative analyses presented earlier—particularly the ROUGE-based comparison of 60 AI-generated HDRs with clinician originals—show that lexical overlap is, at best, an incomplete proxy for clinical adequacy. This limitation became clear once we submitted a purposive subsample of 27 HDR pairs to close qualitative scrutiny and complemented that exercise with bedside feedback on 47 prospectively generated drafts. The richer evaluation produced two convergent insights: first, that both LLMs and clinicians can introduce facts that have no textual antecedent in the clinical course (“phantom data”); second, that LLMs reproduce values that have been explicitly corrected later in the record, thereby perpetuating errors. Both phenomena explain why clinicians, despite recognizing the productivity gain afforded by the tool, still flagged more than half of the drafts for missing or incorrect information during routine use (53% and 47%, respectively).
The comparative analysis during the prototyping phase aligned each AI draft with its human reference along the standard HDR sections—admission chronology, diagnostics, social context, medications, investigations, and follow-up plan. Divergences were ubiquitous, affecting at least one domain in every pair; the most frequent mismatches concerned social or baseline context (52%), medication details (56%), and admission chronology (44%). In parallel, free-text comments written by ward physicians during a one-month limited deployment revealed a somewhat different hierarchy of concerns: only one-fifth of reports were criticized for factual errors, whereas nearly half were said to lack pertinent information; additional complaints targeted excessive or insufficient narrative length, data duplication, and the presence of outdated values. That is not to say that clinicians misreported factual errors or missing data, but rather that the combination of evaluations provides a fuller picture and that, possibly, the missing data or errors were not critical, although further analysis would be required to validate this hypothesis. In fact, missing data are not necessarily harmful per se, nor an error.
A striking proportion of human–AI discrepancies arose because professionals added information to the report that was not present in the clinical course, as other authors have also noted [49,50], while the AI also added hallucinations, flagging them (at times) for further revision by the professional. These insertions most often concerned social circumstances (e.g., living arrangements and occupational status) that clinicians traditionally dictate from memory or consult in ancillary systems not included in the prompt. From a technical perspective, the model behaves consistently with what is expected of the HDR structure: it predicts that a statement about family context or comorbidity should follow, and therefore fabricates one when the input is silent, or makes its own decision about which data are relevant, which is not necessarily aligned with the clinician’s. For example, in one report the clinician stated that the patient lived in a rural area with two cats, while the LLM stated that the patient lived in a flat without an elevator. While neither was verifiably untrue, the lack of data in the clinical course and differing assessments produced a discrepancy in the report.
The second recurrent pattern concerned the treatment of explicit corrections in the course. The following example illustrates the mechanism: an early entry listed a weight trajectory, but a note written 25 min later read “Correction: weight is from another patient”. Because the prototype concatenates all daily notes into a single text block, temporal relationships might be missed by the LLM, which copied the obsolete weight into the discharge summary. Similar cascades were documented for antibiotic doses and date-of-admission fields, producing factual errors in the documentation. In our audit, 44% of HDR pairs disagreed on admission chronology and 56% on medication doses, proportions that mirror the frequency of free-text complaints about outdated or duplicated data during bedside use. These propagated errors are particularly insidious because they appear authoritative: the authoring clinician has to be fully aware of the course contents to identify these mistakes, as without that information one would not suspect the error.
Two remedial avenues follow directly from these findings. First, prompts should instruct the model to surface uncertainty by leaving placeholders—e.g., “[social context not documented]”—instead of inventing information. The use of agents can be revised to improve the ability to check for factual errors and amend them. Alternatively, the pipeline can be broadened to ingest structured demographics and nursing hand-over notes, bringing the model’s context in line with clinicians’ cognitive workspace. Second, a lightweight information-extraction agent should precede generation, tagging entities with timestamps and detecting corrective phrases (regular expressions for “correction”, “error”, etc.). Such pre-processing would enable the model to privilege the latest mention of a variable and to flag physically implausible outputs before the draft reaches the editor; a minimal sketch of this step is shown below.
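The following sketch illustrates that second avenue: splitting the concatenated course into timestamped notes and flagging corrective phrases with regular expressions, so generation can privilege the latest mention of each variable. The note-header format and the patterns are illustrative assumptions, not the hospital’s actual note layout.

```python
# A minimal sketch of the proposed pre-processing agent: split a concatenated
# clinical course into timestamped notes and flag corrective phrases.
# The header format and regex patterns are illustrative assumptions.
import re
from dataclasses import dataclass

CORRECTION_RE = re.compile(r"\b(correcci[oó]n|error|errata|rectificaci[oó]n)\b", re.IGNORECASE)
NOTE_HEADER_RE = re.compile(r"^\[(\d{2}/\d{2}/\d{4} \d{2}:\d{2})\]", re.MULTILINE)

@dataclass
class Note:
    timestamp: str
    text: str
    is_correction: bool

def split_course(raw_course: str) -> list[Note]:
    """Split a concatenated clinical course into timestamped notes."""
    parts = NOTE_HEADER_RE.split(raw_course)
    # split() with a capturing group yields [prefix, ts1, body1, ts2, body2, ...]
    return [
        Note(ts, body.strip(), bool(CORRECTION_RE.search(body)))
        for ts, body in zip(parts[1::2], parts[2::2])
    ]

course = (
    "[01/02/2025 09:10] Peso 82 kg, tendencia estable.\n"
    "[01/02/2025 09:35] Corrección: el peso corresponde a otro paciente.\n"
)
for note in split_course(course):
    flag = "CORRECTION" if note.is_correction else "note"
    print(note.timestamp, flag, "->", note.text)
```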
The qualitative strand of this study exposes failure modes that evade standard summarization metrics yet carry significant clinical risk. Nonetheless, it is important to contextualize these errors within the tool’s stage of development and the clinicians’ expectations; in conversations, clinicians were generally positive about the use of the tool. In fact, beyond the specific reported discrepancies, the general output was of high quality, in many cases more precise than that of the clinicians. From an experiential point of view, the tool reduces their task to fact-checking a report and performing some modifications instead of having to study the clinical course and craft a new HDR from scratch.

4.3. Overall Study Limitations and Expansion of the Research

In addition to the specific evaluation methods employed in the advanced stages of service development, several broader limitations emerge from the overall design of this case study. Firstly, the initial design process was inherently iterative and somewhat unstructured—as is often characteristic of early-stage design—which, while beneficial for rapid prototyping and refinement, limited the capacity to objectively determine the most appropriate model. Model selection might have been strengthened through a more detailed, case-by-case qualitative analysis of output quality and multilingual capabilities. However, such an approach was complicated by the evolving nature of prompt structures and agent configurations throughout this study. The system’s progression from a focus on prompt engineering to the integration of agent-based architectures enhanced both output quality and adaptability, yet introduced additional complexity to comparative evaluations.
Advanced prompting strategies—including few-shot learning, chain-of-thought reasoning, and domain-specific knowledge integration—proved useful in identifying effective methodologies for clinical documentation generation. Still, more formalized comparative approaches, such as model pair comparisons inspired by A/B testing, could have yielded more reliable insights. These methods could also be valuable during later stages of deployment, where clinical staff could contribute systematically to iterative service improvements, especially as underlying technologies continue to evolve.
Additionally, the inclusion of metrics such as the number of successful outputs and completed tests would have allowed for a more nuanced and objective assessment of system stability. Expanding the range of models tested, including those suitable for local deployment, may also have been advantageous. While local deployment offers greater institutional sovereignty, the choice to offload AI services to third-party providers in early stages was primarily driven by agility, faster access to model updates, and cost efficiency [51,52]. Nevertheless, fine-tuned small language models, although requiring higher upfront investment in infrastructure and training, could potentially offer comparable or superior performance with reduced long-term operational costs [51,52].
Economic feasibility remains a critical constraint, encompassing costs associated with model usage, server infrastructure, licensing, processing time, post-deployment maintenance, and marketing. A comparative analysis of these expenditures against the costs of human-generated clinical documentation—and the corresponding benefits of freeing clinician time—would enhance the understanding of long-term viability.
This study’s limitations are further evident in its small-scale deployment, involving a single hospital and a limited number of reports per phase. The true effectiveness and scalability of the system will require evaluation across multiple sites and disciplines. Future research should address the challenge of limited clinician participation in system evaluation, despite widespread use and apparent time savings. Even simplified evaluation instruments, such as the four-parameter tool used in this study, encountered participation barriers. For instance, data on the time clinicians spent editing AI-generated hospital discharge reports (HDRs)—though explicitly requested—were rarely submitted. To address this in future large-scale implementations, hospital systems should incorporate automated tracking mechanisms to log the time elapsed between the generation of an AI output and the clinician’s final upload to the electronic health record.

5. Conclusions

The AI-HDR prototype (IAIA in Catalan) illustrates the potential for improving healthcare professionals’ efficiency by automating bureaucratic processes and partially offloading routine documentation tasks to AI systems. This automation could free up time for higher-value clinical activities, although the present study does not assess the direct benefits of AI-generated HDRs. The project’s development required approval and collaboration from three key hospital stakeholder groups: administration, digital system managers, and, most critically, medical professionals, whose acceptance was essential not only for data acquisition but also for evaluation, practical implementation in real-world settings, and ethical approval.
Despite promising results, several limitations indicate areas for future research. First, adapting the system to languages beyond Catalan and Spanish may pose challenges, particularly in maintaining terminological accuracy. Even within the studied languages, performance assessments revealed a decline in quality when processing Catalan-language documentation, highlighting the need for more robust multilingual solutions. Second, the considerable variability in the format and style of clinical courses and discharge reports across medical specialties suggests that an initial focus on a single specialty may be necessary to optimize system performance. Additionally, the choice of LLM could influence the approach’s generalizability to different healthcare settings. Future work should also prioritize full integration with electronic health records (EHRs) and conduct pilot studies across multiple clinical environments to evaluate the system’s practical impact and reliability.
In conclusion, this study highlights the potential of Generative AI and LLMs as valuable tools for supporting clinical documentation. While large-scale implementation in hospitals is feasible, it requires a tailored approach for each medical specialty to ensure effective HDR design. This research lays the foundation for future advancements in healthcare management, contributing to the optimization of medical workflows and improved efficiency in clinical settings.

Author Contributions

Conceptualization, A.T.O., E.L.R., M.C.-P., J.C.A.A., T.A.S., E.L.G., X.S.V., D.V.V., R.R.G., C.R.F., J.M.M.i.F. and B.B.G.; investigation, A.T.O., J.S. and M.C.-P.; methodology, A.T.O., M.C.-P., J.F.i.P. and J.M.M.i.F.; software, A.T.O., J.S. and T.A.S.; visualization, J.F.i.P. and J.M.M.i.F.; writing—original draft, J.F.i.P. and J.M.M.i.F.; writing—review and editing, J.F.i.P. All authors ensure that questions related to the accuracy or integrity of any part of the work, even those in which they were not personally involved, are appropriately investigated, resolved, and documented. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This study was conducted in accordance with the Declaration of Helsinki, and approved by the Ethics Committee of Fundació Privada Hospital Asil de Granollers (protocol code: 20243016, version 1.1 dated 10 September 2024; approval date: 24 September 2024).

Data Availability Statement

The datasets used in this study include private and sensitive information (e.g., medical records, personal health information, and data processing structures), which cannot be shared publicly. Please contact the corresponding authors with inquiries regarding medical or technical information.

Acknowledgments

During the preparation of this manuscript/study, the authors used ChatGPT-4o and o1 for the purposes of structure drafting and text writing assistance in human–machine–human iterations. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

A.T.O., M.C.-P., J.M.M.i.F. and J.S. declare to be partners and hold shares of their affiliation Innex Labs S.L. All other authors declare no conflicts of interest.

References

  1. European Commission. Guidelines on Hospital Discharge Report; European Commission: Budapest, Hungary, 2024. [Google Scholar]
  2. Kripalani, S.; LeFevre, F.; Phillips, C.O.; Williams, M.V.; Basaviah, P.; Baker, D.W. Deficits in Communication and Information Transfer Between Hospital-Based and Primary Care Physicians: Implications for Patient Safety and Continuity of Care. JAMA 2007, 297, 831. [Google Scholar] [CrossRef] [PubMed]
  3. Walraven, C.; Seth, R.; Austin, P.C.; Laupacis, A. Effect of Discharge Summary Availability during Post-Discharge Visits on Hospital Readmission. J. Gen. Intern. Med. 2002, 17, 186–192. [Google Scholar] [CrossRef] [PubMed]
  4. Sakaguchi, F.H.; Lenert, L.A. Improving Continuity of Care via the Discharge Summary. AMIA Annu. Symp. Proc. 2015, 2015, 1111–1120. [Google Scholar]
  5. Pal, K.; Bahrainian, S.A.; Mercurio, L.; Eickhoff, C. Neural Summarization of Electronic Health Records. arXiv 2023, arXiv:2305.15222. [Google Scholar]
  6. Al-Damluji, M.S.; Dzara, K.; Hodshon, B.; Punnanithinont, N.; Krumholz, H.M.; Chaudhry, S.I.; Horwitz, L.I. Hospital Variation in Quality of Discharge Summaries for Patients Hospitalized With Heart Failure Exacerbation. Circ. Cardiovasc. Qual. Outcomes 2015, 8, 77–86. [Google Scholar] [CrossRef]
  7. Were, M.C.; Li, X.; Kesterson, J.; Cadwallader, J.; Asirwa, C.; Khan, B.; Rosenman, M.B. Adequacy of Hospital Discharge Summaries in Documenting Tests with Pending Results and Outpatient Follow-up Providers. J. Gen. Intern. Med. 2009, 24, 1002–1006. [Google Scholar] [CrossRef]
  8. Hartman, V.C.; Bapat, S.S.; Weiner, M.G.; Navi, B.B.; Sholle, E.T.; Campion, T.R. A Method to Automate the Discharge Summary Hospital Course for Neurology Patients. J. Am. Med. Inform. Assoc. 2023, 30, 1995–2003. [Google Scholar] [CrossRef]
  9. Sinsky, C.; Colligan, L.; Li, L.; Prgomet, M.; Reynolds, S.; Goeders, L.; Westbrook, J.; Tutty, M.; Blike, G. Allocation of Physician Time in Ambulatory Practice: A Time and Motion Study in 4 Specialties. Ann. Intern. Med. 2016, 165, 753–760. [Google Scholar] [CrossRef]
  10. Downing, N.L.; Bates, D.W.; Longhurst, C.A. Physician Burnout in the Electronic Health Record Era: Are We Ignoring the Real Cause? Ann. Intern. Med. 2018, 169, 50–51. [Google Scholar] [CrossRef]
  11. Sloss, E.A.; Abdul, S.; Aboagyewah, M.A.; Beebe, A.; Kendle, K.; Marshall, K.; Rosenbloom, S.T.; Rossetti, S.; Grigg, A.; Smith, K.D.; et al. Toward Alleviating Clinician Documentation Burden: A Scoping Review of Burden Reduction Efforts. Appl. Clin. Inform. 2024, 15, 446–455. [Google Scholar] [CrossRef]
  12. Rosenbloom, S.T.; Denny, J.C.; Xu, H.; Lorenzi, N.; Stead, W.W.; Johnson, K.B. Data from Clinical Notes: A Perspective on the Tension between Structure and Flexible Documentation. J. Am. Med. Inform. Assoc. 2011, 18, 181–186. [Google Scholar] [CrossRef] [PubMed]
  13. Sutton, R.T.; Pincock, D.; Baumgart, D.C.; Sadowski, D.C.; Fedorak, R.N.; Kroeker, K.I. An Overview of Clinical Decision Support Systems: Benefits, Risks, and Strategies for Success. Npj Digit. Med. 2020, 3, 17. [Google Scholar] [CrossRef] [PubMed]
  14. Jeblick, K.; Schachtner, B.; Dexl, J.; Mittermeier, A.; Stüber, A.T.; Topalis, J.; Weber, T.; Wesp, P.; Sabel, B.O.; Ricke, J.; et al. ChatGPT Makes Medicine Easy to Swallow: An Exploratory Case Study on Simplified Radiology Reports. Eur. Radiol. 2023, 34, 2817–2825. [Google Scholar] [CrossRef]
  15. Spotnitz, M.; Idnay, B.; Gordon, E.R.; Shyu, R.; Zhang, G.; Liu, C.; Cimino, J.J.; Weng, C. A Survey of Clinicians’ Views of the Utility of Large Language Models. Appl. Clin. Inform. 2024, 15, 306–312. [Google Scholar] [CrossRef]
  16. Sahota, P. How We’re Using Generative AI to Support Outpatient Care in Peru. Front. Tech Hub. Available online: https://www.frontiertechhub.org/insights/avatr-generative-ai-learnings-in-project-empatia (accessed on 1 March 2025).
  17. Hartman, V.; Campion, T.R. A Day-to-Day Approach for Automating the Hospital Course Section of the Discharge Summary. AMIA Summits Transl. Sci. Proc. 2022, 2022, 216–225. [Google Scholar]
  18. Zaretsky, J.; Kim, J.M.; Baskharoun, S.; Zhao, Y.; Austrian, J.; Aphinyanaphongs, Y.; Gupta, R.; Blecker, S.B.; Feldman, J. Generative Artificial Intelligence to Transform Inpatient Discharge Summaries to Patient-Friendly Language and Format. JAMA Netw. Open 2024, 7, e240357. [Google Scholar] [CrossRef]
  19. Janota, B.; Janota, K. Application of AI in the Creation of Discharge Summaries in Psychiatric Clinics. Int. J. Psychiatry Med. 2025, 60, 330–337. [Google Scholar] [CrossRef]
  20. Clough, R.A.J.; Sparkes, W.A.; Clough, O.T.; Sykes, J.T.; Steventon, A.T.; King, K. Transforming Healthcare Documentation: Harnessing the Potential of AI to Generate Discharge Summaries. BJGP Open 2024, 8, BJGPO.2023.0116. [Google Scholar] [CrossRef]
  21. Rosenberg, G.S.; Magnéli, M.; Barle, N.; Kontakis, M.G.; Müller, A.M.; Wittauer, M.; Gordon, M.; Brodén, C. ChatGPT-4 Generates Orthopedic Discharge Documents Faster than Humans Maintaining Comparable Quality: A Pilot Study of 6 Cases. Acta Orthop. 2024, 95, 152–156. [Google Scholar] [CrossRef]
  22. Ruinelli, L.; Colombo, A.; Rochat, M.; Popeskou, S.G.; Franchini, A.; Mitrović, S.; Lithgow, O.W.; Cornelius, J.; Rinaldi, F. Experiments in Automated Generation of Discharge Summaries in Italian. In Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC-COLING 2024, Turin, Italy, 20 May 2024; Demner-Fushman, D., Ananiadou, S., Thompson, P., Ondov, B., Eds.; ELRA and ICCL: Turin, Italy, 2024; pp. 137–144. [Google Scholar]
23. Departament de Salut. Implantació del Conjunt Mínim Bàsic de Dades d’Atenció Primària (CMBD-AP) i d’Urgències (CMBD-UR) (06/2012); Departament de Salut, Generalitat de Catalunya: Barcelona, Spain, 2012. [Google Scholar]
24. Kind, A.J.; Smith, M.A. Documentation of Mandated Discharge Summary Components in Transitions from Acute to Subacute Care. In Advances in Patient Safety: New Directions and Alternative Approaches (Vol. 2: Culture and Redesign); Henriksen, K., Battles, J.B., Keyes, M.A., Grady, M.L., Eds.; Agency for Healthcare Research and Quality (US): Rockville, MD, USA, 2008. [Google Scholar]
  25. Xiong, Y.; Tang, B.; Chen, Q.; Wang, X.; Yan, J. A Study on Automatic Generation of Chinese Discharge Summary. In Proceedings of the 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), San Diego, CA, USA, 18–21 November 2019; pp. 1681–1687. [Google Scholar]
  26. Taylor, A. How Real-World Businesses Are Transforming with AI—with More than 140 New Stories. Off. Microsoft Blog. 2025. Available online: https://blogs.microsoft.com/blog/2025/04/22/https-blogs-microsoft-com-blog-2024-11-12-how-real-world-businesses-are-transforming-with-ai/ (accessed on 1 March 2025).
  27. O’Brien, M.; Parvini, S. In 2024, Artificial Intelligence Was All About Putting AI Tools to Work. Assoc. Press News. 2024. Available online: https://apnews.com/article/ai-artificial-intelligence-0b6ab89193265c3f60f382bae9bbabc9 (accessed on 1 March 2025).
  28. Garbuio, M.; Lin, N. Innovative Idea Generation in Problem Finding: Abductive Reasoning, Cognitive Impediments, and the Promise of Artificial Intelligence. J. Prod. Innov. Manag. 2021, 38, 701–725. [Google Scholar] [CrossRef]
  29. Chun Tie, Y.; Birks, M.; Francis, K. Grounded Theory Research: A Design Framework for Novice Researchers. SAGE Open Med. 2019, 7, 2050312118822927. [Google Scholar] [CrossRef] [PubMed]
30. Brown, T. Design Thinking. Harv. Bus. Rev. 2008, 86, 84–92. Available online: https://readings.design/PDF/Tim%20Brown,%20Design%20Thinking.pdf (accessed on 1 March 2025).
  31. Brankaert, R.; Ouden, E. The Design-Driven Living Lab: A New Approach to Exploring Solutions to Complex Societal Challenges. Technol. Innov. Manag. Rev. 2017, 7, 44–51. [Google Scholar] [CrossRef]
  32. Héder, M. From NASA to EU: The Evolution of the TRL Scale in Public Sector Innovation. Innov. J. 2017, 22, 1–23. [Google Scholar]
33. Hevner, A.R.; March, S.T.; Park, J.; Ram, S. Design Science in Information Systems Research. MIS Q. 2004, 28, 75–105. [Google Scholar] [CrossRef]
  34. Zhou, D.; Schärli, N.; Hou, L.; Wei, J.; Scales, N.; Wang, X.; Schuurmans, D.; Cui, C.; Bousquet, O.; Le, Q.; et al. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. arXiv 2022, arXiv:2205.10625. [Google Scholar]
  35. Hao, Y.; Sun, Y.; Dong, L.; Han, Z.; Gu, Y.; Wei, F. Structured Prompting: Scaling In-Context Learning to 1000 Examples. arXiv 2022, arXiv:2212.06713. [Google Scholar]
  36. Jung, J.; Qin, L.; Welleck, S.; Brahman, F.; Bhagavatula, C.; Bras, R.L.; Choi, Y. Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations. arXiv 2022, arXiv:2205.11822. [Google Scholar]
  37. Shortliffe, E.H.; Cimino, J.J. Biomedical Informatics: Computer Applications in Health Care and Biomedicine; Shortliffe, E.H., Cimino, J.J., Eds.; Springer: London, UK, 2014; ISBN 978-1-4471-4473-1. [Google Scholar]
  38. Carter, N.; Bryant-Lukosius, D.; DiCenso, A.; Blythe, J.; Neville, A.J. The Use of Triangulation in Qualitative Research. Oncol. Nurs. Forum 2014, 41, 545–547. [Google Scholar] [CrossRef]
  39. Wang, W.; Duffy, A. A Triangulation Approach for Design Research. In Proceedings of the ICED 09 17th International Conference on Engineering Design, Palo Alto, CA, USA, 24–27 August 2009; The Design Society: Glasgow, Scotland, 2009; Volume 2, pp. 275–286. [Google Scholar]
  40. Auriemma Citarella, A.; Barbella, M.; Ciobanu, M.G.; De Marco, F.; Di Biasi, L.; Tortora, G. Assessing the effectiveness of ROUGE as unbiased metric in Extractive vs. Abstractive summarization techniques. J. Comput. Sci. 2025, 87, 102571. [Google Scholar] [CrossRef]
  41. van Zandvoort, D.; Wiersema, L.; Huibers, T.; van Dulmen, S.; Brinkkemper, S. Enhancing Summarization Performance through Transformer-Based Prompt Engineering in Automated Medical Reporting. arXiv 2023, arXiv:2311.13274. [Google Scholar]
  42. Lin, C.-Y.; Och, F. Looking for a Few Good Metrics: ROUGE and Its Evaluation. In Proceedings of the NTCIR Workshop, Tokyo, Japan, 2–4 June 2004; pp. 1–8. [Google Scholar]
43. Aussó, S.; Berenguer, A.; Aznar, J.; Raventós, C.; Gómez, V.; Bretones, M. Guia de Bones Pràctiques per al Desenvolupament d’Eines d’IA Generativa en Salut. Grans Models de Llenguatge (LLM); Fundació TIC Salut Social, Generalitat de Catalunya: Barcelona, Spain, 2025. [Google Scholar]
  44. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. arXiv 2019, arXiv:1904.09675. [Google Scholar]
  45. Rei, R.; Stewart, C.; Farinha, A.C.; Lavie, A. COMET: A Neural Framework for MT Evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 2685–2702. [Google Scholar]
  46. Saggion, H.; Torres-Moreno, J.-M.; da Cunha, I.; SanJuan, E. Multilingual Summarization Evaluation without Human Models. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, Beijing, China, 23–27 August 2010; Association for Computational Linguistics: Stroudsburg, PA, USA, 2010; pp. 1059–1067. [Google Scholar]
47. Zhang, J.; Zhao, Y.; Saleh, M.; Liu, P.J. PEGASUS: Pre-Training with Extracted Gap-Sentences for Abstractive Summarization. In Proceedings of the 37th International Conference on Machine Learning, Online, 13–18 July 2020; pp. 11328–11339. [Google Scholar]
  48. Ahuir, V.; Hurtado, L.-F.; González, J.Á.; Segarra, E. NASca and NASes: Two Monolingual Pre-Trained Models for Abstractive Summarization in Catalan and Spanish. Appl. Sci. 2021, 11, 9872. [Google Scholar] [CrossRef]
  49. Adams, G.; Alsentzer, E.; Ketenci, M.; Zucker, J.; Elhadad, N. What’s in a Summary? Laying the Groundwork for Advances in Hospital-Course Summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tur, D., Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T., Zhou, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 4794–4811. [Google Scholar]
  50. Ando, K.; Komachi, M.; Okumura, T.; Horiguchi, H.; Matsumoto, Y. Is In-Hospital Meta-Information Useful for Abstractive Discharge Summary Generation? In Proceedings of the 2022 International Conference on Technologies and Applications of Artificial Intelligence (TAAI), Taichung, Taiwan, 8–10 December 2022; pp. 143–148. [Google Scholar] [CrossRef]
  51. Irugalbandara, C.; Mahendra, A.; Daynauth, R.; Arachchige, T.K.; Dantanarayana, J.; Flautner, K.; Tang, L.; Kang, Y.; Mars, J. Scaling Down to Scale Up: A Cost-Benefit Analysis of Replacing OpenAI’s LLM with Open Source SLMs in Production. In Proceedings of the 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Indianapolis, IN, USA, 5–7 May 2024. [Google Scholar]
52. Lohn, A.J. Scaling AI: Cost and Performance of AI at the Leading Edge; Center for Security and Emerging Technology: Washington, DC, USA, 2023. [Google Scholar]
Figure 1. The three phases through which the solution was tested and iterated, mapped against the technology readiness level (TRL) scale used to define the maturity of a technology; the TRL model is recommended by the European Commission for assessing technology acquisition [32].
Figure 2. Schematic diagram illustrating the human–machine interaction.
Figure 3. Screenshot of the hospital software interface (developed in-house at Hospital General de Granollers, Granollers, Catalonia, Spain) showing the AI-HDR service output after receiving the response via an API call to “IAIA” (developed in-house for the present study, v0, Innex Labs S.L., Vilanova i la Geltrú, Catalonia, Spain), accessed from the bed-ward interface.
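For context on the integration shown in Figure 3, the sketch below illustrates how a bed-ward client could request a draft from the AI-HDR service over HTTP. The endpoint URL, payload fields, token, and response schema are hypothetical placeholders; the paper does not publish the IAIA API contract.

```python
# Hedged sketch of a bed-ward client requesting a draft HDR; every
# identifier below (URL, fields, token) is an assumption, not the
# study's actual API.
import requests

payload = {
    "episode_id": "EP-000000",            # hypothetical episode identifier
    "clinical_course_notes": "...",       # free-text ward notes
    "emergency_report": "...",            # ED report, if available
    "output_language": "ca",              # ca | es, per the multilingual tests
}
resp = requests.post(
    "https://ehr.example.org/iaia/v0/hdr-draft",   # placeholder endpoint
    json=payload,
    headers={"Authorization": "Bearer <token>"},
    timeout=300,  # generation took roughly 2-5 min in the Table 1 benchmarks
)
resp.raise_for_status()
draft = resp.json()["draft_hdr"]          # assumed response field
print(draft)
```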
Table 1. LLM model evaluation in early abductive trials. “+” and “-” reflect a Harris-profile-style perceptive qualitative evaluation of positive and negative points. Cells are left empty where no data are available because the model was excluded before that requirement was evaluated.

| Model | Google’s Gemini Pro 1.5 | Anthropic’s Claude 3.5 Sonnet | OpenAI’s GPT-4o | Mistral | Llama 3.1 |
|---|---|---|---|---|---|
| Summaries Performance | ++ | ++ | ++ | +- | +- |
| Multilingual Support | Yes | Yes | No | No | No |
| Service Stability | Yes | No | | | |
| Average Price (EUR) | 0.25 | 0.22 | 0.42 | 0.22 | 0.30 |
| Average Generation Time (s) | 147 | 133 | 215 | >300 | 200 |
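The average generation times in Table 1 could be reproduced with a small timing harness such as the sketch below; `generate_hdr` stands in for a hypothetical wrapper around each provider’s SDK and is not code from the study.

```python
# Minimal timing-harness sketch for the Table 1 latency column;
# generate_hdr() is a hypothetical provider wrapper, not the study's code.
import time
from statistics import mean
from typing import Callable

def benchmark(model_name: str,
              cases: list[str],
              generate_hdr: Callable[[str, str], str]) -> dict:
    """Time one HDR generation per test case and report the average latency."""
    latencies = []
    for notes in cases:
        start = time.perf_counter()
        generate_hdr(model_name, notes)  # network call to the LLM provider
        latencies.append(time.perf_counter() - start)
    return {"model": model_name, "avg_generation_s": mean(latencies)}
```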
Table 2. ROUGE results for the generation of HDRs with AI in six specialties.

| Specialty | R1-Precision | R1-Recall | R1-F1 | R2-Precision | R2-Recall | R2-F1 | RL-Precision | RL-Recall | RL-F1 |
|---|---|---|---|---|---|---|---|---|---|
| Gynecology | 0.16 | 0.49 | 0.24 | 0.06 | 0.19 | 0.09 | 0.08 | 0.26 | 0.12 |
| Vascular surgery | 0.18 | 0.49 | 0.25 | 0.07 | 0.19 | 0.10 | 0.10 | 0.28 | 0.15 |
| Urology | 0.22 | 0.57 | 0.31 | 0.11 | 0.27 | 0.15 | 0.14 | 0.36 | 0.19 |
| Surgery | 0.31 | 0.60 | 0.40 | 0.13 | 0.26 | 0.17 | 0.14 | 0.29 | 0.18 |
| Cardiology | 0.52 | 0.43 | 0.44 | 0.21 | 0.19 | 0.19 | 0.24 | 0.21 | 0.21 |
| Pneumology | 0.55 | 0.61 | 0.57 | 0.31 | 0.35 | 0.33 | 0.31 | 0.35 | 0.33 |
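As an illustration of how the per-report scores behind Table 2 can be computed, the following sketch uses Google’s `rouge_score` package to obtain ROUGE-1/2/L precision, recall, and F1 for one clinician/AI report pair. The stemmer is disabled because the package’s Porter stemmer targets English rather than Spanish or Catalan; the toy strings are placeholders, not study data.

```python
# Sketch (not the authors' pipeline): scoring one HDR pair with rouge_score.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=False)

def score_pair(reference: str, candidate: str) -> dict:
    """Return ROUGE-1/2/L (precision, recall, F1) for one HDR pair."""
    scores = scorer.score(reference, candidate)
    return {name: (s.precision, s.recall, s.fmeasure)
            for name, s in scores.items()}

# Toy strings standing in for a clinician HDR (reference) and the AI draft:
human_hdr = "Patient admitted with community-acquired pneumonia ..."
ai_hdr = "The patient was admitted for pneumonia ..."
print(score_pair(human_hdr, ai_hdr))
```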
Table 3. ROUGE results for the generation of multilingual HDRs across three language pairs (clinical course language to HDR language).

| Language Pair | R1-Precision | R1-Recall | R1-F1 | R2-Precision | R2-Recall | R2-F1 | RL-Precision | RL-Recall | RL-F1 | No. of Cases |
|---|---|---|---|---|---|---|---|---|---|---|
| es-es | 0.39 | 0.57 | 0.43 | 0.20 | 0.28 | 0.22 | 0.22 | 0.33 | 0.24 | 35 |
| ca-es | 0.24 | 0.46 | 0.29 | 0.08 | 0.17 | 0.10 | 0.11 | 0.24 | 0.14 | 16 |
| ca-ca | 0.18 | 0.52 | 0.26 | 0.07 | 0.20 | 0.10 | 0.09 | 0.26 | 0.13 | 9 |
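Assuming the per-report scores are collected into tabular rows, the per-language-pair averages and case counts of Table 3 could be derived with a groupby aggregation like the sketch below; the column names and the two sample rows are illustrative only.

```python
# Sketch of aggregating per-report ROUGE rows into Table 3; the layout
# and values are assumptions, not the study's data.
import pandas as pd

rows = [
    {"pair": "es-es", "metric": "rouge1", "precision": 0.41, "recall": 0.58, "f1": 0.45},
    {"pair": "ca-es", "metric": "rouge1", "precision": 0.22, "recall": 0.44, "f1": 0.28},
    # ... one row per report and ROUGE variant (35 + 16 + 9 reports in Table 3)
]
df = pd.DataFrame(rows)
# mean reproduces the score columns; count reproduces "No. of Cases" per pair
summary = df.groupby(["pair", "metric"])[["precision", "recall", "f1"]].agg(["mean", "count"])
print(summary)
```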
Table 4. Qualitative comparison of human- and AI-generated HDRs.

| HDR Domain | Mismatched Cases | Patterns Observed |
|---|---|---|
| Admission chronology | 12/27 (44%) | Date or level-of-care pathway (e.g., ED vs. ward) wrong or missing in the AI report. Examples: Case 4 (01-06 vs. 31-05), Case 5 (04-05 vs. 03-05), Case 6 (three different dates documented). |
| Discharge date | 6/27 (22%) | The AI usually matched; the human version occasionally noted an extra day (e.g., Case 11). |
| Social/baseline context | 14/27 (52%) | Human versions added living situation, occupation, or functional status; the AI included some of these details but not always the same ones as the clinician, and omitted information (e.g., Cases 2 and 6), some of which was not found in the course documentation. |
| Diagnostic labels | 11/27 (41%) | Humans frequently added comorbid or situational diagnoses (e.g., SARS-CoV-2, acidosis, and psychiatric history) that the AI left out or phrased more broadly (e.g., Cases 3, 14, and 25). |
| Medication and dosage | 15/27 (56%) | Divergences in dose (amoxicillin 1 g q8h vs. 500 mg q12h), omission of gastro-protection, or heparin/NSAIDs appearing only in the human report (Cases 4, 7, and 10). |
| Investigations | 10/27 (37%) | The AI gave a list without dates or values, whereas the human version specified results or added studies (extra ECG and sensitivity panel) (Cases 6, 10, and 22). |
| Follow-up plans | 12/27 (44%) | Human follow-ups were generally vague, while the AI added extra details, in some cases hallucinated (e.g., Cases 6 and 25). |
Table 5. HDRs flagged by professionals for each of the expected issues an AI-generated HDR could contain.

| Issue | Reports (n) | Percentage of HDRs |
|---|---|---|
| Factual errors | 22 | 46.8% |
| Missing information | 25 | 53.2% |
| Writing/style issues | 13 | 27.7% |
| Correct (only minor edits needed) | 2 | 4.3% |
Table 6. Qualitative analysis of medical professional comments on the AI-generated HDRs.

| Issue | Reports (n) | Percentage of HDRs |
|---|---|---|
| Factual errors | 9 | 19.1% |
| Missing information | 22 | 46.8% |
| Lengthy or too short evolution | 6 | 12.8% |
| Writing/style issues | 5 | 10.6% |
| Duplicated data | 3 | 6.4% |
| Outdated data | 1 | 2.1% |
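If the reviewer flags behind Tables 5 and 6 are stored per report, the percentages follow from a simple tally over the 47 pilot drafts (e.g., 22/47 ≈ 46.8%). The sketch below is illustrative, with hypothetical flag labels and data layout.

```python
# Illustrative tally of reviewer flags; labels and layout are assumptions,
# not the study's schema. Reports may carry several flags, so the
# percentages need not sum to 100%.
from collections import Counter

flags_per_report = [
    {"factual_errors", "missing_information"},
    {"missing_information"},
    # ... one set of flags per reviewed HDR (47 in the pilot)
]
n = len(flags_per_report)
counts = Counter(flag for flags in flags_per_report for flag in flags)
for flag, c in counts.most_common():
    print(f"{flag}: {c} ({100 * c / n:.1f}%)")
```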
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
