Article

One Report, Multifaceted Views: Multi-Expert Rewriting for ECG Interpretation

1 Department of Convergence Software, Hallym University, Chuncheon-si 24252, Republic of Korea
2 Department of Neurology, Chuncheon Sacred Heart Hospital, Chuncheon-si 24253, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(17), 9376; https://doi.org/10.3390/app15179376
Submission received: 31 July 2025 / Revised: 21 August 2025 / Accepted: 25 August 2025 / Published: 26 August 2025

Abstract

Data scarcity is a significant barrier to developing high-performing AI models for medical text classification. To improve stroke prediction from electrocardiogram (ECG) interpretation reports, where training data are scarce, we propose a novel data augmentation technique, Multi-Expert Perspective Augmentation. We use the LLM Phi-4 to rewrite original machine-generated ECG reports from the simulated perspectives of five different medical specialists. This prompt-based approach generates text with diverse clinical viewpoints while preserving the original meaning. We trained a BiomedBERT-based Gradient Boosting model to classify reports for stroke presence, comparing a baseline model trained only on the original data against a model trained with our augmented data. The model trained solely on original data showed poor performance (F1-score: 0.6698) with a severe precision–recall imbalance. In contrast, the augmented model achieved a significantly improved and balanced performance, with an F1-score of 0.8421, an accuracy of 0.8427, a precision of 0.8571, and a recall of 0.8276. Our method also outperformed other LLM-based augmentation techniques. Our findings demonstrate that rewriting text from multiple simulated expert perspectives is an effective strategy for data augmentation, enhancing the linguistic and contextual diversity of training data and leading to a more balanced and accurate classification model.

1. Introduction

In recent years, artificial intelligence (AI) has driven significant progress in medicine by supporting a wide range of clinical tasks. The growing availability of digital medical data, such as electronic health records and clinical reports, has been central to this progress [1]. Textual data—including physicians’ notes and interpretation reports—are critical because they describe a patient’s condition and the reasoning behind clinical decisions. For example, electrocardiogram (ECG) reports provide key information about cardiac activity and are essential for diagnosing heart disease. Specific abnormalities in ECGs, such as atrial fibrillation, are also known to increase the risk of stroke. As a result, tools that automatically classify or analyze ECG reports play a vital role in early diagnosis, risk assessment, and clinical decision support [1].
A major obstacle to developing AI models for medical text is the lack of high-quality training data. Because medical data are sensitive and complex, their use is tightly restricted. Publicly available clinical text datasets are therefore small and fragmented [2]. Privacy regulations further complicate data sharing, making it difficult to build large, annotated datasets, which have been identified as a significant factor limiting the performance of medical machine learning models [3]. In practice, deep learning models for natural language processing (NLP) require large datasets to perform well. However, they often struggle in the medical domain due to limited data and class imbalance [1].
To address this issue, researchers have investigated various text augmentation techniques. Traditional methods include synonym replacement and back translation. More recently, advanced methods use generative models to create synthetic clinical text. However, text augmentation has unique challenges. Unlike images, even minor edits in text can change its meaning, making it difficult to preserve the original semantics and labels [4]. For this reason, conventional augmentation methods have shown limited effectiveness in the medical domain.
The rise of Large Language Models (LLMs) offers new opportunities. Several studies have explored the use of LLMs for data augmentation in clinical contexts. For example, Van Nooten et al. reported using Generative Pre-trained Transformer (GPT)-3.5 to generate synthetic social media posts related to vaccine refusal, which improved classification performance when added to training data [5]. Lu et al. used GPT-2 to generate synthetic clinical notes and improved patient readmission prediction [6]. Similarly, Bird et al. showed that GPT-2 could enhance the classification of EEG/EMG signals by generating biosignal data [7]. While promising, LLM-generated text can be highly dependent on training data and prompts, which limits its diversity and realism.
In this study, we propose a new augmentation method that rewrites ECG reports from the perspectives of multiple medical specialists. Unlike prior work that focused on paraphrasing or single-expert simulation, our method prompts an LLM to generate diverse but clinically faithful rewrites of the same report. We chose the Phi-4 model [8] for its strong reasoning and language abilities, which make it well-suited for simulating nuanced expert perspectives while remaining relatively efficient. This approach increases both the surface-level and contextual diversity of the data while preserving the original labels. Thus, it helps the model learn more robust features, reduces overfitting, and improves generalization.
To evaluate this approach, we constructed a Gradient Boosting model based on BiomedBERT [9] and tested it with the augmented dataset. Compared with training only on the original reports, the augmented data led to clear improvements in performance. Accuracy increased from 0.5035 to 0.8427, precision from 0.5053 to 0.8571, recall from 0.9931 to 0.8276, and F1-score from 0.6698 to 0.8421. Notably, the model with augmented data achieved balanced precision and recall, overcoming the class imbalance problem.
The rest of this paper is organized as follows. Section 2 reviews related work on text augmentation. Section 3 introduces our augmentation technique and classification model. Section 4 presents the experimental setup and results. Section 5 discusses implications and limitations. Section 6 concludes the study and outlines directions for future research.

2. Related Work

2.1. Traditional Text Augmentation Methods

In NLP, researchers have explored a broad range of techniques for augmenting text data. A representative example is Easy Data Augmentation (EDA) by Wei and Zou, which diversifies data via four simple operations: synonym replacement, random insertion, random swap, and random deletion [10]. Although EDA is easy and fast to implement, its overly simplistic transformations can produce awkward contexts or inadvertently alter critical medical terminology, thereby harming downstream performance. Along the same lines, Huong and Hoang proposed a synonym substitution-based augmentation technique for Vietnamese sentiment analysis to increase data diversity, and Qiu et al. introduced EasyAug, an automated platform that systematizes random augmentation for classification tasks [11,12].
To preserve semantics while altering surface forms, back translation—translating a source sentence into another language and then translating it back—is widely used [13]. While back translation often induces relatively low semantic drift compared to random perturbations, its efficacy is ultimately constrained by the quality of machine translation and potential domain mismatch, which can introduce unnatural phrasing or subtle semantic errors.
Another prominent line of work is contextual word replacement, where pretrained language models (e.g., BERT) predict contextually appropriate substitutions instead of relying on random synonyms. For instance, Wu et al. proposed a label-conditioned BERT-based augmentation method that replaces words while respecting the sentence’s label information, demonstrating superior semantic preservation compared to random synonym replacement [14].
Researchers have also proposed more knowledge- and optimization-driven strategies. Shi et al. introduced Medical Data Augmentation (MDA), which leverages a medical knowledge graph to identify technical terms in Chinese clinical texts and replace them with semantically consistent alternatives, thereby achieving better performance than general-purpose methods [15]. Xiang et al. explored lexicon-based augmentation to model subtle emotional nuances at the word level for sentiment analysis [16]. Kumar et al. employed submodular optimization to efficiently select diverse paraphrases, empirically validating the effectiveness of augmentation [17]. In addition, adversarially inspired approaches have emerged, in which a conditional generator and a conditional paraphraser collaborate under the guidance of a pretrained discriminator to produce diverse yet semantically faithful sentences, aiming to improve both robustness and performance [18].
Overall, however, traditional augmentation techniques often fail to reflect the subtle nuances found in reports by human experts. This limitation is particularly critical in medical texts, where the choice of terminology can carry diagnostic significance. The naive application of these methods can lead to critical errors such as semantic drift, where the methods subtly alter a term’s clinical meaning, or outright hallucinations, where they generate fabricated clinical information, posing significant risks.

2.2. Prompt-Based Text Augmentation Methods

Prompt-based augmentation has recently gained traction as Large Language Models (LLMs) have demonstrated strong zero-/few-shot generative capabilities without additional fine-tuning. By providing carefully crafted instructions or roles, LLMs such as GPT-3.5/4 can generate label-conditioned, paraphrased, or entirely new samples that increase both the size and diversity of training corpora. Empirical studies report that adding LLM-generated sentences to training data can improve performance on tasks such as sentiment analysis and medical sentence classification. For example, Piedboeuf and Langlais showed that paraphrasing training sentences with ChatGPT 3.5 led to greater gains than back translation in both news and sentiment classification [19]. Ubani et al. demonstrated that zero-shot augmentation—prompting GPT-3.5 to directly generate class-specific samples—can enhance classification performance without providing example instances [20].
Beyond paraphrasing or label-conditioned generation, researchers have also compared rewriting vs. de novo generation strategies. Zhao et al. found that generating new text conditioned on labels outperformed sentence-level rewrites of existing samples [21]. Peng et al. proposed CoTAM, which manipulates only specific task attributes through Chain-of-Thought (CoT)-guided editing, showing substantial gains in low-resource scenarios across multiple NLP tasks [22].
Parallel to these efforts, there is a growing interest in expert simulation and multi-agent collaboration with LLMs. Liu et al. introduced PersonaFlow, which orchestrates multiple virtual expert personas to brainstorm research ideas, improving both creativity and specificity through diverse perspectives [23]. Li et al. proposed a multi-agent, role-specialized framework in which LLMs debate and reach consensus, yielding more accurate and reliable solutions to complex scientific problems compared to a single LLM [24]. While these approaches are not primarily designed for medical text augmentation, they highlight a crucial insight: LLMs can serve as virtual domain experts, producing heterogeneous, knowledge-rich textual outputs that extend beyond surface-level paraphrasing.
While data augmentation for ECG signal data is a well-established field of study, the augmentation of ECG interpretation text remains a relatively underexplored area. The existing LLM-based text augmentation studies show promise but often generate generic paraphrases that may lose clinical nuance. Unlike these previous approaches, our method emulates the domain reasoning of multiple distinct specialists. By simulating varied expert perspectives for the same clinical finding, our Multi-Expert Perspective Augmentation aims to generate a dataset that is not only larger but also richer in clinically grounded, diverse interpretations, directly addressing a key gap in the literature.

2.3. Recent Advances in LLM-Based Clinical Text Augmentation

Recent research has increasingly turned to Large Language Models to address the challenges of data scarcity in clinical natural language processing tasks. In particular, several studies in 2025 have proposed novel augmentation strategies leveraging LLMs to generate synthetic clinical texts that improve downstream model performance while preserving domain-specific fidelity.
Liu et al. proposed a framework using locally hosted open-source LLMs to synthesize emergency limb X-ray radiology reports and evaluated their utility for detecting misdiagnosed fractures [25]. They showed that, in the best few-shot settings, synthetic reports used alone could recover over 90% of the classification performance achieved with real data (baseline macro-F1 0.825 vs. synthetic-only ≈ 0.76). They also found that augmented training improved performance and that prompting strategy and model choice affected diversity and downstream gains.
Wei et al. proposed a structured LLM-based augmentation framework for clinical Named Entity Recognition (NER) and Relation Extraction (RE) [26]. They encode entities and relations within note segments using structured markers and prompt an LLM to rewrite the segment while preserving the label structure and contextual relationships; the LLM varies entity surface forms to introduce expression diversity and decodes its outputs to produce updated entity and relation annotations. The system recombines augmented segments with surrounding context to form full synthetic clinical notes and uses them to train a segmentation-based BlueBERT-large model; it fuses segment-level embeddings via a BiLSTM to restore global context before NER and RE prediction. Evaluations on i2b2-2012, N2C2-2018, and a proprietary Truveta dataset showed consistent F1 improvements under both strict and lenient settings. However, gains under strict evaluation were modest in some cases, and strict NER sometimes exhibited slight decreases in recall.
Šuvalov et al. generated synthetic Estonian EHRs using a locally trained GPT-2 and annotated these texts primarily with GPT-4 and fine-tuned an XLM-RoBERTa NER model [27]. The downstream model achieved F1 scores of 0.69 and 0.38 for drug and procedure extraction on a real-world test set, respectively. The study demonstrates the feasibility of LLM-assisted synthetic annotation for non-English clinical contexts, providing a privacy-preserving pipeline for scalable NER training.
While these recent works mark meaningful progress in LLM-based clinical text augmentation, several limitations remain. First, most approaches rely on sentence-level rewriting or structured paraphrasing but do not explicitly account for the nuanced, expert-level diagnostic reasoning often embedded in clinical narratives. Second, although researchers report performance gains, these improvements tend to be task-specific, as they focus on classification or entity extraction rather than generating holistic, diverse interpretations of medical findings. Lastly, challenges such as semantic drift, hallucination, and overfitting to synthetic patterns continue to pose risks in real-world clinical applications.
To address these gaps, our proposed method—Multi-Expert Perspective Augmentation—simulates diverse expert reasoning by generating multiple interpretations of the same clinical event. Rather than simply paraphrasing sentences or replicating entity structures, our method emphasizes semantic diversity, clinical nuance, and interpretive variability grounded in domain knowledge. This emphasis on semantic diversity and expert-grounded variability enables us to create synthetic datasets that better reflect the breadth of real-world expert reasoning, advancing both data quality and model generalizability.

3. Methodology

Figure 1 illustrates the architecture of the proposed method. An original ECG interpretation report was input into an LLM with instructions to rewrite the report from the perspectives of five experts. For each original report, the LLM generated five new reports, which we then compiled to build the ‘Generated ECG interpretation dataset’. This dataset, along with the original report data, was used to train a BiomedBERT-based Gradient Boosting Classification Model to classify the presence or absence of a stroke. The classification threshold was the optimal value selected on a validation set.

3.1. Dataset

In this study, we utilized ECG interpretation text data collected from Hallym University Chuncheon Sacred Heart Hospital. The original dataset comprised a total of 3798 reports, including 2220 reports from healthy subjects (the control group) and 1578 reports from patients with stroke (the patient group). A machine automatically generated these reports.
The original reports contained meta-information tags irrelevant to diagnostic content, such as “[장비판독]” (Machine Interpretation) and “*** Poor data quality.” We removed these tags prior to analysis. After additionally removing duplicate entries, the cleaned dataset consisted of 1236 reports in total, comprising 638 reports for the control group and 598 reports for the patient group.
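To make the cleaning step concrete, the following is a minimal sketch, assuming the reports are loaded into a pandas DataFrame with a hypothetical "text" column; the file name and helper function are illustrative, not the authors' actual preprocessing code.

import re
import pandas as pd

df = pd.read_csv("ecg_reports.csv")  # hypothetical file name

def clean_report(text: str) -> str:
    # Strip meta-information tags irrelevant to diagnostic content.
    text = text.replace("[장비판독]", "")
    text = re.sub(r"\*\*\* Poor data quality.*", "", text)
    return text.strip()

df["text"] = df["text"].apply(clean_report)
# Remove duplicate reports, keeping the first occurrence.
df = df.drop_duplicates(subset="text").reset_index(drop=True)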
Many reports contained phrases that appear regardless of class, such as “Normal sinus rhythm,” which describe general cardiac status but are not indicative of stroke presence. Notably, the reports did not include explicit statements directly confirming or ruling out stroke, ensuring that the classification task relies on indirect features rather than overt diagnostic labels.
To illustrate the dataset, Figure 2 presents representative lead I ECG visualizations for a stroke case (upper panel) and a normal case (lower panel). In parallel, Table 1 provides the corresponding machine-generated ECG interpretation texts for these same ECGs, showing typical examples of Normal and Stroke reports.

3.2. Multi-Expert Perspective Augmentation

To increase the diversity of our training data, we applied a text augmentation method that asks an LLM to rewrite ECG reports from the perspectives of different medical specialists. The goal of this approach was to keep the essential clinical information intact while using the LLM’s knowledge to generate alternative ways of expressing the same findings. In this way, we simulated how different experts might describe identical results. By producing multiple variations of each report, we expanded the linguistic and contextual diversity of the dataset. This process captures subtle differences in vocabulary that specialists might use, removes the need for manual annotation, and increases the number of realistic samples. As a result, the classification model can learn from a broader range of expressions and styles, which helps improve its generalization performance.
Clinical interpretation reports naturally vary in expression and nuance. The same ECG finding can be described differently depending on the expert or context. For example, one specialist might write as follows: “ECG shows accelerated junctional rhythm and nonspecific T wave abnormalities, impacting systemic circulation assessment.” Another might phrase it as follows: “ECG presents accelerated junctional rhythm and nonspecific T wave abnormalities, indicating potential arrhythmias and myocardial ischemia.” Although the meaning is the same, these variations in terminology, detail, and sentence structure reflect the natural diversity of clinical expression.
Equation (1) represents the proposed augmentation method, where $D = \{d_1, d_2, \ldots, d_n\}$ denotes the original ECG report dataset. The augmented dataset $D_{aug}$ is constructed by generating $K$ rewritten versions of each report $d_i$. $\pi(\theta, p_k, d_i)$ denotes the output of the LLM that rewrites the input $d_i$ using the prompt $p_k$, where $\theta$ represents the model parameters and $p_k$ corresponds to the perspective of the $k$-th medical expert. In this study, $K$ was set to 5.
$$D_{aug} = \bigcup_{k=1}^{K} \{\pi(\theta, p_k, d_i) \mid d_i \in D\} \quad (1)$$
For this Multi-Expert Perspective Augmentation, we used Microsoft’s Phi-4 model, a language model designed for efficient and high-quality reasoning. Phi-4 generated concise, machine-like versions of the original ECG reports. We provided each report with a structured prompt that instructed the model to: (i) rewrite it into five different versions, each from the viewpoint of a specialist (cerebrovascular specialist, neurologist, cardiologist, radiologist, and circulatory specialist); (ii) use medically accurate but concise language consistent with machine-generated reports; (iii) avoid repetition, speculation, or explicit mention of stroke status; and (iv) return the results strictly in a valid JSON format, with each specialist’s version as a separate field. Table A1 shows the full prompt, and Table 2 presents an example of the augmented outputs.
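To make this loop concrete, the following is a minimal sketch of the rewriting procedure, assuming the Hugging Face checkpoint name "microsoft/phi-4" and a recent transformers version whose text-generation pipeline accepts chat-format messages; the helper name augment_report and the discard-on-invalid-JSON behavior are illustrative, not the authors' exact code.

import json
from transformers import pipeline

# Assumed checkpoint identifier; the paper uses Phi-4 [8] without naming one.
generator = pipeline("text-generation", model="microsoft/phi-4", device_map="auto")

SPECIALISTS = ["Cerebrovascular Specialist", "Neurologist", "Cardiologist",
               "Radiologist", "Circulatory Specialist"]

def augment_report(original_report: str, system_prompt: str) -> dict:
    """Return {specialist: rewritten report} for one original ECG report."""
    messages = [
        {"role": "system", "content": system_prompt},  # full prompt in Table A1
        {"role": "user", "content": (
            f'Here is the original ECG interpretation:\n\n"{original_report}"\n\n'
            "Please generate five distinct variations as per the given instructions.")},
    ]
    # Generation settings from Section 4.2: temperature 1.0, top-p 0.9, 500 new tokens.
    reply = generator(messages, max_new_tokens=500, do_sample=True,
                      temperature=1.0, top_p=0.9)[0]["generated_text"][-1]["content"]
    try:
        rewrites = json.loads(reply)  # one field per specialist, per the prompt
    except json.JSONDecodeError:
        return {}  # invalid outputs are discarded, as noted in Section 4.3
    return {k: v for k, v in rewrites.items() if k in SPECIALISTS}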
For instance, the original text “Sinus tachycardia Low voltage QRS Borderline ECG” is presented as a simple list of key terms. Using our augmentation method, the cardiologist’s version highlights cardiac function and recommends “further cardiac assessment and potential imaging,” while the circulatory specialist emphasizes “circulatory system disturbances” and suggests “further hemodynamic evaluation.” These role-based rewrites produce subtle but meaningful differences in vocabulary, interpretation, and recommended follow-up actions, thereby enhancing the qualitative diversity of the data.
This prompt-based rewriting method is both efficient and flexible. By adjusting prompts or generation parameters, we can easily generate diverse clinical styles. Since our approach rewrites existing reports rather than generating completely new ones, it reduces the risk of producing unrealistic content or factual errors. By preserving the original clinical meaning while varying terminology and sentence structure, our method discourages the model from overfitting to superficial patterns and instead encourages it to learn stronger associations between a broader range of clinical expressions and their diagnostic labels. In this way, our Multi-Expert Perspective Augmentation produces data that are both linguistically diverse and clinically reliable, ultimately improving the performance and generalization of the classification model.

3.3. BiomedBERT-Based Gradient Boosting Classification Model

In our classification stage, we developed a hybrid model by adapting the GrowNet architecture [28], a gradient boosting-based neural network. We chose GrowNet because it performs well with limited data; its boosting process can build a strong classifier even when training examples are scarce, which fits our problem setting. Our main innovation was replacing the standard Multi-layer Perceptron (MLP) weak learners in GrowNet with BiomedBERT, a BERT variant pre-trained on the biomedical literature.
As shown in Figure 3, our model does not use BiomedBERT only once as a feature extractor. Instead, it employs BiomedBERT as a sequence of domain-specific weak learners within the boosting framework. Each BiomedBERT learner receives the original ECG report and focuses on correcting the errors made by the previous learner. This design integrates deep language understanding into the step-by-step error correction process of gradient boosting. By combining BiomedBERT’s representational strength with the boosting mechanism, we aimed to achieve both high predictive accuracy and robust performance.
The final prediction of our model is formulated as shown in Equation (2), where $d$ is the input ECG report text. The term $h_m(d; \theta_m)$ represents the $m$-th weak learner, which is a BiomedBERT model with its own set of parameters $\theta_m$. This weak learner takes the raw text $d$ as direct input. The outputs of all $M$ weak learners (in our study, $M = 3$) are summed and passed through a sigmoid function, $\sigma(\cdot)$. The final binary prediction, $\hat{y}$, is determined by the indicator function $\mathbb{I}(\cdot)$, which returns 1 if the resulting probability exceeds the optimal threshold $\tau$, and 0 otherwise.
$$\hat{y} = \mathbb{I}\left(\sigma\left(\sum_{m=1}^{M} h_m(d; \theta_m)\right) > \tau\right) \quad (2)$$
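A minimal sketch of the forward pass of Equation (2) follows; the checkpoint identifier is an assumption (the paper cites BiomedBERT [9] without naming one), and GrowNet's sequential residual-fitting training loop is omitted for brevity.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed BiomedBERT checkpoint name.
CHECKPOINT = "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext"

class BiomedBERTBoostedClassifier(torch.nn.Module):
    """M BiomedBERT weak learners whose summed logits implement Equation (2)."""

    def __init__(self, num_learners: int = 3):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
        self.learners = torch.nn.ModuleList(
            AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=1)
            for _ in range(num_learners))

    def forward(self, reports: list[str]) -> torch.Tensor:
        enc = self.tokenizer(reports, padding=True, truncation=True, return_tensors="pt")
        # Sum h_m(d; theta_m) over the M weak learners, then apply the sigmoid.
        summed = sum(learner(**enc).logits.squeeze(-1) for learner in self.learners)
        return torch.sigmoid(summed)

    @torch.no_grad()
    def predict(self, reports: list[str], tau: float = 0.629) -> torch.Tensor:
        # Indicator function of Equation (2): 1 if the probability exceeds tau.
        return (self.forward(reports) > tau).long()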
Finally, for model inference, rather than using a fixed threshold, we selected the optimal threshold of 0.629, as this value maximizes the F1-score on the validation data and ensures a realistic classification performance by balancing precision and recall.
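A minimal sketch of this threshold search, where val_probs and val_labels are hypothetical names for the validation probabilities and labels:

import numpy as np
from sklearn.metrics import f1_score

def find_optimal_threshold(val_probs: np.ndarray, val_labels: np.ndarray) -> float:
    # Sweep candidate thresholds and keep the one that maximizes validation F1.
    candidates = np.linspace(0.0, 1.0, 1001)
    f1s = [f1_score(val_labels, (val_probs > t).astype(int)) for t in candidates]
    return float(candidates[int(np.argmax(f1s))])

# For the augmented model, this procedure yielded tau = 0.629.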

4. Experimental Results

4.1. Evaluation

To quantitatively evaluate the effectiveness of our data augmentation technique, we constructed two models and compared their performance: one model trained solely on the original data, and another model trained on the original data combined with the augmented data. We evaluated the performance of each model based on four metrics: accuracy, precision, recall, and F1-score.
Equation (3) shows the formulas for calculating each evaluation metric for the classification model, where $i$ indexes the evaluation metric and $j$ the type of training data: $orig$ denotes the original data and $aug$ denotes the original data combined with the augmented data. $M_j$ is the classification model trained using data type $j$, and $D_{test}$ represents the test dataset. $\mathrm{metric}_i(\cdot)$ is the function that calculates evaluation metric $i$, and $i_j$ is the value of that metric achieved by the model on the test dataset.
$$i_j = \mathrm{metric}_i\big(M_j(D_{test})\big), \quad i \in \{\text{Accuracy}, \text{Precision}, \text{Recall}, \text{F1}\}, \; j \in \{orig, aug\} \quad (3)$$
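Under these definitions, the evaluation reduces to computing four standard metrics on the test predictions; a minimal sketch with scikit-learn, where y_true and y_pred are hypothetical names for the test labels and a trained model's thresholded predictions:

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def evaluate(y_true, y_pred) -> dict:
    # One entry per metric i in Equation (3).
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
    }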

4.2. Experimental Setup

We conducted experiments on a single Nvidia A100 GPU. To ensure diversity during text generation, we set the temperature to 1.0, top-p to 0.9, and the maximum number of new tokens to 500. For the classification model, we set the number of weak learners to 3, used the AdamW optimizer with a learning rate of 2 × 10−5, and set the batch size to 32.
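For reference, these settings can be collected into a small configuration sketch; the structure below is hypothetical (the paper lists the values but not the code).

import torch
from torch.optim import AdamW

# Generation settings for the augmentation LLM.
GEN_CONFIG = dict(temperature=1.0, top_p=0.9, max_new_tokens=500, do_sample=True)
# Training settings for the BiomedBERT-based boosting classifier.
TRAIN_CONFIG = dict(num_weak_learners=3, lr=2e-5, batch_size=32)

def make_optimizer(model: torch.nn.Module) -> AdamW:
    return AdamW(model.parameters(), lr=TRAIN_CONFIG["lr"])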

4.3. Performance on Original vs. Augmented Data

First, we compared the performance of a model trained solely on the original data with that of a model trained on a combination of the original data and data augmented from our proposed expert perspectives. To construct the augmented dataset, we generated five rewritten reports for each original text, yielding 3189 augmented reports for the control group and 2976 for the patient group; the yield fell slightly below the intended total because the LLM occasionally failed to produce valid output. Combined with the originals, the final dataset comprised 3827 control and 3574 patient reports, 7401 reports in total. The combined dataset was randomly split into training (80%), validation (10%), and test (10%) sets, ensuring that reports derived from the same original sample did not overlap across splits.
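A minimal sketch of such a leakage-free split using group-aware sampling; texts, labels, and source_ids are hypothetical arrays, where source_ids[i] identifies the original report that sample i was derived from.

from sklearn.model_selection import GroupShuffleSplit

def group_split(texts, labels, source_ids, seed: int = 42):
    # 80% train vs. 20% remainder, grouped by originating report.
    gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    train_idx, rest_idx = next(gss.split(texts, labels, groups=source_ids))
    # Split the remainder 50/50 into validation and test, still grouped.
    gss2 = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=seed)
    val_rel, test_rel = next(gss2.split(
        [texts[i] for i in rest_idx], [labels[i] for i in rest_idx],
        groups=[source_ids[i] for i in rest_idx]))
    val_idx = [rest_idx[i] for i in val_rel]
    test_idx = [rest_idx[i] for i in test_rel]
    return train_idx, val_idx, test_idx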
By rewriting existing reports instead of fabricating entirely new clinical scenarios, our method inherently reduced the risk of generating unrealistic cases. Nonetheless, to assess potential risks of hallucination, we conducted a manual review of 100 randomly sampled augmented reports. This review confirmed no critical factual inconsistencies that would alter the primary diagnosis, supporting the reliability of our augmentation strategy.
Table 3 and Figure 4 summarize the performance of the two models. The model trained solely on the original data achieved an accuracy of 0.5496 (95% CI: 0.4953–0.6012), with precision at 0.5549 (0.5000–0.6062) and recall at 0.9827 (0.9609–1.0000). This imbalance indicates that the model primarily classified samples as belonging to the patient group, which secured very high recall but led to a surge in False Positives and degraded precision. The optimal threshold was set at 0.547, indicating that the model achieved its best F1-score of 0.7089 (0.6625–0.7510) by labeling even low-confidence predictions as the patient class.
In contrast, the model trained with our augmented dataset exhibited markedly improved and well-balanced performance, achieving an accuracy of 0.8434 (0.8037–0.8816), precision of 0.8373 (0.7857–0.8876), and recall of 0.8935 (0.8491–0.9351). As a result, the F1-score rose to 0.8642 (0.8264–0.8985), with an AUROC of 0.9025 (0.8690–0.9319). Significantly, the optimal threshold increased to 0.629, reflecting the model’s enhanced ability to distinguish between control and patient groups with higher prediction confidence.
Beyond accuracy, the AUROC results underscore the improvement achieved by our approach. While the original model attained only 0.6290 (0.5675–0.6911), the augmented model reached 0.9025 (0.8690–0.9319), demonstrating robust discriminative power across different thresholds. Clinically, this means that the augmented model is not only reliable at its optimal operating point but also resilient under varying decision thresholds, an important property for deployment in real-world screening settings where clinicians often adjust sensitivity and specificity to context.
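The paper does not state how the 95% confidence intervals were obtained; a common choice is a nonparametric bootstrap over the test set, sketched here under that assumption.

import numpy as np
from sklearn.metrics import f1_score

def bootstrap_ci(y_true, y_pred, metric=f1_score, n_boot=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        stats.append(metric(y_true[idx], y_pred[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)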
Taken together, these findings demonstrate that our augmentation strategy substantially improves both the quantitative performance metrics and the clinical reliability of the classification model. The balanced precision–recall profile, along with the significant AUROC gain, suggests that the augmented model can deliver clinically meaningful predictions, reducing unnecessary false alarms while maintaining high sensitivity to stroke-related cases.

4.4. Comparison with Other Augmentation Techniques

To evaluate the effectiveness of our proposed Multi-Expert Perspective Augmentation technique, we included two other recent LLM-based methods in our experiment for comparison: Chain-of-Thought Attribute Manipulation (CoTAM) and another LLM-based data augmentation (DA) method.
CoTAM is a data augmentation technique that directly edits text by leveraging CoT prompting. Unlike traditional methods that manipulate attributes in the latent space, CoTAM (1) dynamically decomposes an input sentence into multiple attributes, (2) formulates a manipulation plan to alter only the target attribute, and (3) reconstructs the sentence based on the modified attribute. This process effectively manipulates the target attribute while preserving the remaining attributes, which ensures the text’s semantic consistency. Table A2 shows the prompt used for this process. This approach generates augmented data that are more controllable and interpretable.
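A minimal sketch of the CoTAM prompting step, reusing a chat-capable text-generation pipeline with the same assumed checkpoint as in the Section 3.2 sketch; the prompt text follows Table A2, and the helper name is illustrative.

from transformers import pipeline

generator = pipeline("text-generation", model="microsoft/phi-4", device_map="auto")

def cotam_rewrite(original_report: str, author: str) -> str:
    messages = [
        {"role": "system", "content": ("You are a helpful assistant that augments ECG "
                                       "interpretation data based on author perspective changes.")},
        {"role": "user", "content": (
            f'"{original_report}"\n'
            "Please think step by step:\n"
            '1. What are some other attributes of the above sentence except "Author: Machine"?\n'
            f'2. How to rewrite this sentence with the same attributes, but with "Author: {author}"?\n'
            "3. The rewritten sentence should remain concise, retain the machine-generated format,\n"
            "and reflect the perspective of the new author. Write only the final sentence without any explanation.")},
    ]
    reply = generator(messages, max_new_tokens=500, do_sample=True, temperature=1.0, top_p=0.9)
    return reply[0]["generated_text"][-1]["content"].strip()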
The DA method, on the other hand, can be divided into two types: rewriting existing training data to augment them with similar yet varied samples, or generating entirely new data based on a given label. Table A3 shows the prompt used for this approach. While the rewriting approach ensures data diversity while maintaining the original context, the new data generation approach can increase diversity more significantly by incorporating more novel information.
The dataset sizes produced by each method were similar in scale: CoTAM generated 3183 control and 2979 patient reports, while DA produced 3123 control and 2956 patient reports. Table 4 summarizes the classification performance of models trained on these datasets, compared to our proposed approach.
Table 4 and Figure 5 show that our Multi-Expert Perspective Augmentation achieved the best overall performance, with an accuracy of 0.8434 (95% CI: 0.8037–0.8816), precision of 0.8373 (0.7857–0.8876), recall of 0.8935 (0.8491–0.9351), F1-score of 0.8642 (0.8264–0.8985), and AUROC of 0.9025 (0.8690–0.9319).
When examining optimal thresholds, we observed distinct patterns across the methods. CoTAM achieved the highest recall (0.9052) but required a very low optimal threshold (0.528). The low threshold indicates that the model tended to classify even low-confidence predictions as positive, boosting recall at the cost of precision (0.7359) and resulting in an F1-score of 0.8114. This outcome suggests that, while CoTAM increased data diversity, it also introduced noise that weakened discriminative power. In contrast, DA achieved a relatively high precision (0.7920) by adopting a stricter decision boundary with the highest optimal threshold (0.713). However, this strictness reduced recall (0.7460) and limited the F1-score to 0.7678. Our proposed Multi-Expert Perspective Augmentation maintained a more balanced optimal threshold of 0.629, which enabled it to achieve both high precision (0.8373) and recall (0.8935), ultimately delivering the strongest F1-score (0.8642) and AUROC (0.9025).
From a clinical standpoint, these results highlight an important distinction. Methods like CoTAM, which maximize recall through a low threshold, may lead to excessive false alarms, while DA’s high threshold sacrifices sensitivity and risks missing actual positive cases. In contrast, our approach provides a balanced trade-off that enhances both model confidence and clinical reliability, ensuring accurate detection while minimizing unnecessary follow-ups.

4.5. Comparison with Other LLMs

Finally, we investigated how classification performance varies depending on the type of LLM used for data augmentation. While our primary experiments employed the Phi-4 14B model, we also applied the same augmentation procedure using two alternative models with different architectures and training objectives: HyperCLOVA X Seed 3B [29] and Gemma-2 9B [30].
The dataset sizes produced by each model showed notable differences. Gemma-2 9B generated 3182 control and 2961 patient reports, a scale comparable to Phi-4 14B. In contrast, HyperCLOVA X Seed 3B produced substantially fewer reports (2523 control and 2133 patient), which limited the potential training diversity. Table 5 and Figure 6 summarize the performance of the classification models trained with these augmented datasets.
The choice of LLM led to distinct variations in downstream performance. Gemma-2 9B achieved high recall (0.8943) but lower precision (0.8245), yielding an F1-score of 0.8576. This trade-off was consistent with its relatively low optimal threshold (0.537), which indicates that the model tolerated more False Positives to maximize sensitivity. While Gemma-2’s instruction-tuned design enables it to follow prompts effectively and generate varied outputs, its general-purpose pre-training may have limited its ability to capture specific clinical nuances, lowering precision.
In contrast, HyperCLOVA X Seed 3B produced the weakest results, with an F1-score of 0.7439. Despite using a comparatively high optimal threshold (0.651)—which should favor precision—the model underperformed overall. This outcome is likely due to its smaller size, training focus on Korean and multimodal data, and reduced ability to produce nuanced English clinical text, leading to less expressive augmentations and weaker discriminative power in the classifier.
Our proposed Phi-4 14B delivered the strongest overall results, with balanced precision (0.8373) and recall (0.8935) and the highest F1-score (0.8642). Its optimal threshold of 0.629 reflects this balance, allowing the model to achieve both high sensitivity and specificity. We attribute this superior performance to Phi-4’s specialized pre-training on a mix of “textbook-quality” synthetic data and carefully filtered web text, which equips it to simulate expert-like clinical reasoning more effectively.
Taken together, these results highlight that, while model scale influences augmentation quality, the nature and quality of pre-training data are equally critical. LLMs trained with domain-relevant, reasoning-oriented corpora can generate clinically meaningful augmentations that translate into better downstream classification performance.

4.6. Comparison with Traditional Gradient Boosting Models

To further evaluate the effectiveness of the proposed BiomedBERT-based Gradient Boosting model, we compared its performance against two widely used ensemble methods: AdaBoost and XGBoost [31,32]. We trained both traditional models on the same final combined dataset described in Section 3 and evaluated them using the same test set.
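The paper does not specify the text representation fed to these baselines; the sketch below assumes TF-IDF features for illustration, with train_texts, train_labels, and test_texts as hypothetical variables.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier

# Assumed featurization: TF-IDF over the report texts.
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

ada = AdaBoostClassifier(n_estimators=100, random_state=42).fit(X_train, train_labels)
xgb = XGBClassifier(n_estimators=100, eval_metric="logloss", random_state=42).fit(X_train, train_labels)

ada_pred = ada.predict(X_test)
xgb_pred = xgb.predict(X_test)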
Table 6 and Figure 7 summarize the performance of all three models. AdaBoost achieved an accuracy of 0.5386 (95% CI: 0.4825–0.5944), precision of 0.7847 (0.5383–1.0000), recall of only 0.0794 (0.0362–0.1259), and an F1-score of 0.1434 (0.0685–0.2180), with an AUROC of 0.5319 (0.4588–0.6004). The extremely wide confidence interval for precision reflects the instability of the model. Although its positive predictions were occasionally precise, it consistently failed to recall positive cases and therefore missed most of them. This imbalance indicates that AdaBoost was overly conservative, classifying very few samples as positive and thereby achieving high precision at the expense of sensitivity.
XGBoost performed slightly better, with an accuracy of 0.5836 (0.5280–0.6399), precision of 0.9184 (0.7894–1.0000), recall of 0.1639 (0.1026–0.2333), and an F1-score of 0.2769 (0.1840–0.3736), alongside an AUROC of 0.5312 (0.4583–0.6002). The narrow recall confidence interval indicates consistently low sensitivity despite the high precision: XGBoost correctly classified only a small fraction of stroke cases, again favoring precision over recall. In a clinical context, such a trade-off is problematic, as missing actual stroke cases poses significant risks.
In contrast, our proposed BiomedBERT-based Gradient Boosting model demonstrated markedly superior and more balanced performance. It achieved an accuracy of 0.8434 (0.8037–0.8816), precision of 0.8373 (0.7857–0.8876), recall of 0.8935 (0.8491–0.9351), and an F1-score of 0.8642 (0.8264–0.8985). The AUROC also improved substantially to 0.9025 (0.8690–0.9319). Unlike AdaBoost and XGBoost, both precision and recall were simultaneously high with relatively narrow confidence intervals, indicating stable generalization and reduced risk of overfitting.
Taken together, these findings demonstrate that, while AdaBoost and XGBoost suffered from extreme precision–recall imbalance and unreliable performance as reflected in wide confidence intervals, the proposed BiomedBERT-based Gradient Boosting model consistently delivered robust, balanced, and clinically meaningful predictions.

5. Discussion

We propose a novel LLM based data augmentation technique that rewrites text from expert perspectives to address the problem of data scarcity in ECG interpretation report classification. Based on our experimental results, we now discuss the strengths and limitations of the proposed methodology and reflect on our efforts to enhance performance.

5.1. Methodological Strengths and Limitations

The proposed Multi-Expert Perspective Augmentation not only increases the quantity of data but also significantly enhances its qualitative diversity by effectively reflecting the expressive differences among specialists while preserving clinical meaning. The Multi-Expert Perspective Augmentation using the Phi-4 model achieved both flexibility and accuracy in expression, resulting in a substantial improvement in the F1-score from 0.6698 (using only the original data) to 0.8421. Notably, it recorded the best performance compared to CoTAM and the other DA methods, demonstrating that the linguistic and contextual diversity of clinical texts is crucial for improving classification performance.
However, the proposed methodology has several limitations. The expressive diversity of the augmentation process is highly dependent on the LLM performance and prompt design, meaning the results can be sensitive to prompt fine-tuning and the choice of LLM. To mitigate this, we designed prompts with defined expert perspectives to obtain maximum expressive diversity while maintaining clinical accuracy.

5.2. Ensuring Clinical Accuracy

We placed strong emphasis on ensuring the clinical accuracy of the generated augmented text. To achieve this, we designed clear, expert perspective-based prompts that allowed us to obtain diverse expressions while preserving the clinical meaning of the original reports as much as possible. Specifically, we clearly distinguished the viewpoints of various specialists to reflect the clinical context of each perspective accurately.

5.3. Computational Efficiency and Real-World Applicability

Since we utilized a medium-to-large LLM like Phi-4 for data augmentation, there may be constraints in terms of hardware resources and computational costs that affect real-world applicability. To mitigate this, we streamlined the augmentation process in this study by designing efficient prompts and setting optimal generation parameters.

5.4. Limitations Related to Data Source and Generalizability

This study used ECG interpretation reports obtained exclusively from a single institution—Hallym University Chuncheon Sacred Heart Hospital—which may limit the generalizability of the findings. The original reports in this dataset share a uniform machine-generated style and institution-specific terminology, which may have influenced both the augmentation process and the downstream classification performance. Moreover, the demographic and clinical characteristics of the patient population may differ from those in other hospitals, such as age distributions, prevalence of specific cardiac conditions, or the frequency of comorbidities.
To address these limitations, future research will focus on building a multi-center dataset using ECG interpretation reports collected from other regional hospitals within the Hallym University Medical Center network. By incorporating data from multiple hospitals that differ in patient demographics, comorbidity patterns, and reporting conventions, the augmentation model can be adapted to capture a broader range of vocabulary, abbreviations, and stylistic variations. The same augmentation–classification pipeline will then be applied to each hospital’s dataset to evaluate performance in both within-institution and cross-institution scenarios. This multi-center validation will help ensure that the proposed approach remains robust and clinically reliable across diverse real-world environments.

5.5. Incorporating Explainable AI for Model Interpretability

In addition to improving quantitative performance, it is crucial to ensure that the proposed classification model makes predictions based on clinically meaningful features. To this end, we incorporated eXplainable AI (XAI) techniques, specifically Local Interpretable Model-agnostic Explanation (LIME), to analyze word-level contributions to model predictions [33].
Figure 8 presents representative examples of LIME visualizations for ECG interpretation texts classified as either Normal or Stroke. The highlighted words indicate how the model weighted specific terms when making its decision. For instance, in a Normal case, terms such as “normal” and “ECG” were strongly associated with the Normal class, while words like “sinus” showed weaker associations. Conversely, in a Stroke case, terms such as “ST depression”, “junctional”, and “variant” were highlighted as contributing to the Stroke classification, whereas the repeated occurrence of “normal” still influenced the model toward the Normal class.
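A minimal sketch of this LIME analysis, assuming predict_proba is a hypothetical callable that maps a list of report strings to an (n, 2) array of class probabilities from the trained classifier.

from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=["Normal", "Stroke"])
explanation = explainer.explain_instance(
    "Sinus rhythm ST depression junctional rhythm variant",  # hypothetical report
    predict_proba,      # classifier function: list[str] -> probabilities
    num_features=10,    # top contributing words to display
)
print(explanation.as_list())  # word-level contributions to the prediction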
These visualizations demonstrate that the model distributes importance across multiple clinically relevant terms, rather than relying on a single lexical cue. This balanced attribution pattern suggests that the model captures broader contextual meaning, which aligns more closely with expert reasoning in clinical practice. Moreover, such explanations help identify potential biases—for example, the overemphasis of the word “normal” in specific Stroke predictions—which can guide future refinements of the augmentation strategy and classification model.
By integrating LIME-based interpretability into our workflow, we ensure not only high predictive performance but also transparency in the decision-making process. Future work will expand this effort by incorporating additional XAI techniques such as SHAP, which can provide complementary insights into feature contributions at both the local and global levels.

6. Conclusions

We proposed an LLM-based data augmentation technique that rewrites ECG interpretation reports from various expert perspectives to expand scarce text data in the medical domain, thereby improving the generalization performance of classification models. The data generated through our proposed method secured expressive diversity while maintaining the original clinical meaning. Furthermore, when classifying for the presence of a stroke with a BiomedBERT-based Gradient Boosting Classification Model, we confirmed a significant performance improvement compared to the model trained only on the original reports.
For future work, it will be necessary to employ more diverse prompt strategies. Specifically, we could increase the number of experts for rewriting or reflect the perspectives of various stakeholders. Additionally, to systematically verify the clinical validity of the generated text, we must conduct qualitative reviews by actual clinical experts to evaluate factors such as diagnostic consistency between the generated and original sentences and the naturalness of the expressions. To enhance applicability in realistic clinical settings, we must also establish strategies to reduce memory usage and inference time by applying lightweight LLMs instead of larger models like Phi-4.

Author Contributions

Conceptualization, Y.-S.K. and C.K.; methodology, Y.-H.K.; formal analysis, Y.-H.K.; resources, Y.-S.K.; data curation, C.K.; writing—original draft preparation, Y.-H.K.; writing—review and editing, Y.-S.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by an Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) [RS-2021-II212068, Artificial Intelligence Innovation Hub (Artificial Intelligence Institute, Seoul National University)] and a National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. RS-2022-NR070859).

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board at Chuncheon Sacred Heart Hospital (IRB No. 2021-07-009).

Informed Consent Statement

Patient consent was waived because informed consent was already obtained as part of the Institutional Review Board approval process.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1

This section describes the prompts used to augment text data.
Table A1. Prompts used for Multi-Expert Perspective Augmentation.
Prompt
{"role": "system", "content": """
You are a medical expert specializing in ECG interpretation. Your task is to rewrite an ECG report in a concise and machine-style format.\n\n
**Instructions:**\n
- Rewrite the report into five distinct versions.\n
- Each version should reflect the perspective of a different specialist: cerebrovascular specialist, neurologist, cardiologist, radiologist, and circulatory specialist.
- Use medically appropriate but concise phrasing.
- Avoid redundant explanations or detailed speculation.
- Do not explicitly state whether the patient has had a stroke.
- Keep the style aligned with machine-generated reports: brief, objective, and direct.\n\n
- **Output must be in valid JSON format only. Any deviation is not allowed.**\n\n
**Format:**\n
{"Cerebrovascular Specialist": "<Report>",
"Neurologist": "<Report>",
"Cardiologist": "<Report>",
"Radiologist": "<Report>",
"Circulatory Specialist": "<Report>"
}\n
"""},
{"role": "user", "content": f"Here is the original ECG interpretation:\n\n\"{original_report}\"\n\n"
"Please generate five distinct variations as per the given instructions."}
Table A2. Prompts used for CoTAM.
Prompt
{"role": "system", "content": "You are a helpful assistant that augments ECG interpretation data based on author perspective changes."},
{"role": "user", "content": f"""
\"{original_report}\"
Please think step by step:
1. What are some other attributes of the above sentence except "Author: Machine"?
2. How to rewrite this sentence with the same attributes, but with "Author: {author}"?
3. The rewritten sentence should remain concise, retain the machine-generated format,
and reflect the perspective of the new author. Write only the final sentence without any explanation.
"""}
Table A3. Prompts used for DA.
Prompt for Rewrite
{"role": "user",
"content": ("Rewrite the following medical report in a different way without changing its meaning.\n"
"- Output only the rewritten sentence. Do not include any explanation, heading, or prefix.\n"
"- Do not use quotation marks or markdown.\n"
"- Keep the output medically consistent.\n\n"
f"{original_text}")}
Prompt for Generation
{"role": "user",
"content": (f"You are a clinical report writer. Your task is to write a short ECG report that would likely belong to class {label}.\n"
f"- Write a report of about {max_words} words.\n"
"- The content must be medically realistic.\n"
"- Output only the report text. Do not include any explanation, prefix, quotes, or markdown.\n"
"- Output should be a single paragraph.")}

Appendix A.2

This section presents the LIME visualization results for representative Normal and Stroke samples from the predictions of the proposed model, AdaBoost, and XGBoost.

References

  1. Chen, X.; Du, Y. Enhancing Medical Text Classification with GAN-Based Data Augmentation and Multi-Task Learning in BERT. Sci. Rep. 2025, 15, 13854. [Google Scholar] [CrossRef] [PubMed]
  2. Amin-Nejad, A.; Ive, J.; Velupillai, S. Exploring Transformer Text Generation for Medical Dataset Augmentation. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; Calzolari, N., Béchet, F., Blache, P., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Isahara, H., Maegaard, B., Mariani, J., et al., Eds.; European Language Resources Association: Marseille, France, 2020; pp. 4699–4708. [Google Scholar]
  3. Sufi, F. Addressing Data Scarcity in the Medical Domain: A GPT-Based Approach for Synthetic Data Generation and Feature Extraction. Information 2024, 15, 264. [Google Scholar] [CrossRef]
  4. Bayer, M.; Kaufhold, M.-A.; Reuter, C. A Survey on Data Augmentation for Text Classification. ACM Comput. Surv. 2023, 55, 1–39. [Google Scholar] [CrossRef]
  5. Van Nooten, J.; Daelemans, W. Improving Dutch Vaccine Hesitancy Monitoring via Multi-Label Data Augmentation with GPT-3. In Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis, Toronto, ON, Canada, 14 July 2023; Barnes, J., De Clercq, O., Klinger, R., Eds.; Association for Computational Linguistics: Toronto, ON, Canada, 2023; pp. 251–270. [Google Scholar]
  6. Lu, Q.; Dou, D.; Nguyen, T.H. Textual Data Augmentation for Patient Outcomes Prediction. In Proceedings of the 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Houston, TX, USA, 9–12 December 2021; pp. 2817–2821. [Google Scholar]
  7. Bird, J.J.; Pritchard, M.; Fratini, A.; Ekárt, A.; Faria, D.R. Synthetic Biological Signals Machine-Generated by GPT-2 Improve the Classification of EEG and EMG Through Data Augmentation. IEEE Robot. Autom. Lett. 2021, 6, 3498–3504. [Google Scholar] [CrossRef]
  8. Abdin, M.; Aneja, J.; Behl, H.; Bubeck, S.; Eldan, R.; Gunasekar, S.; Harrison, M.; Hewett, R.J.; Javaheripi, M.; Kauffmann, P.; et al. Phi-4 Technical Report. arXiv 2024, arXiv:2412.08905. [Google Scholar] [CrossRef]
  9. Gu, Y.; Tinn, R.; Cheng, H.; Lucas, M.; Usuyama, N.; Liu, X.; Naumann, T.; Gao, J.; Poon, H. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. ACM Trans. Comput. Healthc. 2021, 3, 1–23. [Google Scholar] [CrossRef]
  10. Wei, J.; Zou, K. EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; Association for Computational Linguistics: Hong Kong, China, 2019; pp. 6382–6388. [Google Scholar]
  11. Huong, T.H.; Hoang, V.T. A Data Augmentation Technique Based on Text for Vietnamese Sentiment Analysis. In Proceedings of the 11th International Conference on Advances in Information Technology, IAIT ’20, Bangkok, Thailand, 1–3 July 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 1–5. [Google Scholar]
  12. Qiu, S.; Xu, B.; Zhang, J.; Wang, Y.; Shen, X.; de Melo, G.; Long, C.; Li, X. EasyAug: An Automatic Textual Data Augmentation Platform for Classification Tasks. In Companion Proceedings of the Web Conference 2020; WWW ’20; Association for Computing Machinery: New York, NY, USA, 2020; pp. 249–252. [Google Scholar]
  13. Sennrich, R.; Haddow, B.; Birch, A. Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016; Erk, K., Smith, N.A., Eds.; Association for Computational Linguistics: Berlin, Germany, 2016; pp. 86–96. [Google Scholar]
  14. Wu, X.; Lv, S.; Zang, L.; Han, J.; Hu, S. Conditional BERT Contextual Augmentation. In Computational Science–ICCS 2019; Rodrigues, J.M.F., Cardoso, P.J.S., Monteiro, J., Lam, R., Krzhizhanovskaya, V.V., Lees, M.H., Dongarra, J.J., Sloot, P.M.A., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 84–95. [Google Scholar]
  15. Shi, B.; Zhang, L.; Huang, J.; Zheng, H.; Wan, J.; Zhang, L. MDA: An Intelligent Medical Data Augmentation Scheme Based on Medical Knowledge Graph for Chinese Medical Tasks. Appl. Sci. 2022, 12, 10655. [Google Scholar] [CrossRef]
  16. Xiang, R.; Chersoni, E.; Lu, Q.; Huang, C.-R.; Li, W.; Long, Y. Lexical Data Augmentation for Sentiment Analysis. J. Assoc. Inf. Sci. Technol. 2021, 72, 1432–1447. [Google Scholar] [CrossRef]
  17. Kumar, A.; Bhattamishra, S.; Bhandari, M.; Talukdar, P. Submodular Optimization-Based Diverse Paraphrasing and Its Effectiveness in Data Augmentation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 3609–3619. [Google Scholar]
  18. Liu, T.; Sun, Y. End-to-End Adversarial Sample Generation for Data Augmentation. In Findings of the Association for Computational Linguistics: EMNLP 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 11359–11368. [Google Scholar]
  19. Piedboeuf, F.; Langlais, P. Is ChatGPT the Ultimate Data Augmentation Algorithm? In Findings of the Association for Computational Linguistics: EMNLP 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 15606–15615. [Google Scholar]
  20. Ubani, S.; Polat, S.O.; Nielsen, R. ZeroShotDataAug: Generating and Augmenting Training Data with ChatGPT. arXiv 2023, arXiv:2304.14334. [Google Scholar]
  21. Zhao, H.; Chen, H.; Ruggles, T.A.; Feng, Y.; Singh, D.; Yoon, H.-J. Improving Text Classification with Large Language Model-Based Data Augmentation. Electronics 2024, 13, 2535. [Google Scholar] [CrossRef]
  22. Peng, L.; Zhang, Y.; Shang, J. Controllable Data Augmentation for Few-Shot Text Mining with Chain-of-Thought Attribute Manipulation. In Findings of the Association for Computational Linguistics: ACL 2024; Ku, L.-W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 1–16. [Google Scholar]
  23. Liu, Y.; Sharma, P.; Oswal, M.J.; Xia, H.; Huang, Y. PersonaFlow: Boosting Research Ideation with LLM-Simulated Expert Personas. arXiv 2024, arXiv:2409.12538v1. [Google Scholar]
  24. Li, Z.; Chang, Y.; Le, X. Simulating Expert Discussions with Multi-Agent for Enhanced Scientific Problem Solving. In Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024), Bangkok, Thailand, 16 August 2024; Ghosal, T., Singh, A., Waard, A., Mayr, P., Naik, A., Weller, O., Lee, Y., Shen, S., Qin, Y., Eds.; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 243–256. [Google Scholar]
  25. Liu, J.; Koopman, B.; Brown, N.J.; Chu, K.; Nguyen, A. Generating Synthetic Clinical Text with Local Large Language Models to Identify Misdiagnosed Limb Fractures in Radiology Reports. Artif. Intell. Med. 2025, 159, 103027. [Google Scholar] [CrossRef] [PubMed]
  26. Wei, Y.; Li, Q.; Pillai, J. Structured LLM Augmentation for Clinical Information Extraction. Stud. Health Technol. Inform. 2025, 329, 971–976. [Google Scholar] [PubMed]
  27. Šuvalov, H.; Lepson, M.; Kukk, V.; Malk, M.; Ilves, N.; Kuulmets, H.-A.; Kolde, R. Using Synthetic Health Care Data to Leverage Large Language Models for Named Entity Recognition: Development and Validation Study. J. Med. Internet Res. 2025, 27, e66279. [Google Scholar] [CrossRef] [PubMed]
  28. Badirli, S.; Liu, X.; Xing, Z.; Bhowmik, A.; Doan, K.; Keerthi, S.S. Gradient Boosting Neural Networks: GrowNet. arXiv 2020, arXiv:2002.07971. [Google Scholar] [CrossRef]
  29. Naver-Hyperclovax. HyperCLOVAX-SEED-Vision-Instruct-3B. Hugging Face. Available online: https://huggingface.co/naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B (accessed on 17 July 2025).
  30. Google. Gemma Kaggle. Available online: https://www.kaggle.com/models/google/gemma-2 (accessed on 17 July 2025).
  31. Freund, Y.; Schapire, R.E. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J. Comput. Syst. Sci. 1997, 55, 119–139. [Google Scholar] [CrossRef]
  32. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 785–794. [Google Scholar]
  33. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 1135–1144. [Google Scholar]
Figure 1. The proposed method. First, we input an original machine-generated ECG report into a large language model (LLM) with the instruction to “Rewrite from the perspective of five experts.” The LLM then generated five rewritten texts, each reflecting a specialist’s viewpoint. We used these generated interpretations to build the ‘Generated ECG Interpretation Dataset’. We then trained a BiomedBERT-based Gradient Boosting classification model on both the original and the generated datasets. Subsequently, we selected the threshold that achieved the highest F1-score on the validation set as the optimal threshold. Finally, we applied this optimal threshold to perform binary classification for the presence of a stroke.
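For concreteness, the threshold-selection step in Figure 1 can be written compactly. The following is a minimal sketch, assuming probability outputs from the trained classifier and a held-out validation set; the variable names (`y_val`, `val_probs`) are illustrative and not the authors’ code.

```python
# Sketch of the Figure 1 threshold-selection step: pick the probability
# cutoff that maximizes F1 on the validation set, then reuse it at test time.
import numpy as np
from sklearn.metrics import f1_score

def select_optimal_threshold(y_val, val_probs):
    """Return the cutoff in (0, 1) that maximizes validation F1."""
    thresholds = np.linspace(0.01, 0.99, 99)
    f1s = [f1_score(y_val, (val_probs >= t).astype(int)) for t in thresholds]
    return thresholds[int(np.argmax(f1s))]

# Usage (hypothetical arrays):
# best_t = select_optimal_threshold(y_val, val_probs)
# y_pred = (test_probs >= best_t).astype(int)  # final stroke / no-stroke call
```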
Figure 2. The upper panel shows a lead I ECG signal from a patient with stroke, and the lower panel shows a lead I ECG signal from a healthy subject. These visualizations illustrate the differences in raw ECG waveforms between the two classes used in this study.
Figure 3. (a) GrowNet architecture. It consists of weak learners based on a multi-layer perceptron. From the second weak learner onward, the model concatenates the original input with the penultimate-layer features of the preceding weak learner for training. (b) Structure of the BiomedBERT-based Gradient Boosting Classification Model. It consists of weak learners based on BiomedBERT.
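The boosting scheme in Figure 3a can be outlined as follows. This is an illustrative PyTorch sketch of a GrowNet-style forward pass with MLP weak learners; the paper’s model (Figure 3b) swaps in BiomedBERT-based learners, and the hidden size and boost rate here are assumptions, not the authors’ settings.

```python
# GrowNet-style boosting sketch: weak-learner logits are summed, and every
# learner after the first consumes the raw input concatenated with the
# penultimate-layer features of the previous learner (Figure 3a).
import torch
import torch.nn as nn

class WeakLearner(nn.Module):
    def __init__(self, in_dim, hidden_dim=32):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.out = nn.Linear(hidden_dim, 1)  # one logit per example

    def forward(self, x):
        penult = self.hidden(x)          # penultimate features, reused downstream
        return self.out(penult), penult

def grownet_forward(learners, x, boost_rate=1.0):
    logit, penult = learners[0](x)
    total = logit
    for learner in learners[1:]:
        logit, penult = learner(torch.cat([x, penult], dim=1))
        total = total + boost_rate * logit
    return total  # pass through a sigmoid for the stroke probability

# Construction (hypothetical dimensions): the first learner sees x alone,
# the rest see x plus the 32-dim penultimate features:
# learners = [WeakLearner(x_dim)] + [WeakLearner(x_dim + 32) for _ in range(4)]
```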
Figure 4. Comparison of classification performance between the model trained solely on original ECG interpretation data and the model trained on a combination of original and augmented reports generated through the proposed Multi-Expert Perspective Augmentation method. Evaluation metrics include accuracy, precision, recall, F1-score, and AUROC. Bars represent mean values, with error bars indicating 95% confidence intervals.
Figure 5. Comparison of classification performance across different augmentation techniques: Chain-of-Thought Attribute Manipulation (CoTAM), LLM-based Data Augmentation (DA), and the proposed Multi-Expert Perspective Augmentation. Evaluation metrics include accuracy, precision, recall, F1-score, and AUROC. Bars represent mean values, with error bars indicating 95% confidence intervals.
Figure 6. Comparison of classification performance using augmented datasets generated by different large language models (LLMs): HyperCLOVA X Seed 3B, Gemma-2 9B, and Phi-4 14B. Evaluation metrics include accuracy, precision, recall, F1-score, and AUROC. Bars represent mean values, with error bars indicating 95% confidence intervals.
Figure 7. Comparison of classification performance between the proposed BiomedBERT-based Gradient Boosting model and traditional boosting methods (AdaBoost and XGBoost). Evaluation metrics include accuracy, precision, recall, F1-score, and AUROC. Bars represent mean values, with error bars indicating 95% confidence intervals.
Figure 8. LIME visualizations for representative Stroke (upper) and Normal (lower) samples. The upper panel shows a Stroke sample. Words such as “ST depression”, “junctional”, and “variant” contributed strongly to the Stroke classification, although repeated mentions of “normal” partially influenced the prediction toward the Normal class. The lower panel shows a Normal sample. In this case, terms such as “normal” and “ECG” contributed positively to the Normal class, while “sinus” was weakly associated with Stroke. The color of each highlighted word indicates its contribution to the final prediction: blue for the “Normal” class and orange for the “Stroke” class. The intensity of the color corresponds to the magnitude of the contribution.
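The word-level attributions in Figure 8 are the standard output of a LIME text explainer. The sketch below assumes the open-source `lime` package and a hypothetical `predict_proba_texts` wrapper around the trained classifier; it is an outline of how such views can be produced, not the authors’ exact pipeline.

```python
# Producing Figure 8-style word contributions with LIME. The classifier is
# wrapped so that it maps a list of report strings to class probabilities.
from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=["Normal", "Stroke"])

def explain_report(report_text, predict_proba_texts, num_features=10):
    # predict_proba_texts: List[str] -> np.ndarray of shape (n_samples, 2)
    exp = explainer.explain_instance(
        report_text, predict_proba_texts, num_features=num_features
    )
    return exp.as_list()  # [(word, signed contribution), ...]

# Example (hypothetical report and wrapper):
# explain_report("Sinus rhythm ST depression Abnormal ECG", predict_proba_texts)
```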
Table 1. The stroke example includes findings such as “Normal sinus rhythm, Inferior infarct, age undetermined, Abnormal ECG”, whereas the normal example contains standard phrases such as “Normal sinus rhythm, Normal ECG.” These examples demonstrate the concise and structured style of machine-generated reports used in this study.
Normal | Stroke
Normal sinus rhythm, Normal ECG | Normal sinus rhythm, Inferior infarct, age undetermined, Abnormal ECG
Table 2. Example of the augmented text.
Text Type | Text
Original text | Sinus tachycardia Low voltage QRS Borderline ECG
Cerebrovascular Specialist | Sinus tachycardia observed. Notable low voltage QRS amplitude, suggesting potential cardiac involvement or volume depletion. ECG borderline, further cerebrovascular evaluation may be warranted.
Neurologist | Sinus tachycardia detected. Low voltage QRS complex indicates possible neurological-cardiac interaction. Borderline ECG findings, recommend monitoring for related neurological implications.
Cardiologist | Presentation of sinus tachycardia with low voltage QRS complexes. ECG findings are borderline, indicating a need for further cardiac assessment and potential imaging.
Radiologist | Sinus tachycardia with low voltage QRS complex. ECG borderline, suggesting need for diagnostic imaging to evaluate cardiac structure and possible underlying conditions.
Circulatory Specialist | Sinus tachycardia identified. Low voltage QRS complexes and borderline ECG may indicate circulatory system disturbances. Further hemodynamic evaluation recommended.
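The rewriting behind Table 2 amounts to one persona-conditioned prompt per specialist. Below is a minimal sketch assuming a generic prompt template; the five roles come from Table 2, but the exact wording of the paper’s Phi-4 prompt is not reproduced here, so treat the template as an assumption.

```python
# Building one rewriting prompt per simulated expert. Each prompt would then
# be sent to the LLM (Phi-4 in the paper) to obtain one rewritten report,
# as in the rows of Table 2.
EXPERTS = [
    "Cerebrovascular Specialist",
    "Neurologist",
    "Cardiologist",
    "Radiologist",
    "Circulatory Specialist",
]

def build_prompts(report: str) -> dict[str, str]:
    """One persona-conditioned prompt per expert, preserving the findings."""
    return {
        role: (
            f"Rewrite the following ECG interpretation from the perspective "
            f"of a {role}. Keep every clinical finding; do not add new "
            f"diagnoses.\n\nReport: {report}"
        )
        for role in EXPERTS
    }

# prompts = build_prompts("Sinus tachycardia Low voltage QRS Borderline ECG")
```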
Table 3. Comparison of classification performance between the model trained solely on original ECG interpretation data and the model trained on original data augmented with expert perspective rewritten texts (proposed method). Evaluation metrics include accuracy, precision, recall, F1-score, and AUROC. Values in parentheses indicate the 95% confidence interval.
Dataset | Accuracy | Precision | Recall | F1-Score | AUROC
Original Data | 0.5496 (0.4953–0.6012) | 0.5549 (0.5000–0.6062) | 0.9827 (0.9609–1.0000) | 0.7089 (0.6625–0.7510) | 0.6290 (0.5675–0.6911)
+ Expert Augmentation (Ours) | 0.8434 (0.8037–0.8816) | 0.8373 (0.7857–0.8876) | 0.8935 (0.8491–0.9351) | 0.8642 (0.8264–0.8985) | 0.9025 (0.8690–0.9319)
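The parenthesized intervals in Tables 3–6 can be estimated by resampling. The sketch below assumes bootstrap resampling of test-set predictions, a common choice for such intervals; the authors’ exact procedure may differ.

```python
# Bootstrap 95% confidence interval for a classification metric: resample the
# test predictions with replacement and take the 2.5th/97.5th percentiles.
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_ci(y_true, y_pred, metric=f1_score, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample indices
        scores.append(metric(y_true[idx], y_pred[idx]))
    lo, hi = np.percentile(scores, [2.5, 97.5])
    return float(lo), float(hi)

# lo, hi = bootstrap_ci(y_test, y_pred)  # e.g., F1 interval for one model
```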
Table 4. Classification performance of models trained on datasets generated using different augmentation techniques: Chain-of-Thought Attribute Manipulation (CoTAM), LLM-based Data Augmentation (DA), and the proposed Multi-Expert Perspective Augmentation. Evaluation metrics include accuracy, precision, recall, F1-score, and AUROC. Values in parentheses indicate the 95% confidence interval.
Augmentation Technique | Accuracy | Precision | Recall | F1-Score | AUROC
CoTAM [22] | 0.7653 (0.7165–0.8100) | 0.7359 (0.6786–0.7917) | 0.9052 (0.8587–0.9441) | 0.8114 (0.7684–0.8523) | 0.7950 (0.7391–0.8395)
DA [21] | 0.7484 (0.6978–0.7944) | 0.7920 (0.7262–0.8483) | 0.7460 (0.6793–0.8103) | 0.7678 (0.7138–0.8134) | 0.8887 (0.8499–0.9211)
Expert (Ours) | 0.8434 (0.8037–0.8816) | 0.8373 (0.7857–0.8876) | 0.8935 (0.8491–0.9351) | 0.8642 (0.8264–0.8985) | 0.9025 (0.8690–0.9319)
Table 5. Classification performance of models trained on datasets augmented using different large language models, HyperCLOVA X Seed 3B, Gemma-2 9B, and Phi-4 14B, all following the Multi-Expert Perspective Augmentation approach. Evaluation metrics include accuracy, precision, recall, F1-score, and AUROC. Values in parentheses indicate the 95% confidence interval.
Model | Accuracy | Precision | Recall | F1-Score | AUROC
HyperCLOVA X Seed 3B [29] | 0.7137 (0.6604–0.7602) | 0.7431 (0.6743–0.8022) | 0.7458 (0.6833–0.8125) | 0.7439 (0.6869–0.7926) | 0.8600 (0.8189–0.8984)
Gemma-2 9B [30] | 0.8344 (0.7944–0.8723) | 0.8245 (0.7704–0.8733) | 0.8943 (0.8483–0.9344) | 0.8576 (0.8195–0.8911) | 0.8745 (0.8365–0.9089)
Phi-4 14B [8] | 0.8434 (0.8037–0.8816) | 0.8373 (0.7857–0.8876) | 0.8935 (0.8491–0.9351) | 0.8642 (0.8264–0.8985) | 0.9025 (0.8690–0.9319)
Table 6. Classification performance of the proposed BiomedBERT-based Gradient Boosting model compared to traditional ensemble methods AdaBoost and XGBoost. Metrics include accuracy, precision, recall, F1-score, and AUROC for stroke prediction from ECG reports. Values in parentheses indicate the 95% confidence interval.
Model | Accuracy | Precision | Recall | F1-Score | AUROC
AdaBoost | 0.5386 (0.4825–0.5944) | 0.7847 (0.5383–1.0000) | 0.0794 (0.0362–0.1259) | 0.1434 (0.0685–0.2180) | 0.5319 (0.4588–0.6004)
XGBoost | 0.5836 (0.5280–0.6399) | 0.9184 (0.7894–1.0000) | 0.1639 (0.1026–0.2333) | 0.2769 (0.1840–0.3736) | 0.5312 (0.4583–0.6002)
Ours | 0.8434 (0.8037–0.8816) | 0.8373 (0.7857–0.8876) | 0.8935 (0.8491–0.9351) | 0.8642 (0.8264–0.8985) | 0.9025 (0.8690–0.9319)
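For reference, a baseline like the XGBoost row above is typically built over a bag-of-words representation of the report text. The sketch below assumes TF-IDF features, which Table 6 does not specify, so it should be read as an illustrative setup rather than the paper’s configuration.

```python
# Hedged sketch of a traditional boosting baseline for report text:
# TF-IDF features (an assumption) fed into an XGBoost classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier

def train_xgb_baseline(train_texts, y_train):
    vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
    X = vec.fit_transform(train_texts)  # sparse matrix is fine for XGBoost
    clf = XGBClassifier(n_estimators=300, max_depth=4, eval_metric="logloss")
    clf.fit(X, y_train)
    return vec, clf

# vec, clf = train_xgb_baseline(train_texts, y_train)
# probs = clf.predict_proba(vec.transform(test_texts))[:, 1]
```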