Article

Development and Validation of a Questionnaire to Evaluate AI-Generated Summaries for Radiologists: ELEGANCE (Expert-Led Evaluation of Generative AI Competence and ExcelleNCE)

by Yuriy A. Vasilev, Anton V. Vladzymyrskyy, Olga V. Omelyanskaya, Yulya A. Alymova, Dina A. Akhmedzyanova, Yuliya F. Shumskaya, Maria R. Kodenko, Ivan A. Blokhin * and Roman V. Reshetnikov
Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Health Care Department, 127051 Moscow, Russia
* Author to whom correspondence should be addressed.
AI 2025, 6(11), 287; https://doi.org/10.3390/ai6110287
Submission received: 19 September 2025 / Revised: 10 October 2025 / Accepted: 3 November 2025 / Published: 5 November 2025

Abstract

Background/Objectives: Large language models (LLMs) are increasingly considered for use in radiology, including the summarization of patient medical records to support radiologists in processing large volumes of data under time constraints. This task requires not only accuracy and completeness but also clinical applicability. Automatic metrics and general-purpose questionnaires fail to capture these dimensions, and no standardized tool currently exists for the expert evaluation of LLM-generated summaries in radiology. Here, we aimed to develop and validate such a tool. Methods: Items for the questionnaire were formulated and refined through focus group testing with radiologists. Validation was performed on 132 LLM-generated summaries of 44 patient records, each independently assessed by radiologists. Criterion validity was evaluated through known-group differentiation and construct validity through confirmatory factor analysis. Results: The resulting seven-item instrument, ELEGANCE (Expert-Led Evaluation of Generative AI Competence and Excellence), demonstrated excellent internal consistency (Cronbach’s α = 0.95). It encompasses seven dimensions: relevance, completeness, applicability, falsification, satisfaction, structure, and correctness of language and terminology. Confirmatory factor analysis supported a two-factor structure (content and form), with strong fit indices (RMSEA = 0.079, CFI = 0.989, TLI = 0.982, SRMR = 0.029). Criterion validity was confirmed by significant between-group differences (p < 0.001). Conclusions: ELEGANCE is the first validated tool for expert evaluation of LLM-generated medical record summaries for radiologists, providing a standardized framework to ensure quality and clinical utility.

1. Introduction

Modern large language models (LLMs) are increasingly applied in radiology to address a wide range of tasks. They have been used to simplify radiological texts for patients, identify pathological signs in imaging studies, support differential diagnosis, extract structured data from reports, generate radiology reports, classify studies according to diagnostic scales, automate data processing and interpretation, predict disease outcomes, and detect errors in radiology reports [1]. LLMs are also being explored for summarizing medical data [2]—a task of particular importance given the need for radiologists to process vast amounts of patient information within limited time frames.
Working with medical documentation presents distinct challenges, particularly for radiologists who report remotely and may have no direct contact with either patients or referring physicians [3]. Summaries of medical histories must clearly reflect the diagnostic purpose of the study and the characteristics of the patient. This imposes additional requirements not only on the quality of the summaries generated by LLMs—ensuring they are relevant, accurate, and complete—but also on the tools used to assess that quality.
Structured summaries produced by LLMs have the potential to substantially optimize radiologists’ workflows by reducing the time spent reviewing medical records. Some studies suggest that LLMs can generate summaries with efficiency comparable to human experts [4,5]. However, as Tang et al. have shown, the use of LLMs in medicine carries serious risks: models are prone to factual errors, distorted conclusions, and omissions of clinically significant details [6]. Particularly concerning is their capacity to produce text that appears plausible but is unsupported by the underlying data—so-called “hallucinations”—which can pose risks to patient safety [7]. This stems from the fact that even when generating coherent and contextually appropriate text, LLMs do so without genuine semantic understanding of the source material.
The performance of LLMs is often evaluated using statistical text-similarity metrics, such as ROUGE-1, ROUGE-2, ROUGE-L [8], BLEU [9], and METEOR [10]. More recently, BERTScore, which measures semantic similarity using vector representations derived from pre-trained BERT or related models, has also been adopted [11]. These approaches, however, rely on comparison with reference summaries, and their values are highly sensitive to lexical and semantic variations that may not critically affect meaning. Moreover, such metrics cannot assess the clinical applicability or reliability of summaries—qualities that are essential in a medical context [12,13]. For this reason, expert evaluation by medical professionals remains the benchmark for assessing LLM outputs, including medical summarization tasks [14,15].
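For illustration, the lexical-overlap metrics discussed above can be computed in a few lines of Python; the sketch below uses the rouge-score and NLTK packages, and the reference and candidate strings are hypothetical examples:

```python
# Minimal sketch of reference-based summary metrics (ROUGE and BLEU).
# The rouge-score and nltk packages are assumed installed; the reference
# and candidate summaries are hypothetical examples.
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "Severe diffuse abdominal pain intensifying after meals, with nausea and weakness."
candidate = "Diffuse abdominal pain that worsens after eating; the patient reports nausea and general weakness."

# ROUGE measures n-gram (rouge1, rouge2) and longest-common-subsequence (rougeL) overlap.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
for name, score in scorer.score(reference, candidate).items():
    print(f"{name}: F1 = {score.fmeasure:.3f}")

# BLEU measures n-gram precision; smoothing avoids zero scores on short texts.
bleu = sentence_bleu([reference.split()], candidate.split(),
                     smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {bleu:.3f}")
```

Both scores hinge on surface overlap, so clinically equivalent but differently worded summaries can score poorly, which is precisely why such metrics cannot replace expert review.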
To date, no single standardized approach exists for expert evaluation of LLM-generated medical text. In response, Tam et al. proposed the QUEST framework (Quality of Information, Understanding and Reasoning, Expression Style and Persona, Safety and Harm, Trust and Confidence), which outlines principles for human evaluation of LLM performance [16]. Each principle includes specific dimensions, such as applicability and relevance, which can be combined in different ways to inform expert assessment tools tailored to particular scenarios.
One such medical evaluation tool is the Quality Analysis of Medical Artificial Intelligence (QAMAI) questionnaire, comprising six items rated on a five-point Likert scale [17]. Although developed independently of QUEST, QAMAI includes overlapping dimensions such as accuracy, clarity, relevance, completeness, source availability, and usefulness. It was designed to assess LLM responses to questions posed by both specialist physicians (e.g., maxillofacial surgeons, otolaryngologists, and head and neck surgeons) and their patients, with validation conducted by experts in those fields. However, QAMAI has notable limitations: it lacks assessment of key aspects such as physicians’ satisfaction with, or trust in, LLM outputs and does not account for the presence of artificial intelligence (AI) hallucinations. Furthermore, it was designed for evaluating answers to medical questions rather than summarization tasks, particularly those specific to radiology.
The absence of a specialized, robust questionnaire for expert assessment of LLM-generated summaries of patient medical data remains a critical gap. The objective of this study was therefore to develop and validate a questionnaire tailored to the expert evaluation of LLM-based summarization of medical records for radiologists.

2. Materials and Methods

The design, development, and validation process of the questionnaire is summarized in Figure 1. This study was conducted in three stages.

2.1. Study Participants

The working group responsible for questionnaire development comprised three clinicians, three radiologists, and two specialists with expertise in AI.
The panel of experts included five radiologists and three AI specialists. A focus group of 10 radiologists working in teleradiology [3] contributed to item refinement. Pilot testing and the subsequent assessments of reliability and validity were likewise conducted by radiologists working in teleradiology [3].
All participating radiologists had a minimum of three years’ professional experience.

2.2. Stage One: Preparatory

Based on the QUEST framework [16], we identified key dimensions relevant to expert evaluation of LLM-generated medical data summaries. The dimension selection process is illustrated in Figure 2.
The final questionnaire comprises the following dimensions:
- Relevance: alignment of the response with the query;
- Completeness: inclusion of all necessary information;
- Applicability: the practical value of the response to the user;
- Falsification: presence of false information and/or hallucinations;
- Satisfaction: the respondent’s overall satisfaction with the quality of the summary;
- Structure: presentation of the text as a coherent and logically organized unit;
- Correctness of language and terminology: accuracy of the language and professional terminology used.
Prior studies highlight the relationship between response format [18], namely its structure [19] and grammatical errors [20], and how the text is perceived by human readers. We therefore added the dimensions of structure and correctness of language and terminology, expanding the QUEST framework.

2.3. Stage Two: Creating a Questionnaire

Initial questions were drafted for each questionnaire dimension and subsequently refined through in-depth interviews with experts, which confirmed the accuracy of the wording and led to adjustments to the two new items.
The assessment options were taken from the previously mentioned work of Tam et al. [16]. The dimension Falsification was assessed using a binary yes/no scale, whereas the remaining dimensions employed a five-point Likert scale, selected for its advantages over alternative options [21,22].
The preliminary questionnaire was then tested in a focus group to evaluate usability, clarity of wording, and appropriateness of response options. The focus group participants were offered two answer formats for the Likert scale:
(1) Textual descriptors assigned to each score;
(2) Textual descriptors provided only for the polar responses, with intermediate scores chosen by proximity.
The first version, with descriptors for all response categories, was judged the most convenient (7 votes vs. 3).
Pilot testing was subsequently performed to assess the reliability of the developed tool. Participants were provided with patient medical records containing standard sections (demographics, medical history, complaints, laboratory and instrumental data, etc.). Summaries were automatically generated using DeepSeek Chat (DeepSeek-V3, DeepSeek AI) with default settings, prompted by the request: “Summarize the medical records of the patient referred for an abdominal CT scan to facilitate preparation of the radiology report.”
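For reproducibility, summary generation of this kind can be scripted; the sketch below assumes DeepSeek’s OpenAI-compatible chat API (base URL and model name as given in DeepSeek’s public documentation), with the record text as a placeholder:

```python
# Sketch of generating a pilot-style summary with DeepSeek Chat (DeepSeek-V3)
# through its OpenAI-compatible API. The endpoint and model name follow
# DeepSeek's public documentation; the medical record text is a placeholder.
# Default sampling settings are used, as in the pilot study.
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY",
                base_url="https://api.deepseek.com")

PROMPT = ("Summarize the medical records of the patient referred for an "
          "abdominal CT scan to facilitate preparation of the radiology report.")

medical_record = "..."  # full de-identified patient record text goes here

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": f"{PROMPT}\n\n{medical_record}"}],
)
print(response.choices[0].message.content)
```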
The generated summaries were manually checked for fidelity to the original content by members of the working group and, if necessary, corrected. Radiologists participating in the pilot study then evaluated the summaries using the questionnaire. To assess test–retest reliability, each participant completed the questionnaire twice with a 10–14 day washout period, using the same “medical records–summary” pair on both occasions.

2.4. Stage Three: Questionnaire Validation

Following pilot testing, a validation study was conducted to assess the questionnaire’s reliability and validity. The validation study design is shown in Figure 3.
A total of 44 medical records were used in the validation study. Based on these records, the structure of which is presented at the top of Figure 3, a physician prepared summaries of three quality categories from a single prompt using DeepSeek Chat. The prompt was developed by consensus of the working group and focus group members:
“The patient was referred for an abdominal CT scan. Extract from the provided text the information that will be useful for a radiologist to prepare the abdominal CT report, following the scheme below:
- Complaints prompting the abdominal CT scan.
- Medical history.
- Patient medical history (comorbidities, lifestyle factors, family history, surgeries).
- Laboratory findings.
- Instrumental findings.
Do not include in your response data that is not relevant to the preparation of the abdominal CT radiology report.”
The primary goal was to generate a good-quality summary—fully consistent with the original data, error-free, clearly structured, and emphasizing information relevant to the radiologist. From this, a moderate-quality summary was derived by omitting certain details and adding redundant information, without significant distortion of the meaning. A poor-quality summary was created by excluding critical information, introducing factual inaccuracies, and presenting the text in an unstructured format. All summaries were reviewed and, where necessary, revised by a physician.
Each summary was then verified by a radiologist to ensure consistency with the source medical records and clinical relevance of the information. Where necessary, the radiologist independently corrected the text.
An example of the original medical record data and the corresponding summaries of differing quality is provided in Appendix A, Table A1.
In the final step, the verified summaries, together with the source records and the questionnaire, were distributed to validation study participants for assessment.

2.5. Sample Size Calculation for the Validation Study

The minimum sample size for estimating a latent variable based on six observed variables (alpha 0.05, power 0.8) was 200 estimates [23]. The sample size calculation formulas are provided in the Supplementary File. The target number of summary estimates was adjusted to 300 as recommended by Comrey and Lee, 1992 [24].

2.6. Statistical Data Analysis

The reliability of the questionnaire was assessed through test–retest reliability and internal consistency (Table 1). Test–retest reliability was evaluated by calculating rank correlations between the initial and repeated completion of the questionnaire, separated by a 10–14 day interval. Internal consistency was determined using Cronbach’s α.
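Both reliability statistics are simple to compute; the sketch below uses NumPy and SciPy, with simulated rating matrices standing in for the real data:

```python
# Sketch of the reliability analyses: Spearman rank correlation for
# test-retest reliability and Cronbach's alpha for internal consistency.
# The ratings below are simulated stand-ins for the real questionnaire data.
import numpy as np
from scipy.stats import spearmanr

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a respondents-by-items matrix of scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
# Simulate 12 respondents x 7 items: a latent per-summary quality plus
# item-level noise, so the items are correlated as in a real instrument.
latent = rng.integers(1, 6, size=(12, 1))
scores = np.clip(latent + rng.integers(-1, 2, size=(12, 7)), 1, 5)

test = scores.sum(axis=1)                      # total scores, first pass
retest = test + rng.integers(-2, 3, size=12)   # second pass after washout

rho, p = spearmanr(test, retest)
print(f"Test-retest Spearman rho = {rho:.2f} (p = {p:.3f})")
print(f"Cronbach's alpha = {cronbach_alpha(scores):.2f}")
```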
As no established instruments exist for expert evaluation of summaries, the validity of the tool was assessed using differentiation by known groups. Since summaries of good, moderate, and poor quality had been prepared for each medical record, validity was tested by examining differences in mean scores across these quality groups. Construct validity was further evaluated by confirmatory factor analysis.
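Known-group differentiation maps onto standard nonparametric tests; a sketch with SciPy follows, in which the total-score arrays for the three quality groups are hypothetical:

```python
# Sketch of known-group differentiation: a Kruskal-Wallis test across the
# three quality groups, followed by pairwise Mann-Whitney U tests with
# Bonferroni correction. The total scores below are hypothetical.
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

groups = {
    "good":     [27, 28, 25, 29, 26, 30, 27],
    "moderate": [19, 21, 18, 22, 20, 19, 23],
    "poor":     [9, 11, 8, 12, 10, 7, 13],
}

h, p = kruskal(*groups.values())
print(f"Kruskal-Wallis: H = {h:.2f}, p = {p:.4f}")

pairs = list(combinations(groups, 2))
for a, b in pairs:
    _, p_raw = mannwhitneyu(groups[a], groups[b], alternative="two-sided")
    p_adj = min(p_raw * len(pairs), 1.0)  # Bonferroni adjustment
    print(f"{a} vs. {b}: adjusted p = {p_adj:.4f}")
```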

3. Results

The ELEGANCE (Expert-Led Evaluation of Generative AI Competence and Excellence) questionnaire was developed to assess the performance of LLMs in summarizing medical records for radiologists (Appendix B). It consists of seven questions, with the structure shown in Table 2.

3.1. Pilot Testing

Pilot testing was conducted on a sample of 12 respondents. The distribution of questionnaire scores for the evaluated summaries is shown in Figure 4. Total scores ranged from 17 to 27, with a median of 24 (IQR 20–25).
The questionnaire showed satisfactory test–retest reliability (ρ = 0.57), which is considered a good result for instruments that may depend on respondents’ attitudes towards the subject of evaluation [27]. Pilot testing also demonstrated acceptable internal consistency (Cronbach’s α = 0.77; 95% CI, 0.61–0.86).

3.2. Validation Study

Scores from the validation study ranged from 6 to 30, with a median of 19 (IQR 13–24). The questionnaire demonstrated excellent internal consistency (Cronbach’s α = 0.95; 95% CI 0.94–0.96). Univariate nonparametric analysis of variance (Kruskal–Wallis test) showed significant differences across summary quality groups (p < 0.001), confirming the tool’s validity. Pairwise comparisons between high-, moderate-, and low-quality summaries demonstrated significant differences between all pairs of groups (Mann–Whitney tests with Bonferroni correction; all p < 0.001).
Figure 5 presents the distribution of total scores for summaries of known quality, as assessed with the developed tool, shown as box-and-whisker plots.
Construct validity was assessed using confirmatory factor analysis. A two-factor model was selected, in which the Structure and Correctness of language and terminology dimensions formed a separate factor (the distribution of dimensions across factors is presented in Table 3). The two-factor model demonstrated good fit (RMSEA = 0.079, CFI = 0.989, TLI = 0.982, and SRMR = 0.029). A one-factor model, tested as a baseline for comparison, fit the data less well (RMSEA = 0.116, CFI = 0.96, TLI = 0.97, and SRMR = 0.04). The factor loadings for the two-factor model are presented in Table 3. The loading of the Falsification dimension was 0.45, noticeably lower than those of the other dimensions; nevertheless, it indicates a significant relationship between this item and its factor.
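The two-factor model can be expressed in lavaan-style syntax; a minimal sketch with the semopy package follows, assuming a ratings table with one column per ELEGANCE dimension (file and column names are hypothetical):

```python
# Sketch of the two-factor CFA: a "content" factor for five dimensions and a
# "form" factor for structure and language correctness. The semopy package is
# assumed installed, and elegance_ratings.csv is a hypothetical file with one
# column per questionnaire item.
import pandas as pd
from semopy import Model, calc_stats

MODEL_DESC = """
content =~ relevance + completeness + applicability + falsification + satisfaction
form    =~ structure + language_correctness
content ~~ form
"""

data = pd.read_csv("elegance_ratings.csv")

model = Model(MODEL_DESC)
model.fit(data)

stats = calc_stats(model)  # fit indices, including RMSEA, CFI, and TLI
print(stats[["RMSEA", "CFI", "TLI"]])
```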

4. Discussion

In this study, we developed and validated the ELEGANCE questionnaire to assess the quality of LLM-generated medical text summaries for radiologists. Such a tool is particularly relevant given the growing use of LLMs in medicine, where the accuracy and reliability of generated text have direct implications for the quality of medical care [12]. As emphasized by Tam et al. [16], expert evaluation remains the benchmark for verifying LLM performance.
The ELEGANCE questionnaire includes seven dimensions, encompassing both the content (relevance, completeness, applicability, falsification, satisfaction) and the formal technical aspects (structure, correctness of language and terminology) of text quality. The developed tool demonstrated excellent internal consistency (Cronbach’s α = 0.95). Such a high level of reliability indicates that all items of the questionnaire consistently measure the same construct and that the variability of responses owes little to random error. The result exceeds the generally accepted threshold for research tools (0.7), supporting the use of the questionnaire both in scientific research and in practical quality assessment of LLM performance.
Construct validity of the questionnaire was demonstrated using confirmatory factor analysis. The best fit was achieved with a two-factor model grouping Structure and Correctness of language and terminology under the domain of formal technical text quality, and the remaining dimensions under textual content quality. The model fit indices were within the recommended values: RMSEA = 0.079 (<0.08 is considered acceptable), CFI = 0.989 (>0.95 is optimal), and SRMR = 0.029 (<0.08 is satisfactory). For comparison, the one-factor model showed worse fit (RMSEA = 0.116, CFI = 0.96), which supports dividing the questionnaire into two distinct but interrelated domains that assess different aspects of summarization quality. We have thus further expanded the QUEST framework [16] with two additional categories: Structure and Correctness of language and terminology.
The Falsification dimension had a lower factor loading than the other items, yet it still demonstrated a significant association with the textual content quality domain. The lower loading may be related to the binary response format and to the difficulty of distinguishing a factual error from a hallucination; it may also reflect the heterogeneity of the phenomenon itself, which ranges from factual errors to logical distortions and incorrect generalizations. Despite this, inclusion of the Falsification dimension remains essential because of its ability to identify LLM “hallucinations” and its clinical relevance for assessing safety. A Likert scale was considered for this item but was ultimately rejected as redundant.
An important indicator of the questionnaire’s validity is its ability to differentiate summaries of varying quality. Nonparametric variance analysis revealed statistically significant differences between groups of different quality (p < 0.001). These findings indicate that the tool is highly sensitive and suitable for evaluating the quality of medical history summaries.
The closest analog to our tool is the Quality Analysis of Medical Artificial Intelligence (QAMAI) questionnaire [17], developed for expert evaluation of AI-generated medical information. QAMAI was validated on ChatGPT-4 responses in otolaryngology and maxillofacial surgery. Although ELEGANCE was designed for a different purpose, comparison is appropriate given their shared conceptual basis in the QUEST framework [16]. QAMAI demonstrated high internal consistency (α = 0.837), and factor analysis indicated a unidimensional structure (variance explained 51.1%; loadings 0.449–0.856), suggesting that the instrument functions as a single global scale of LLM response quality.
In contrast, ELEGANCE was intentionally developed as a more detailed tool, capturing both content-related and formal technical aspects of text quality across seven dimensions. The factor analysis results indicate the multidimensional nature of response quality and allow the contributions of content and form to the overall assessment to be separated. QAMAI’s unidimensional structure facilitates rapid screening and model-to-model comparisons, whereas ELEGANCE offers greater diagnostic value for identifying specific response weaknesses—for example, distinguishing cases in which strong content is undermined by poor presentation.
Lechien et al. introduced the Artificial Intelligence Performance Instrument (AIPI), validated for evaluating chatbot performance with ChatGPT in clinical otolaryngology cases [28]. AIPI focuses primarily on the correctness of LLM-generated clinical decisions across components such as case features, diagnosis, additional investigations, and treatment, whereas ELEGANCE evaluates the quality of medical text from two points of view: content and form.
The Culturally and Linguistically Equitable AI Review (CLEAR) tool [29] was developed for standardized assessment of the quality of AI-generated medical information, taking cultural and linguistic factors into account. Its structure and validation were oriented toward broad applicability across a wide range of medical disciplines, which ensures high versatility at the cost of more generalized criteria. In contrast, ELEGANCE was created for a highly specialized task: the expert assessment of the quality of medical history summaries intended to support a radiologist in writing a radiology report.
This study has several limitations that should be considered when interpreting the findings. Firstly, the questionnaire was validated using medical record summaries prepared jointly by a physician and a radiologist from the responses of a single LLM in one application scenario (assisting a radiologist in writing a radiology report). This approach ensures a high level of clinical and subject-matter relevance of the summaries; however, it may yield higher and more uniform quality than fully automatic LLM generation, potentially overestimating the questionnaire’s performance during validation. Even so, within radiology, ELEGANCE is broadly applicable across subspecialties and clinical scenarios.
Secondly, the assessment was carried out by experts with specific training (radiologists), which may affect the reproducibility of the results when specialists from other fields of medicine participate. However, this was precisely the purpose of the study: to create a specialized questionnaire.
Thirdly, although the tool demonstrated excellent internal consistency and discriminant validity, inter-expert agreement was not assessed. Separate testing is needed to confirm the stability of scores across different expert groups.
Finally, the questionnaire was developed and validated with artificially determined summary quality (groups known in advance). This facilitated testing of discriminatory ability and yielded high quantitative metrics, but it does not fully reflect the complexity of real clinical scenarios, where text quality can vary more subtly and heterogeneously. Nevertheless, validating the tool required ensuring that ELEGANCE measures what we intend to assess—specifically, the quality of LLM-generated summarization in radiology—and this made it necessary to cover a comprehensive range of potential summary quality levels.
The areas of application of the ELEGANCE questionnaire developed and validated in this study should be expanded, and the tool adapted for in-depth research in other fields of clinical medicine. An important next step will be to integrate text quality assessment with evaluations of specialists’ perception of, and readiness to use, AI in clinical practice. In this context, the previously developed and validated ATRAI-14 questionnaire [30], which measures radiologists’ attitudes towards AI in medical imaging, is of particular relevance. Combined use of these instruments would enable not only objective evaluation of LLM performance but also comparison with specialists’ expectations, trust, and readiness for implementation. Such an integrated approach could reveal relationships between the actual characteristics of LLM-generated text and the barriers to, or drivers of, clinical integration, which is especially important for developing strategies for the safe and effective implementation of AI in radiology.

5. Conclusions

The ELEGANCE questionnaire demonstrates high reliability and validity, confirming its value as a tool for assessing the quality of medical history summaries for radiologists. Its two-factor structure enables evaluation of both the content and the form of summaries. Further research may clarify the role of the Falsification dimension and expand the scope of the questionnaire application.
Our findings provide a solid foundation for a standardized approach to evaluating LLM performance in medicine. Implementation of the tool could support the safe and effective integration of AI technologies into healthcare by ensuring robust quality control of generated medical information. Its potential use not only for auditing but also for training and fine-tuning medical LLMs is of particular value.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/ai6110287/s1, sample size calculation formulas.

Author Contributions

Conceptualization, Y.A.V., A.V.V., Y.A.A., D.A.A., Y.F.S. and R.V.R.; methodology, Y.A.V., A.V.V., Y.A.A., D.A.A., Y.F.S. and R.V.R.; formal analysis, Y.F.S., Y.A.A. and R.V.R.; investigation, Y.A.A., D.A.A. and Y.F.S.; resources, Y.A.V. and A.V.V.; data curation, Y.A.A., I.A.B. and M.R.K.; writing—original draft preparation, Y.A.A., D.A.A., I.A.B., Y.F.S. and R.V.R.; writing—review and editing, Y.A.V., A.V.V., R.V.R. and M.R.K.; supervision, Y.A.V., R.V.R. and A.V.V.; project administration, Y.A.V., A.V.V., O.V.O. and R.V.R.; funding acquisition, Y.A.V., A.V.V., O.V.O. and R.V.R. All authors have read and agreed to the published version of the manuscript.

Funding

This article was prepared by a team of authors within the framework of a scientific and practical project in the field of medicine (No. EGISU: 125051305989-8) “A promising automated workplace of a radiologist based on generative artificial intelligence.”

Institutional Review Board Statement

The Local Ethics Committee of the Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Health Care Department approved this study on 19 June 2025, with approval No. 06/2025.

Informed Consent Statement

Informed consent was obtained from all subjects involved in this study.

Data Availability Statement

The original contributions presented in this study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
LLM: Large Language Model
ELEGANCE: Expert-Led Evaluation of Generative AI Competence and ExcelleNCE
ROUGE: Recall-Oriented Understudy for Gisting Evaluation
BLEU: BiLingual Evaluation Understudy
METEOR: Metric for Evaluation of Translation with Explicit ORdering
QUEST: Quality of Information, Understanding and Reasoning, Expression Style and Persona, Safety and Harm, Trust and Confidence
QAMAI: Quality Analysis of Medical Artificial Intelligence
CT: Computed Tomography
IQR: Interquartile Range
AIPI: Artificial Intelligence Performance Instrument
CLEAR: Culturally and Linguistically Equitable AI Review
ATRAI: Attitude of Radiologists Toward Radiology AI

Appendix A

Table A1. Example of a Patient’s Medical Records with Three Types of Summaries.
Gender, age: Male, 75 years old
Examinations by medical specialists: 10 April 2025
R10.4—Abdominal pain (ICD-10 code).
Examination by a general practitioner.
Primary diagnosis: R10.4—Abdominal pain.
Preliminary diagnosis: Chronic pancreatitis, relapse? Gallstone disease, acute cholecystitis? Peptic duodenal ulcer, relapse?
Complaints: Diffuse, severe abdominal pain, intensifying after the meal, no relief after defecation; nausea, general weakness.
Disease history: Abdominal pain occurred a week before the examination. The patient associates it with a violation of the diet, consumption of fatty food. The patient took ibuprofen 400–800 mg/day for a week for pain relief with a short-term effect. No allergy.
General examination. Heart rate: 94 bpm. Pulse rate: 94 bpm. Respiratory rate: 24 breaths/min. Blood pressure: 135/80 mmHg. Skin: normal color. Edema: absent. Chest: Lung auscultation: vesicular breathing (normal). Wheezing: none. Heart sounds are muffled, the rhythm is regular, and there is no heart murmur. Abdominal cavity: The abdomen is symmetrical and distended; the painful area is localized in the epigastrium on superficial palpation; and pain is noted in all parts of the abdomen on deep palpation. There is no stool for two days. Urination is normal.
Recommendations:
Abdominal CT scan, stat!
Abdominal sonography, stat!
Surgeon consultation, stat!
CBC, general urinalysis, ALT, AST, ALP, GGT, total bilirubin, and creatinine cito!
9 September 2024
C34.3—Malignant neoplasm of lower lobe, bronchus, or lung (ICD-10 code).
Oncologist’s examination
Protocol type: confirmed malignant neoplasm.
Complaints at the time of examination: shortness of breath, pain in the chest on the right during a physical activity. No chronic pain syndrome.
Patient medical history: Suspected right lower lobe lung cancer since May 2018. On 31 October 2018, a right lower lobectomy was performed at the Regional Clinical Hospital.
Histological examination revealed grade 3 (G3) cancer, poorly differentiated and high-grade, with no lymph node metastases detected.
Immunohistochemical examination: G3 squamous cell carcinoma. Adjuvant treatment (additional chemotherapy or radiation therapy after the surgery) is not indicated.
Heredity: the father had lung cancer (code C34).
Concomitant diseases: acute myocardial infarction in 2005, hypertension, and benign prostatic hyperplasia.
He suffered a coronavirus infection in December 2020–January 2021 and was treated on an outpatient basis.
From 16 February to 20 February 2021, he received a treatment at the clinical research hospital.
Endoscopic polypectomy of two polyps was performed on 17 February 2021.
Histopathological examination: colon polyp tubular adenoma with low-grade dysplasia (benign intestinal tumor).
He has been smoking for over 60 years, one pack a day. Height 187 cm, weight 100 kg.
Instrumental studies:
Positron emission tomography combined with computed tomography (PET-CT) of 22 June 2020: diffuse increase in metabolic activity in the area of the postoperative scar, SUVmax 2.15. Masses appeared in the rectum with SUVmax 14.62 (SUV is an indicator of the accumulation intensity of a radiopharmaceutical drug, reflecting the process activity).
Fibrocolonoscopy of 21 July 2020: dolichosigma, epithelial masses in the descending colon and rectum. Biopsy: tubular adenoma of colon.
The chest CT scan of 16 February 2021, is unremarkable.
PET-CT scan of 16 August 2021: degenerative changes in the spine, 6th, 7th, and 8th rib fractures on the right side (surgical access), postoperative changes in the chest wall, SUVmax 2.69 without dynamics, and no new masses were detected.
Contrast-enhanced chest CT scan of 18 August 2022, showed signs of chronic bronchitis, partial bronchial obstruction with discharge, its postoperative deformation, and hypoventilation of the right lung middle lobe. Conclusion: fiberoptic bronchoscopy is indicated. Additionally, chronic obstructive pulmonary disease and bilateral chronic pyelonephritis were revealed (according to kidney sonography, there are no structural changes).
Thyroid sonography of 26 October 2022: right lobe—TI-RADS 3: several isoechoic nodes (TI-RADS 2) of 5–6 mm in diameter with smooth clear margins and perinodular blood flow are visualized; in its middle third part, a cystic multi-chamber formation of 17 × 15 × 16 mm (TI-RADS 3) with smooth clear margins and loci of perinodular blood flow is localized. Left lobe—TI-RADS 1.
On 26 October 2022, a sonography of the postoperative scar located in the right intercostal space was performed: In the middle third, a rounded isoechoic formation of 11 × 7 mm with smooth, clear margins and a hypoechoic border is localized above the scar; there are no signs of blood flow on color Doppler mapping (granuloma?). A hypoechoic formation of 7.5 × 5.0 mm with unclear and irregular margins is localized under the scar; there is no blood flow on color Doppler mapping (metastasis?).
Esophagogastroduodenoscopy on 26 November 2022: distal reflux esophagitis, signs of atrophic gastritis, acute gastric erosions, stomach epithelial formation, bulbitis.
Histological examination of 5 December 2022: gastric hyperplastic polyp with acute erosion.
Fibrocolonoscopy of 26 November 2022: colon diverticulosis, colon epithelial formations.
Chest CT scan of 1 September 2023: a condition after the right lower lobectomy. Staples are detected in the stump of the lower lobe bronchus and paramediastinal. An obvious solid induration has appeared around the staples, the adjacent segmental bronchi are constricted, a solid substrate has appeared in the medial middle lobe bronchus, and the middle lobe is hypoventilated. The bronchopulmonary lymph node anteriorly from the infiltrate has enlarged from 5 mm up to 9 mm along the short axis. There are no focal and infiltrative changes in the remaining parts of the right lower lobe and in the left lung.
PET-CT scan of 19 September 2023: No signs of the active tumor process are identified. There is no progression compared to 2021.
Chest CT scan of 13 April 2024: A negative dynamic is noted—increased atelectasis of the right middle lobe.
Fibrobronchoscopy of 20 May 2024: Signs of cicatricial deformation and granulomas around the suture material are identified during a bronchial stump examination. A biopsy was taken; the histological examination result of 24 May 2024, is chronic bronchitis. There are single cells suspected of atypia.
PET-CT scan of 28 June 2024: there is no data for tumor progression, no changes compared to 2023.
Thoracic oncologist examination on 13 August 2024. The patient is in satisfactory condition and has clear consciousness, ECOG scale 1 (heavy work is limited; light or sedentary work is permissible). Primary diagnosis:
C34.3—Malignant neoplasm of lower lobe, bronchus, or lung.
Clinical diagnosis: Right lower lobe lung cancer, stage IB (cT2aN0M0). A condition after the surgery (lobectomy in October 2018). Hyperplasia of the right supraclavicular lymph nodes since 2019. Suspected cancer progression since April 2024.
Confirmation method: morphological (histology).
Morphological type: squamous cell carcinoma.
Recommendations: interval contrast-enhanced chest CT scan, follow-up visit with test results. Since the creatinine level is increasing, a general practitioner or nephrologist consultation is recommended at the local clinic.
Instrumental diagnostics: PET-CT scan with fluorine-18 fluorodeoxyglucose (18F-FDG) was performed on 28 June 2024.
The examination was conducted from the frontal-parietal zone to the plantar surface. Physiological distribution of the radiopharmaceutical is noted in the brain, kidneys, partially along the ureters, bladder, and the intestinal loops.
Head and neck.
No foci of pathological accumulation of the radiopharmaceutical were detected. No pathological formations were identified in the brain tissue. The ventricles are not dilated; the median structures are not displaced. No pathological changes in the neck soft tissues were identified. The thyroid gland has a heterogeneous structure due to the presence of a hypodense lesion measuring 17 mm without pathological accumulation of the radiopharmaceutical in the right lobe. Neck and supraclavicular lymph nodes are not enlarged.
Chest. No pathological increase in accumulation of radiopharmaceutical was noted in the organs and soft tissues of the chest. There are no focal and infiltrative changes in the pulmonary parenchyma. There is a condition after the right lower lobectomy; pathological formations and focal accumulation of the radiopharmaceutical in the stump area are not detected. Partial atelectasis of the right lung middle lobe without focal abnormal FDG uptake is noted. The lumens of the trachea and large bronchi are visible except for the right lower lobe. No fluid was detected in the pleural cavities. Intrathoracic lymph nodes are not enlarged, with no abnormal accumulation of the radiopharmaceutical. The heart and mediastinal vascular structures are unchanged. No effusion in the pericardial cavity was detected. Atherosclerotic changes in the walls of the thoracic aorta and coronary arteries are noted.
Abdomen and pelvis. No pathological increase in radiopharmaceutical accumulation was noted in the organs and tissues of the abdominal cavity, retroperitoneal space, and pelvis. The stomach is insufficiently filled; no reliable pathological changes in its walls are observed. The liver is not enlarged and has a homogeneous structure. The parenchyma density is 45 Hounsfield units (HU). Intrahepatic and extrahepatic ducts and vessels are not dilated. The gallbladder is unchanged; no radio-opaque stones were detected. The pancreas is not enlarged, the structure shows age-related changes, the pancreatic duct is not dilated. The spleen is not enlarged; the structure is unchanged. The adrenal glands are not enlarged. The kidneys are typically located, and the perinephric tissues are indurated and stringy, which are post-inflammatory changes. The pyelocaliceal system and ureters are not dilated. No stones were found along the urinary tract. The urinary bladder is not full enough. No pathological formations or focal accumulation of the radiopharmaceutical are reliably detected along the rectum.
Skeletal system and soft tissues. No pathological accumulation of radiopharmaceuticals was observed in the skeleton bones. No bone-destructive or bone-traumatic changes were detected on top of degenerative changes, most pronounced in the right shoulder joint. Fracture of the right seventh rib with the pseudoarthrosis (without any dynamics). Scar postoperative changes without hyperfixation of radiopharmaceuticals in the soft tissues of the lateral surface of the right chest wall at the level of the fifth intercostal space remain (decrease in the size of granulomas in dynamics). Edema of the subcutaneous tissue of the right shin is noted. Radiation dose: 47.50 mSv.
Laboratory tests: Laboratory test result of 26 September 2024:
Hemoglobin A1c (glycated hemoglobin) 6.5% (normal range is 3.5–6%)
Laboratory test result of 30 January 2025:
Hemoglobin A1c (glycated hemoglobin) 6.6% (normal range is 3.5–6%)
Laboratory test result of 12 September 2024:
Urea 7.85 mmol/L (normal range is 2.8–7.2 mmol/L)
Discharge summary: DISCHARGE SUMMARY
Date: 5 April 2025, 08:24
Length of stay: 1 day
Admitting diagnosis:
M54.1 Radiculopathy. Degenerative dystrophic disease of the lumbar spine. Spondyloarthrosis. Facet joint syndrome.
Discharge diagnosis:
M54.1 Radiculopathy. Degenerative dystrophic disease of the lumbar spine. Spondyloarthrosis. Facet joint syndrome.
Complaints at admission:
Pain in the lumbar spine. Restricted movements in the lumbar spine due to pain.
History of present illness:
According to the patient, the pain syndrome is long-lasting. The disease is constantly present. He is receiving a treatment. A clinical examination was performed.
Examination results: Lumbosacral spine MRI—degenerative dystrophic changes in the spine.
Admission status:
Neurological status: Wakefulness level is clear consciousness. Orientation in space, time, and one’s own personality is preserved. Meningeal symptoms are absent. Pupils are symmetrical. Pupillary reaction to light is preserved. Visual fields are unchanged. The face is symmetrical. Speech is normal. There is no limb muscular paresis. There are no disturbances in sensitivity. The Romberg test is negative. There are no tension symptoms.
The general condition is satisfactory.
The skin is of normal color. Subcutaneous fat is moderately developed.
Respiratory system: The breathing type is independent. The respiratory rate is 24 breaths/min. The breathing rhythm is regular. The breathing pattern is vesicular (normal). There is no wheezing.
Cardiovascular system: Blood pressure is 120/80 mmHg, heart rate is 78 bpm.
Gastrointestinal tract: The tongue is pink and moist, without plaque. The abdomen is not enlarged, rounded, symmetrical, participates in breathing. On palpation, the abdomen is soft and painless. There are no symptoms of peritoneal irritation.
Genitourinary system: Urination is regular.
Local status: There are restricted movements in the lumbar spine due to pain and tension of paraspinal muscles of the lumbar spine. The “arch” symptom is positive. Pain on palpation of the facet joints; the pain intensity is 7 points according to the visual analog pain scale.
Treatment:
Periarticular drug injection
4 April 2025. In the operating room, under aseptic conditions, an X-ray-guided (in direct and oblique projections) medial branch block test of the right L2–L5 facet joints was performed.
Spinocan needle 22 G of 120 mm length was sequentially inserted to:
1. the transition point of the right L5 transverse process into the superior articular process;
2. the transition point of the right L4 transverse process into the superior articular process;
3. the transition point of the right L3 transverse process into the superior articular process;
4. the transition point of the right L2 transverse process into the superior articular process.
0.5 mL of ropivacaine 0.2% and 1 mg of dexamethasone were injected into each point.
An aseptic bandage was applied. During the procedure, vital signs were stable and neurological status was unchanged.
Conclusion:
A follow-up consultation at the outpatient Pain Management Center, if necessary, not earlier than in 2 weeks.
Recommendations:
Annual checkup (preventive medical examination).
Good-quality summary: Complaints that prompted the abdominal CT scan: Diffuse severe abdominal pain, intensifying after the meal, without relief after defecation. Nausea, general weakness. There is no stool for two days.
Disease history: Pain occurred a week before the examination; the patient associates it with eating fatty foods. He took ibuprofen 400–800 mg/day with a short-term effect.
Relevant medical history: Right lower lobe lung cancer, stage IB, rT2aN0M0 according to the TNM classification. A condition after the surgery (October 2018). Hyperplasia of the right supraclavicular lymph nodes since May 2019. Suspected cancer progression (April 2024).
Laboratory data (current, no more than six months before the study): none.
Instrumental study data relevant to the abdominal CT scan:
PET-CT scan of 28 June 2024. Conclusion: There is no evidence of the presence of increased metabolic activity foci typical for a neoplastic process with pathological accumulation of fluorine-18 fluorodeoxyglucose. Postoperative changes in the right anterior chest wall were noted. In comparison with the PET-CT scan of 19 September 2023, there are no structural dynamics, pathological foci with increased metabolism were not detected, and reliable data on the progression of the underlying disease were not obtained.
Moderate-quality summary: A 75-year-old patient complains of abdominal pain that intensifies after the meal. Nausea, general weakness, and the absence of stool for two days are noted. The pain occurred about a week ago after breaking the diet. The patient took ibuprofen with a temporary effect.
Physician examination: pain all over the abdomen, especially in the epigastrium. He was referred for the abdominal CT scan urgently.
From the medical history: right lung cancer; he underwent lobectomy; no relapse has been currently detected. Chronic pancreatitis, gastritis, bulbitis, and hyperplastic gastric polyp. Tubular adenoma and diverticular disease of the colon were also detected; fibrocolonoscopy and esophagogastroduodenoscopy were previously performed. There are concomitant diseases: arterial hypertension, ischemic heart disease, atrial fibrillation, chronic kidney disease, erythremia, and dyscirculatory encephalopathy. Decreased glucose tolerance is noted. The glycated hemoglobin level is 6.6%. He has had a coronavirus infection; he is currently under the observation of a pulmonologist. He suffers from dorsopathy and the lumbar pain syndrome. He underwent facet joint block injections. The complaints remain, the general condition is stable, the score on the Eastern Cooperative Oncology Group (ECOG) scale is 1 point (heavy work is limited, and light work is permissible).
Poor-quality summary: Complaints:
The patient complains of the absence of stool for two days.
Disease history:
There is no edema. Breathing and lung characteristics are assessed as normal (vesicular breathing).
Heart sounds are muffled, the rhythm is regular, and there is no heart murmur.
The abdomen is symmetrical and slightly bloated. The painful area is localized in the epigastrium on superficial palpation. Pain is noted in all parts of the abdomen on deep palpation.
Urination is normal.
The patient’s medical history (concomitant diseases, bad habits, family history, and surgeries).
Status: Restrictions of movement due to pain. Tension of the lumbar paraspinal muscles. Positive “arch” symptom and pain on palpation of the facet joints.
Pain intensity assessment on a visual analog scale is 7 points.
Laboratory data are not specified in the provided text.
Instrumental examinations:
The patient was referred for the abdominal CT scan.
Results of the lumbosacral MRI scan: degenerative and dystrophic changes in the spine.

Appendix B. ELEGANCE Questionnaire

1. To what extent does the LLM output align with the query?
(Select one option by ticking the box or marking with a cross)
The result does not address the query; the information is useless.  
The result addresses the query in part, but key aspects are missing.
The result corresponds to the query, though some details are absent.
The result is close to the expectations; key aspects are taken into account, but there are minor flaws.
The result perfectly meets the query; all key aspects are taken into account, and the information is useful and accurate.
2. Does the LLM provide not only the explicitly requested data but also additional information relevant to completing the task?
(Select one option by ticking the box or marking with a cross)
There is no additional information; the result is limited to only the explicitly requested data.  
There is some additional information, but not enough for the sake of completeness.
The main additional information is taken into account, but there are some omissions.
Most of the important additional information is included; the result is close to the expectations.
All possible additional information that may be important is taken into account.
3. To what extent does the LLM result help solve the task?
(Select one option by ticking the box or marking with a cross)
The result is useless or not in alignment with the task.  
The result is not completely useless, but it does not allow solving the task.
The result is generally useful, but there are some flaws.
The result is close to optimal with minor flaws (e.g., redundant data).
The result is ideal for solving the task.
4. To what extent is the LLM-generated text clear, logical, and well structured?
(Select one option by ticking the box or marking with a cross)
The text is unclear; there is no logic in the presentation.  
The text is partially understandable, but the logic and structure are weak.
The text is generally clear, but there are some flaws in the logic or structure.
The text is clear, logical, and structured; the flaws are minimal.
The text is perfectly clear, logical, and structured.
5. To what extent is the LLM-generated text linguistically and terminologically accurate?
(Select one option by ticking the box or marking with a cross)
The text contains many errors in language and terminology, which makes it unusable.  
The text contains significant errors in language or terminology that make it difficult to understand.
The text is mostly correct, but there are some errors or linguistic inaccuracies.
The text is almost completely correct; errors are minimal and do not affect understanding.
The text perfectly conforms to the norms of language and professional terminology; there are no errors.
6. How satisfied are you with the result in terms of usefulness, clarity, and fulfillment of your expectations for LLM-based medical record summarization?
(Select one option by ticking the box or marking with a cross)
The result does not meet expectations at all.  
The result partially meets expectations but has significant shortcomings. The information is only marginally useful; key aspects are missing.
The result generally meets expectations, but there are some flaws. The information is useful but requires further improvement or clarification.
The result is close to expectations; minor flaws do not affect the overall usefulness.
The result fully meets expectations.
7. Does the LLM response contain any information absent from the provided patient medical records?
(Circle the selected answer)
  • Yes
  • No

References

  1. Vasilev, Y.A.; Reshetnikov, R.V.; Nanova, O.G.; Vladzymyrskyy, A.V.; Arzamasov, K.M.; Omelyanskaya, O.V.; Kodenko, M.R.; Erizhokov, R.A.; Pamova, A.P.; Seradzhi, S.R.; et al. Application of large language models in radiological diagnostics: A scoping review. Digit. Diagn. 2025, 6, 268–285. [Google Scholar] [CrossRef]
  2. Bednarczyk, L.; Reichenpfader, D.; Gaudet-Blavignac, C.; Ette, A.K.; Zaghir, J.; Zheng, Y.; Bensahla, A.; Bjelogrlic, M.; Lovis, C. Scientific Evidence for Clinical Text Summarization Using Large Language Models: Scoping Review. J. Med. Internet Res. 2025, 27, e68998. [Google Scholar] [CrossRef]
  3. Vasilev, Y.A.; Kozhikhina, D.D.; Vladzymyrskyy, A.V.; Shumskaya, Y.F.; Mukhortova, A.N.; Blokhin, I.A.; Suchilova, M.M.; Reshetnikov, R.V. Results of the Work of the Reference Center for Diagnostic Radiology with Using Telemedicine Technology. Zdravoohran. Ross. Fed. 2024, 68, 102–108. [Google Scholar] [CrossRef]
  4. Van Veen, D.; Van Uden, C.; Blankemeier, L.; Delbrouck, J.-B.; Aali, A.; Bluethgen, C.; Pareek, A.; Polacin, M.; Reis, E.P.; Seehofnerová, A.; et al. Adapted Large Language Models Can Outperform Medical Experts in Clinical Text Summarization. Nat. Med. 2024, 30, 1134–1142. [Google Scholar] [CrossRef] [PubMed]
  5. Van Veen, D.; Van Uden, C.; Blankemeier, L.; Delbrouck, J.B.; Aali, A.; Bluethgen, C.; Pareek, A.; Polacin, M.; Reis, E.P.; Seehofnerová, A.; et al. Clinical Text Summarization: Adapting Large Language Models Can Outperform Human Experts. Res. Sq. 2023, rs.3.rs-3483777. [Google Scholar] [CrossRef]
  6. Tang, L.; Sun, Z.; Idnay, B.; Nestor, J.G.; Soroush, A.; Elias, P.A.; Xu, Z.; Ding, Y.; Durrett, G.; Rousseau, J.F.; et al. Evaluating Large Language Models on Medical Evidence Summarization. Npj Digit. Med. 2023, 6, 158. [Google Scholar] [CrossRef]
  7. Tang, L.; Goyal, T.; Fabbri, A.; Laban, P.; Xu, J.; Yavuz, S.; Kryscinski, W.; Rousseau, J.; Durrett, G. Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; Long Papers. Association for Computational Linguistics: Toronto, ON, Canada, 2023; Volume 1, pp. 11626–11644. [Google Scholar]
  8. Barbella, M.; Tortora, G. Rouge Metric Evaluation for Text Summarization Techniques. SSRN J. 2022. [Google Scholar] [CrossRef]
  9. Reiter, E. A Structured Review of the Validity of BLEU. Comput. Linguist. 2018, 44, 393–401. [Google Scholar] [CrossRef]
  10. Lavie, A.; Agarwal, A. Meteor: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic, 23 June 2007; pp. 228–231. [Google Scholar]
  11. Datta, G.; Joshi, N.; Gupta, K. Analysis of Automatic Evaluation Metric on Low-Resourced Language: BERTScore vs BLEU Score. In Speech and Computer; Prasanna, S.R.M., Karpov, A., Samudravijaya, K., Agrawal, S.S., Eds.; Springer International Publishing: Cham, Switzerland, 2022; Volume 13721, pp. 155–162. ISBN 9783031209796. [Google Scholar]
  12. Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; et al. Publisher Correction: Large Language Models Encode Clinical Knowledge. Nature 2023, 620, E19. [Google Scholar] [CrossRef]
  13. Sivarajkumar, S.; Kelley, M.; Samolyk-Mazzanti, A.; Visweswaran, S.; Wang, Y. An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing: Algorithm Development and Validation Study. JMIR Med. Inform. 2024, 12, e55318. [Google Scholar] [CrossRef]
  14. Chiang, C.-H.; Lee, H. Can Large Language Models Be an Alternative to Human Evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; Long Papers. Association for Computational Linguistics: Toronto, ON, Canada, 2023; Volume 1, pp. 15607–15631. [Google Scholar]
  15. Song, H.; Su, H.; Shalyminov, I.; Cai, J.; Mansour, S. FineSurE: Fine-Grained Summarization Evaluation Using LLMs. arXiv 2024, arXiv:2407.00908. [Google Scholar]
  16. Tam, T.Y.C.; Sivarajkumar, S.; Kapoor, S.; Stolyar, A.V.; Polanska, K.; McCarthy, K.R.; Osterhoudt, H.; Wu, X.; Visweswaran, S.; Fu, S.; et al. A Framework for Human Evaluation of Large Language Models in Healthcare Derived from Literature Review. npj Digit. Med. 2024, 7, 258. [Google Scholar] [CrossRef] [PubMed]
  17. Vaira, L.A.; Lechien, J.R.; Abbate, V.; Allevi, F.; Audino, G.; Beltramini, G.A.; Bergonzani, M.; Boscolo-Rizzo, P.; Califano, G.; Cammaroto, G.; et al. Validation of the Quality Analysis of Medical Artificial Intelligence (QAMAI) Tool: A New Tool to Assess the Quality of Health Information Provided by AI Platforms. Eur. Arch. Otorhinolaryngol. 2024, 281, 6123–6131. [Google Scholar] [CrossRef]
  18. Okuhara, T.; Ishikawa, H.; Ueno, H.; Okada, H.; Kato, M.; Kiuchi, T. Influence of high versus low readability level of written health information on self-efficacy: A randomized controlled study of the processing fluency effect. Health Psychol. Open 2020, 7, 2055102920905627. [Google Scholar] [CrossRef]
  19. Ebbers, T.; Kool, R.B.; Smeele, L.E.; Dirven, R.; den Besten, C.A.; Karssemakers, L.H.E.; Verhoeven, T.; Herruer, J.M.; van den Broek, G.B.; Takes, R.P. The Impact of Structured and Standardized Documentation on Documentation Quality; a Multicenter, Retrospective Study. J. Med. Syst. 2022, 46, 46. [Google Scholar] [CrossRef]
  20. Appelman, A.; Schmierbach, M. Make No Mistake? Exploring Cognitive and Perceptual Effects of Grammatical Errors in News Articles. Journal. Mass Commun. Q. 2018, 95, 930–947. [Google Scholar] [CrossRef]
  21. Lozano, L.M.; García-Cueto, E.; Muñiz, J. Effect of the Number of Response Categories on the Reliability and Validity of Rating Scales. Methodology 2008, 4, 73–79. [Google Scholar] [CrossRef]
  22. Koo, M.; Yang, S.-W. Likert-Type Scale. Encyclopedia 2025, 5, 18. [Google Scholar] [CrossRef]
  23. Soper, D.S. A-Priori Sample Size Calculator for Structural Equation Models [Software]. 2025. Available online: https://www.danielsoper.com/statcalc (accessed on 1 September 2025).
  24. Comrey, A.L.; Lee, H.B. A First Course in Factor Analysis, 2nd ed.; Psychology Press: London, UK, 1992. [Google Scholar] [CrossRef]
  25. Vasilev, Y.; Vladzymyrskyy, A.; Mnatsakanyan, M.; Omelyanskaya, O.; Reshetnikov, R.; Alymova, Y.; Shumskaya, Y.; Akhmedzyanova, D. Questionnaires Validation Methodology; State Budget-Funded Health Care Institution of the City of Moscow “Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Health Care Department”: Moscow, Russia, 2024; Volume 133. [Google Scholar]
  26. Brown, T.A. Confirmatory Factor Analysis for Applied Research. In Methodology in the Social Sciences, 2nd ed.; The Guilford Press: New York, NY, USA; London, UK, 2015; ISBN 9781462517794. [Google Scholar]
  27. Shou, Y.; Sellbom, M.; Chen, H.-F. Fundamentals of Measurement in Clinical Psychology. In Comprehensive Clinical Psychology; Elsevier: Amsterdam, The Netherlands, 2022; pp. 13–35. ISBN 9780128222324. [Google Scholar]
  28. Lechien, J.R.; Maniaci, A.; Gengler, I.; Hans, S.; Chiesa-Estomba, C.M.; Vaira, L.A. Validity and Reliability of an Instrument Evaluating the Performance of Intelligent Chatbot: The Artificial Intelligence Performance Instrument (AIPI). Eur. Arch. Otorhinolaryngol. 2024, 281, 2063–2079. [Google Scholar] [CrossRef]
  29. Sallam, M.; Barakat, M.; Sallam, M. Pilot Testing of a Tool to Standardize the Assessment of the Quality of Health Information Generated by Artificial Intelligence-Based Models. Cureus 2023, 15, e49373. [Google Scholar] [CrossRef]
  30. Vasilev, Y.A.; Vladzymyrskyy, A.V.; Alymova, Y.A.; Akhmedzyanova, D.A.; Blokhin, I.A.; Romanenko, M.O.; Seradzhi, S.R.; Suchilova, M.M.; Shumskaya, Y.F.; Reshetnikov, R.V. Development and Validation of a Questionnaire to Assess the Radiologists’ Views on the Implementation of Artificial Intelligence in Radiology (ATRAI-14). Healthcare 2024, 12, 2011. [Google Scholar] [CrossRef]
Figure 1. Study design.
Figure 2. Questionnaire development flowchart.
Figure 3. The framework for the validation study design.
Figure 4. Pilot testing results histogram for summarization evaluations (n = 24). Results of both test and retest evaluations are shown.
Figure 5. Total scoring distribution of expert-rated medical record summaries categorized by ground-truth quality classes. Circles depict outliers.
Table 1. Methods for assessing reliability and validity.

| Dimension | Method | Thresholds |
|---|---|---|
| Internal consistency | Cronbach's alpha | ≤0.5: unacceptable; >0.5: poor; >0.6: questionable; >0.7: acceptable; >0.8: good; >0.9: excellent [25] |
| Construct validity | Confirmatory factor analysis | Comparative Fit Index (CFI) ≥ 0.9; Root Mean Square Error of Approximation (RMSEA) < 0.08; Standardized Root Mean Squared Residual (SRMR) < 0.08; Tucker–Lewis Index (TLI) ≥ 0.9 [26] |
| Differentiation by known groups | Univariate nonparametric analysis of variance (Kruskal–Wallis rank test) | p-value < 0.05 |
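For readers who wish to reproduce the reliability and known-groups checks summarized in Table 1, the following is a minimal Python sketch, not the authors' actual analysis code. It assumes expert ratings are stored as a respondents × items matrix; the toy `ratings` array and the per-class total-score lists are illustrative stand-ins for the study data.

```python
import numpy as np
from scipy.stats import kruskal

def cronbach_alpha(ratings: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) score matrix."""
    k = ratings.shape[1]                          # number of items
    item_vars = ratings.var(axis=0, ddof=1)       # per-item variances
    total_var = ratings.sum(axis=1).var(ddof=1)   # variance of summed scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Toy data: 6 respondents rating 7 items on a 5-point scale.
rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(6, 7)).astype(float)
print(f"Cronbach's alpha: {cronbach_alpha(ratings):.2f}")

# Known-groups differentiation: compare total scores across
# ground-truth quality classes with a Kruskal-Wallis rank test.
low, medium, high = [12, 14, 13], [20, 22, 19], [30, 31, 29]  # illustrative totals
stat, p_value = kruskal(low, medium, high)
print(f"Kruskal-Wallis H = {stat:.2f}, p = {p_value:.4f}")
```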
Table 2. ELEGANCE questionnaire structure.

| Dimension | Question | Evaluation System |
|---|---|---|
| Relevance | To what extent does the LLM output align with the query? | 5-point Likert scale |
| Completeness | Does the LLM provide not only the explicitly requested data but also additional information relevant to completing the task? | 5-point Likert scale |
| Applicability | To what extent does the LLM result help solve the task? | 5-point Likert scale |
| Falsification | Does the LLM response contain any information absent from the provided patient medical records? | Binary yes/no answer |
| Satisfaction | How satisfied are you with the result in terms of usefulness, clarity, and fulfillment of your expectations for LLM-based medical record summarization? | 5-point Likert scale |
| Structure | To what extent is the LLM-generated text clear, logical, and well structured? | 5-point Likert scale |
| Correctness of language and terminology | To what extent is the LLM-generated text linguistically and terminologically accurate? | 5-point Likert scale |
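Tooling that collects ELEGANCE ratings must respect the mixed response formats in Table 2 (six Likert items plus one binary item). The sketch below is one possible encoding; the field names and the validation rule are assumptions for illustration, not part of the published instrument.

```python
# Illustrative encoding of the ELEGANCE items from Table 2.
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    dimension: str
    scale: str  # "likert5" or "binary"

ELEGANCE_ITEMS = [
    Item("Relevance", "likert5"),
    Item("Completeness", "likert5"),
    Item("Applicability", "likert5"),
    Item("Falsification", "binary"),
    Item("Satisfaction", "likert5"),
    Item("Structure", "likert5"),
    Item("Correctness of language and terminology", "likert5"),
]

def validate_response(item: Item, value: int) -> None:
    """Reject scores outside the item's response scale."""
    valid = range(1, 6) if item.scale == "likert5" else (0, 1)
    if value not in valid:
        raise ValueError(f"{item.dimension}: {value} not in {list(valid)}")

# Example: check one expert's ratings before aggregation.
scores = [5, 4, 5, 0, 4, 5, 4]  # illustrative ratings, one per item
for item, value in zip(ELEGANCE_ITEMS, scores):
    validate_response(item, value)
```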
Table 3. Factor loadings for the two-factor model.

| Factor | Dimension | Standardized Factor Loading | p-Value |
|---|---|---|---|
| F1 | Relevance | 0.95 | <0.001 |
| F1 | Completeness | 0.86 | <0.001 |
| F1 | Applicability | 0.96 | <0.001 |
| F1 | Falsification | 0.45 | <0.001 |
| F1 | Satisfaction | 0.97 | <0.001 |
| F2 | Structure | 0.90 | <0.001 |
| F2 | Correctness of language and terminology | 0.79 | <0.001 |
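As a pointer for reimplementation, a two-factor confirmatory factor analysis matching Table 3 could be specified as follows. This is a sketch assuming the open-source semopy package and hypothetical column names (relevance through language_correctness); the paper does not state which software the authors used.

```python
import pandas as pd
import semopy

# F1 (content) loads on five dimensions; F2 (form) on the remaining two.
MODEL_DESC = """
F1 =~ relevance + completeness + applicability + falsification + satisfaction
F2 =~ structure + language_correctness
"""

def fit_cfa(df: pd.DataFrame) -> pd.DataFrame:
    """Fit the two-factor CFA; df needs one column per dimension."""
    model = semopy.Model(MODEL_DESC)
    model.fit(df)
    # calc_stats reports fit indices such as RMSEA, CFI, and TLI.
    return semopy.calc_stats(model)
```

Passing a DataFrame of the expert evaluations to fit_cfa would yield the kind of fit summary reported in the Abstract (RMSEA, CFI, TLI).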