1. Introduction
The integration of artificial intelligence (AI) in medicine has long been a subject of discussion; however, a notable increase in clinical studies and investments in this field has been observed in recent years [1,2]. These technologies, which are increasingly being utilized in diagnosis, treatment planning, and clinical decision support systems, have significantly transformed the ways in which information is accessed and used in healthcare [2,3].
Large language models (LLMs), as a key component of these advancements, have attracted considerable attention due to their ability to generate instant and structured responses to natural language queries [4,5]. Systems such as ChatGPT and Gemini have emerged as alternative sources of medical information not only for healthcare professionals but also for patients, owing to their ability to provide rapid access to medical knowledge [1,6]. This has brought into focus the potential role of these systems, particularly in areas where health literacy and patient education are of increasing importance [6,7].
Nevertheless, the role of AI applications in clinical practice has not developed uniformly across all disciplines, and the number of studies remains relatively limited, particularly in the fields of rehabilitation and musculoskeletal disorders [8,9].
Adolescent idiopathic scoliosis (AIS) is a commonly encountered spinal deformity in clinical practice that requires long-term follow-up. The success of conservative treatment approaches largely depends on the patient’s understanding of the condition and active participation in the treatment process [10,11]. Today, patients and their families frequently seek health-related information via the internet and increasingly turn to AI-based systems during this process [1,6]. Moreover, the fact that scoliosis predominantly affects adolescents is noteworthy, as this population demonstrates high engagement with digital technologies and a strong capacity to adapt to innovative technological tools.
In this context, LLMs may serve as a potential source of information in patient education processes [6,12]. The growing tendency of young individuals to rely on digital resources and AI-based tools for health-related inquiries necessitates the evaluation of the accuracy, reliability, and clinical appropriateness of the information provided by these systems [1,13].
Although existing studies indicate a growing body of evidence regarding the performance of LLMs across various medical domains [1,2], few studies have comprehensively evaluated responses to patient-oriented queries specifically in the context of scoliosis [8,14,15]. Furthermore, these responses should be assessed not only in terms of accuracy but also in terms of readability and patient-centered appropriateness [6,13]. While AI-based LLMs offer potential benefits in patient education, they must be carefully evaluated due to the risk of generating inaccurate or incomplete information [1,13].
This study aimed to evaluate ChatGPT-5.3 (OpenAI, San Francisco, CA, USA; 2026) responses to patient-oriented questions on adolescent idiopathic scoliosis in terms of expert-perceived appropriateness, scientific accuracy, adequacy, clarity, and potential clinical risk.
2. Materials and Methods
2.1. Study Design and Ethical Considerations
This study was designed as a cross-sectional expert evaluation study aimed at assessing the content validity and quality characteristics of responses provided to patient-oriented questions regarding the conservative management of idiopathic scoliosis. The study process consisted of two main phases: (1) the development of a question set reflecting the patient perspective, and (2) the multidimensional evaluation of AI-generated responses to these questions by expert clinicians.
This study did not involve patients, patient records, biological samples, clinical interventions, or identifiable patient data. The study was based solely on an anonymous voluntary survey completed by expert clinicians who evaluated AI-generated educational content. No identifiable or sensitive personal data were collected. Electronic informed consent was obtained from all expert participants before participation.
Because the study did not involve patients, clinical interventions, biological materials, or identifiable health data, formal ethical approval was not required according to the applicable national regulatory framework [16]. Nevertheless, the study was conducted in accordance with the ethical principles of voluntary participation, anonymity, confidentiality, and responsible research conduct. The reporting of the study was also informed by the DECIDE-AI framework for the early-stage evaluation of artificial intelligence–based clinical decision support systems [17].
2.2. Development of the Question Set
The questions used in this study were determined through a systematic content review to represent the topics most frequently raised by patients with scoliosis in routine clinical practice. For this purpose, authoritative national and international websites were examined, including the Turkish Scoliosis Research and Treatment Society (https://skolyozdernegi.com/, accessed on 30 January 2026), the Scoliosis Research Society (https://www.srs.org/, accessed on 30 January 2026), the International Society on Scoliosis Orthopaedic and Rehabilitation Treatment (SOSORT) (https://sosort.org/, accessed on 30 January 2026), and the National Scoliosis Foundation (https://www.scoliosis.org/, accessed on 30 January 2026).
In addition, a Google search was conducted to identify commonly accessed patient-oriented information. To ensure standardization and minimize potential bias, the search was performed on 30 January 2026, using incognito mode with a cleared browser cache and no active user account to avoid personalization effects. All searches were conducted in Turkish using the keyword “skolyoz,” with location settings fixed to Türkiye. The first 10 non-sponsored results were included, while advertisements and promoted content were excluded from the analysis.
The data obtained from these sources were combined to create a comprehensive pool of questions covering the most frequently expressed concerns of patients. The question pool was reviewed in terms of clinical relevance, content coverage, and clarity. Duplicate or conceptually overlapping questions were removed, and the final set of questions was established. As a result, a total of 20 questions were included in the study, covering key aspects of the condition, including definition, etiology, progression, diagnostic methods, conservative treatment approaches, and impact on daily life (Table 1).
All survey materials and evaluations were conducted in Turkish to ensure consistency with the language of the search and the target patient population.
Questions were not intended to represent an exhaustive or validated inventory of all patient informational needs. Instead, they were designed to capture common patient-oriented themes encountered in publicly accessible scoliosis information sources. The final set was reviewed for thematic coverage, clinical relevance, clarity, and avoidance of conceptual overlap.
2.3. Generation of AI Responses
The selected questions were submitted to a large language model (ChatGPT-5.3) on the same day using a standardized approach. To ensure consistency and reflect a patient-centered perspective, each question was prefaced with the phrase “I am a patient with scoliosis.”
Additionally, to ensure comparability in terms of length and content, the model was instructed to generate responses between 150 and 200 words for each question. The generated responses were recorded verbatim without any modifications and compiled in a standardized format for evaluation.
To prevent carry-over effects between responses, each question was submitted in a separate, newly opened ChatGPT session.
2.4. Expert Panel
An expert panel consisting of healthcare professionals from different disciplines was formed to evaluate the responses. The panel included clinicians with experience in physical medicine and rehabilitation, orthopedics and traumatology, and physiotherapy. All participants were required to have clinical experience in the evaluation and management of patients with scoliosis.
2.5. Evaluation Procedure
An online evaluation form (Google Forms) containing all AI-generated responses was distributed to the experts. Each response was evaluated independently and anonymously.
Experts were asked to assess each response across four dimensions: overall appropriateness, scientific accuracy, adequacy, and clarity.
For reporting clarity, each evaluation dimension was operationally defined. Overall appropriateness referred to the suitability of the response for patient education in a clinical context. Scientific accuracy referred to consistency with current accepted medical knowledge. Adequacy referred to whether the response sufficiently covered the key aspects of the question. Clarity referred to the understandability of the response for a patient audience.
All evaluations were performed using a 6-point Likert scale:
1: Completely inappropriate
2: Inappropriate
3: Partially inappropriate
4: Partially appropriate
5: Appropriate
6: Completely appropriate
For content validity (CVR/CVI) analysis, Likert scores were dichotomized in accordance with the predefined content validity framework:
1–3: Inappropriate
4–6: Appropriate
This dichotomization was used only for CVR/CVI analyses. The original 6-point ordinal ratings were retained for descriptive analyses and sensitivity analyses.
For responses rated as “inappropriate,” experts were asked to specify the reason. The following predefined categories were used (multiple selections were allowed, with optional additional comments):
Contains scientific errors
Contains incomplete information
Does not adequately answer the question/off-topic
Language or expression issues
Contains excessive or unnecessary information
Other (with open-ended explanation)
2.6. Clinical Risk Assessment
Clinical risk was evaluated based on the potential of each response to cause patient misguidance, delayed care, or inappropriate self-management. Risk levels were classified as harmless, low risk, moderate risk, and high risk according to the operational definitions presented in Table 2.
At the end of the survey, experts were also asked to provide overall evaluations regarding:
The general quality of ChatGPT responses
Their usability in patient education
Their appropriateness for use without physician supervision
Commonly encountered issues
2.7. Statistical Analysis
All statistical analyses were performed using IBM SPSS Statistics for Windows, Version 31.0 (IBM Corp., Armonk, NY, USA). Ordinal data were presented as median and interquartile range (IQR), while categorical data were expressed as frequency and percentage (%).
Content validity was assessed using the Content Validity Ratio (CVR). CVR was calculated using the following formula:
$$\mathrm{CVR} = \frac{n_e - N/2}{N/2}$$
where $n_e$ represents the number of experts rating a response as “appropriate,” and N represents the total number of experts.
The Content Validity Index (CVI) was calculated as the average of CVR values across all questions.
Considering the number of experts (N = 51), the minimum acceptable CVR threshold was determined to be approximately 0.26 according to Lawshe’s criteria, and values above this threshold were considered acceptable in terms of content validity [18].
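As an illustration of the calculation described above, the following minimal Python sketch dichotomizes 6-point ratings and computes CVR and CVI. It is not the analysis code used in this study (the analyses were run in SPSS); the function names and the example ratings are hypothetical and serve only to make the arithmetic concrete.

```python
from typing import Sequence


def is_appropriate(score: int) -> bool:
    """Dichotomize a 6-point Likert score: 1-3 -> False, 4-6 -> True."""
    if not 1 <= score <= 6:
        raise ValueError(f"score must be between 1 and 6, got {score}")
    return score >= 4


def cvr(n_appropriate: int, n_experts: int) -> float:
    """Content validity ratio: (n_e - N/2) / (N/2)."""
    if n_experts <= 0 or not 0 <= n_appropriate <= n_experts:
        raise ValueError("need n_experts > 0 and 0 <= n_appropriate <= n_experts")
    half = n_experts / 2
    return (n_appropriate - half) / half


def cvr_from_ratings(ratings: Sequence[int]) -> float:
    """CVR for one item, computed from its raw 6-point ratings."""
    n_e = sum(is_appropriate(r) for r in ratings)
    return cvr(n_e, len(ratings))


def cvi(cvr_values: Sequence[float]) -> float:
    """CVI as the mean of item-level CVR values (the definition used in this text)."""
    if not cvr_values:
        raise ValueError("cvr_values must not be empty")
    return sum(cvr_values) / len(cvr_values)


# Hypothetical ratings from five raters for two items (illustration only,
# not study data).
example_items = [[6, 5, 4, 3, 6], [6, 6, 5, 5, 4]]
example_cvrs = [cvr_from_ratings(item) for item in example_items]
example_cvi = cvi(example_cvrs)
```

With N = 51, an item rated as appropriate by 48 experts (94.1%) gives cvr(48, 51) of approximately 0.88, and unanimous agreement gives 1.00.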
The primary analysis was conducted based on the “overall appropriateness” dimension. Additionally, CVR values were calculated separately for scientific accuracy, adequacy, and clarity to provide a detailed evaluation of responses across different quality dimensions.
Reasons for inadequate or inappropriate responses were analyzed as categorical variables, and their frequency and percentage distributions were calculated. The chi-square test was used to compare the distribution of inadequacy reasons across different questions.
Inter-rater agreement was assessed using Fleiss’ kappa in IBM SPSS Statistics for Windows, version 26.0 (IBM Corp., Armonk, NY, USA; 2019).
3. Results
A total of 51 experts participated in the study. Of these, 52.9% were female and 47.1% were male. The mean age was 37.8 ± 7.4 years, and the mean professional experience was 12.5 ± 7.7 years. Participants were from the fields of Physical Medicine and Rehabilitation (47.1%), Orthopedics and Traumatology (29.4%), and Physiotherapy (23.5%). All participants had experience in the assessment and treatment of scoliosis (Table 3).
Fleiss’ kappa for the primary evaluation dimension, overall appropriateness, was 0.138.
Table 4 presents the item-level appropriateness rates and CVR values for all evaluated questions across the four assessment dimensions. The CVR/CVI analysis demonstrated a high level of expert agreement across all evaluated dimensions. CVR values met or exceeded the predefined acceptable threshold for all questions, with most items achieving values close to 1.00. The proportion of responses rated as appropriate ranged from 94.1% to 100.0%.
Relatively lower CVR values were observed in a limited number of questions, particularly for Q18, which addressed the most suitable sport for scoliosis. For this item, scientific accuracy and adequacy showed the lowest agreement levels, with appropriateness rates of 94.1% and CVR values of 0.88. Minor dimension-specific variations were also observed in selected items, including Q2, Q6, Q7, Q8, Q12, Q16, and Q20; however, overall expert agreement remained high.
The overall Content Validity Index (CVI) was 0.99 for overall appropriateness and clarity, and 0.98 for scientific accuracy and adequacy. These findings indicate a high level of expert agreement regarding the perceived appropriateness of the AI-generated responses across the evaluated dimensions.
As a sensitivity analysis, the original 6-point ordinal ratings were analyzed without dichotomization. Mean scores remained high across all evaluation dimensions, ranging from 5.03 ± 1.01 for adequacy to 5.17 ± 0.99 for clarity. The median score was 6.0 across all dimensions, with an IQR of 2.0, indicating that the overall rating pattern remained favorable when the full ordinal scale was considered (Table 5).
Analysis of the reasons for inappropriate ratings in overall appropriateness revealed that the most common issue was insufficient information (52.2%), followed by scientific inaccuracies (21.7%), language or clarity issues (8.7%), excessive or unnecessary information (8.7%), and failure to adequately answer the question (4.3%) (Table 6).
Expert opinions regarding the overall evaluation and educational usefulness of ChatGPT-generated responses are presented in Figure 1 and Figure 2. Overall, 27 experts (52.9%) rated the responses as very good, 19 (37.3%) as good, 4 (7.8%) as fair, and 1 (2.0%) as very poor. Regarding usefulness for patient education, 32 experts (62.7%) answered “yes,” 14 (27.5%) answered “partially,” and 5 (9.8%) answered “no.”
However, experts were more cautious regarding use without physician supervision. As shown in Figure 3, only 2 experts (3.9%) considered it appropriate to present AI-generated responses to patients without physician supervision, whereas 25 (49.0%) answered “partially” and 24 (47.1%) answered “no.”
In terms of clinical risk, the majority of responses were classified as harmless or low risk, while a smaller proportion were categorized as moderate risk (Figure 4). No responses were classified as high risk. Risk percentages were calculated across all expert-response ratings (20 questions × 51 experts = 1020 ratings).
Regarding commonly encountered issues in the end-of-survey evaluation, the most frequently reported issue was insufficient information (47.1%), while no issues were reported in 35.3% of evaluations. Other reported problems included incomplete or off-topic responses (7.8%), excessive information (5.9%), and scientific errors (3.9%) (Table 7).
4. Discussion
In this study, responses generated by a GPT-5.3–based large language model to patient-oriented questions on scoliosis were analyzed using a multidimensional expert evaluation approach. The findings indicate that, in this controlled expert-evaluation setting, ChatGPT-5.3 responses were generally perceived by clinicians as appropriate and understandable for patient education; however, such responses should still be carefully interpreted in terms of factual accuracy, content adequacy, and clinical safety.
Compared with previous studies evaluating earlier versions of large language models, our findings showed higher CVR/CVI values and higher levels of expert-rated appropriateness [15]. For instance, a recent study evaluating ChatGPT-4.0 in the context of scoliosis reported that 78.5% of responses met the minimum CVR threshold, with an overall CVI of 0.68, and several responses showed limitations in scientific accuracy and completeness [15]. In contrast, in the present study, all responses exceeded the predefined CVR threshold, and overall CVI values were 0.98 to 0.99 across the four evaluation dimensions. This difference may partly reflect improvements in large language model performance over time; however, it may also be related to differences in question selection, prompt structure, language, expert panel composition, scoring framework, and statistical handling of ratings. Therefore, direct version-based superiority cannot be inferred.
These findings are specific to ChatGPT-5.3 under the standardized prompting conditions used in this study. Patients may use different or freely accessible LLM versions in real-world settings, and these versions may vary in accuracy, completeness, safety safeguards, and response style. Therefore, the present findings should not be generalized to all AI-generated scoliosis information or interpreted as supporting independent patient use. Regardless of the source of differences between models, the persistence of insufficient detail in some responses indicates that professional contextualization remains important. Importantly, unlike previous studies, our analysis also incorporated clinical risk assessment, thereby providing additional descriptive insight into the perceived safety profile of AI-generated responses.
From an expert-agreement perspective, all questions exceeded the predefined CVR threshold, and the overall CVI values were high, indicating that the responses were generally perceived as appropriate by the expert panel. These findings are consistent with previous studies suggesting that large language models may generate medical information that is often perceived as understandable and useful, although concerns regarding completeness, contextual accuracy, and safety remain. Prior research has reported promising performance of large language models in medical knowledge tasks and clinical question-answering contexts [19,20,21]. Importantly, high CVR and CVI values in this study should be interpreted as indicators of expert agreement regarding perceived appropriateness, not as definitive evidence of factual correctness, clinical reliability, or guideline concordance.
Although the overall appropriateness rates and CVR/CVI values were high, Fleiss’ kappa for the primary evaluation dimension indicated only slight agreement. This apparent discrepancy should be interpreted in light of the marked ceiling effect and substantial category imbalance in the rating distribution. Because most responses were rated as appropriate, the agreement expected by chance is itself very high, so kappa statistics may underestimate agreement despite high observed concordance. This pattern is consistent with the known kappa paradox and suggests that Fleiss’ kappa should not be interpreted in isolation in datasets with highly imbalanced ratings [22].
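The effect of category imbalance on kappa can be illustrated numerically. The sketch below implements the standard Fleiss’ kappa formula in plain Python and applies it to hypothetical dichotomized counts (not the study data): 20 items, 51 raters, and only 12 “inappropriate” ratings in total. The function name and the example distribution are assumptions made for illustration only.

```python
from typing import List, Tuple


def fleiss_kappa(counts: List[List[int]]) -> Tuple[float, float, float]:
    """Return (kappa, observed agreement P_bar, chance agreement P_e).

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least 2).
    """
    if not counts:
        raise ValueError("counts must contain at least one item")
    n_items = len(counts)
    n_categories = len(counts[0])
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two raters per item are required")
    for row in counts:
        if len(row) != n_categories or sum(row) != n_raters:
            raise ValueError("all items need the same categories and rater count")

    # Per-item agreement: share of rater pairs that agree on the item.
    pairs = n_raters * (n_raters - 1)
    p_items = [(sum(c * c for c in row) - n_raters) / pairs for row in counts]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall share of ratings in each category.
    total = n_items * n_raters
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_categories)]
    p_e = sum(p * p for p in p_cat)

    if p_e == 1.0:  # all ratings in one category: kappa is undefined
        return float("nan"), p_bar, p_e
    return (p_bar - p_e) / (1.0 - p_e), p_bar, p_e


# Hypothetical, highly skewed ratings: [appropriate, inappropriate] per item.
N_RATERS = 51
inappropriate_per_item = [0] * 12 + [1] * 6 + [3] * 2
skewed_counts = [[N_RATERS - k, k] for k in inappropriate_per_item]
kappa, p_bar, p_e = fleiss_kappa(skewed_counts)
```

In this hypothetical example the observed agreement is about 0.98, yet kappa is close to zero, because the chance agreement is also about 0.98 when almost all ratings fall into a single category.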
However, evaluations based solely on overall appropriateness may be insufficient to capture all aspects of response quality. For this reason, responses in the present study were additionally assessed in terms of scientific accuracy, adequacy, and clarity. The results demonstrated that clarity received the highest scores, whereas adequacy and scientific accuracy were relatively lower. This suggests that while large language models are capable of generating generally appropriate and fluent content, some responses may remain limited in terms of depth and clinical detail.
Analysis of responses rated as inappropriate revealed that the most common issue was insufficient information. This suggests that the main limitation of the AI-generated responses was not necessarily overtly incorrect information, but rather limited depth or insufficient clinical contextualization. This may be particularly relevant for questions requiring individualized guidance, such as sports participation, bracing, treatment selection, follow-up, or specialist referral. Nevertheless, the overall findings suggest that the evaluated model was generally rated by experts as producing understandable and patient-oriented responses. The high proportion of responses rated as appropriate supports the potential use of such systems as supportive educational tools, provided that their outputs are reviewed and contextualized by healthcare professionals.
In this context, although many responses were perceived as acceptable by experts, some responses did not fully reflect the level of detail required for clinical decision-making. Consistent with the literature, large language models may generate responses that appear clinically plausible and understandable, but may still be incomplete or contextually limited in clinical settings [1,23]. One important clinical implication of this limitation is the possibility that patients may base decisions on information that appears accurate but lacks critical details. In particular, insufficient information regarding treatment options, follow-up requirements, or potential risks may lead to delays in care or inappropriate decision-making. At the same time, the ability of these systems to provide accessible and understandable information may represent an opportunity for improving patient education and awareness. From a social perspective, AI-based systems offer rapid, accessible, and comprehensible health information, which may facilitate access to medical knowledge and support health literacy. However, reliance on such systems in the presence of incomplete or contextually limited information may also contribute to the development of inappropriate health behaviors. Therefore, AI-generated information should be used as a supportive tool in patient education and should be interpreted in conjunction with professional medical guidance.
The expert opinion findings further suggest an important distinction between educational usefulness and unsupervised use. Although most experts considered the responses useful or partially useful for patient education, only a small proportion supported presenting them to patients without physician guidance. This finding should be interpreted together with the adequacy and risk findings. Most responses were perceived as harmless or low risk, but insufficient information was the most frequently reported issue, and adequacy and scientific accuracy received relatively lower ratings than clarity. Therefore, experts’ caution regarding unsupervised use may not reflect a perception of immediate harm, but rather concern that AI-generated responses may lack sufficient clinical depth, individualization, or contextual guidance for direct patient use. A response that is generally appropriate for one patient may be incomplete or misleading for another if it is applied without considering individual clinical characteristics.
In line with the DECIDE-AI framework, one of the key strengths of this study is the evaluation of responses not only in terms of accuracy but also with respect to potential clinical risk. The assessment of AI systems should consider not only accuracy but also clinical risk and safety dimensions [6,17]. The findings of this study showed that the majority of responses were classified as harmless or low risk, and no responses were identified as high risk. These results suggest that, within the evaluated question set and expert-rating framework, the responses were generally perceived as low risk. However, the presence of a limited number of responses categorized as moderate risk suggests that, in certain cases, the outputs should be interpreted with caution from a clinical perspective. This approach is consistent with studies emphasizing that AI applications should be integrated into clinical settings in a controlled and responsible manner [24,25].
Overall, this study suggests that large language models may have potential as supportive tools in patient education, provided that their outputs are interpreted with appropriate clinical contextualization and are not used as standalone sources of medical advice. These systems should therefore not be regarded as independent decision-makers, but rather as supplementary tools that may support patient education when used alongside professional medical guidance.
An important methodological consideration is the absence of a formal comparison with guideline-based reference answers. The AI-generated responses were evaluated by experts but were not directly compared with recommendations from established scoliosis guidelines or consensus documents. Therefore, high expert agreement should not be interpreted as definitive evidence of guideline concordance or complete clinical validity.
This study has several limitations. First, the evaluation was based on AI-generated responses obtained within a specific time frame. As large language model outputs may evolve over time with model updates, the findings may vary across different time points. Second, the evaluation relied on expert opinions, and the patient perspective was not directly assessed. Therefore, important dimensions such as patient understanding, emotional response, trust, and perceived usefulness could not be evaluated. Third, the study was limited to a single large language model, which may restrict the generalizability of the findings to other AI systems. Fourth, the question set was not developed using direct patient interviews, focus groups, or validated patient-information-needs instruments. Although the questions were based on publicly accessible patient-oriented scoliosis sources, they may not fully represent the diversity of real-world patient concerns. Fifth, this study focused on expert-perceived response quality and potential clinical risk, but did not assess the impact of AI-generated information on patient behavior, decision-making processes, or clinical outcomes. Sixth, the study did not include a formal gold-standard comparison. Although responses were evaluated by experienced clinicians, they were not directly compared with predefined guideline-based reference answers derived from established scoliosis recommendations. Finally, the dichotomization of Likert scores for CVR/CVI calculation and the marked ceiling effect in the rating distribution may have reduced the granularity of expert evaluations. Therefore, CVR/CVI findings were interpreted together with the original ordinal rating distributions and sensitivity analyses.