Article

AI-Powered Physiotherapy: Evaluating LLMs Against Students in Clinical Rehabilitation Scenarios

by Ioanna Michou, Athanasios Fouras, Dionysia Chrysanthakopoulou, Marina Theodoritsi, Savina Mariettou, Sotiria Stellatou and Constantinos Koutsojannis *
Health Physics & Computational Intelligence Lab, Department of Physiotherapy, School of Rehabilitation Sciences, University of Patras, 26504 Patras, Greece
*
Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(3), 1165; https://doi.org/10.3390/app16031165
Submission received: 9 December 2025 / Revised: 19 January 2026 / Accepted: 21 January 2026 / Published: 23 January 2026

Abstract

Generative artificial intelligence (GenAI), particularly large language models (LLMs) such as ChatGPT and DeepSeek, is transforming healthcare by enhancing clinical decision-making, education, and patient interaction. This exploratory study compares the quality of responses from ChatGPT (GPT-4.1) and DeepSeek-V2 with those of 90 final-year physiotherapy students in Greece on 60 clinical questions across four rehabilitation domains: low back pain, multiple sclerosis, frozen shoulder, and knee osteoarthritis (15 questions per domain). The questions spanned basic knowledge, diagnosis, alternative treatments, and rehabilitation practices. The responses were evaluated for relevance, accuracy, clarity, completeness, and consistency with clinical practice guidelines (CPGs), with an emphasis on conceptual understanding. This study makes three novel contributions: (i) it benchmarks LLMs in physiotherapy-specific domains (low back pain, multiple sclerosis, frozen shoulder, and knee osteoarthritis) underrepresented in prior AI-health evaluations; (ii) it directly compares LLM written response quality to student performance under exam constraints; and (iii) it highlights the improvement potential for education, complementing ChatGPT’s established role in physician decision support. The results indicate that the LLMs produced higher-quality written responses than students in most domains, particularly in global response quality and the conceptual depth of written responses, highlighting their potential as educational aids for knowledge-based tasks, although written quality is not equivalent to clinical expertise. These findings position AI in physiotherapy as a supportive tool rather than a replacement for hands-on clinical skills, and they raise the question of whether GenAI could transform physiotherapy practice by augmenting, rather than threatening, human-centered care; its potential as a knowledge support tool in education remains pending validation in clinical contexts.
This study explores these findings, compares them with the related work, and discusses whether GenAI will transform or threaten physiotherapy practice. Ethical considerations, limitations, and future directions, including AI voice assistants and AI characters, are addressed.

1. Introduction

The integration of artificial intelligence (AI), particularly generative AI (GenAI) and large language models (LLMs), into healthcare has ushered in a new era of innovation, transforming clinical practice, education, and research [1]. LLMs, such as OpenAI’s ChatGPT (GPT-4) and DeepSeek AI, leverage advanced natural language processing (NLP) to generate human-like text, offering applications for clinical decision support, patient education, and professional training [2,3]. In physiotherapy, a discipline that blends scientific knowledge with hands-on clinical skills, AI technologies have shown significant promise in areas such as motion analysis, wearable technologies, and predictive modeling of patient outcomes [4]. However, the application of LLMs in physiotherapy remains relatively underexplored, particularly in their ability to address complex clinical queries compared to human expertise. This study is one of the first to systematically evaluate LLMs against physiotherapy students in clinical question answering, shedding light on their potential to augment or challenge traditional physiotherapy education and practice.
This comparison is inherently asymmetrical—LLMs draw from vast corpora without time constraints, while students operate under cognitive loads similar to exams—yet it serves as a valuable benchmark for LLMs’ utility in supporting, rather than supplanting, student learning [5]. Amid GPT’s growing use in clinical decision-making [6], this work’s relevance lies in its first systematic evaluation of LLMs’ and physiotherapy students’ responses, addressing a gap in allied health education where AI applications lag behind medicine.
In physiotherapy, AI has advanced motion analysis and wearables [4], but LLMs remain underexplored: Safran and Yildirim (2025) found 80% CPG alignment in musculoskeletal queries [2], while Lowe (2024) advocates curriculum integration [7]. Yet, no studies systematically compare LLMs’ responses to students in rehabilitation domains, a gap this work addresses by evaluating written response quality for low back pain, multiple sclerosis, frozen shoulder, and knee osteoarthritis amid broader healthcare AI growth [1,6]. Recent works in 2025, such as Gürses et al. [8] on personalized knee OA programs and Bitterman et al. [9] on PM&R accuracy, affirm LLM gains but overlook student comparisons—our focus (Table 1). Greek physiotherapy curricula, aligned with EU standards yet emphasizing hands-on musculoskeletal training, may shape student performance differently from global variants [10,11,12,13].
Beyond text-based LLMs, emerging AI technologies such as AI voice digital assistants and AI characters are gaining traction in healthcare [19]. AI voice assistants, such as Amazon’s Alexa or Google Assistant, enable hands-free interaction, delivering real-time clinical information, guiding patients through exercises, or assisting clinicians with documentation [20]. For instance, voice-activated systems can provide verbal instructions for home-based rehabilitation programs, improving patient adherence and accessibility [21]. AI characters, or virtual avatars powered by GenAI, offer interactive and personalized experiences by simulating patient or clinician roles in training scenarios [22]. These avatars can engage in dynamic conversations, model clinical interactions, or serve as virtual patients for student practice, potentially revolutionizing physiotherapy education and telehealth delivery [23]. While these technologies are in their nascent stages, their potential to enhance engagement, accessibility, and scalability in physiotherapy warrants further investigation.
The application of AI in medical education has seen rapid growth across various disciplines, providing a broader context for understanding its role in physiotherapy. In medical schools, LLMs have been employed to support case-based learning, simulate patient interactions, and assist with diagnostic reasoning [5]. For example, studies in medical education have demonstrated that LLMs such as ChatGPT can generate accurate responses to clinical vignettes, achieving performance levels comparable to third-year medical students in certain domains [6]. In nursing education, AI-driven virtual patients have been used to train students in clinical decision-making, improving their ability to handle complex scenarios without the logistical challenges of real-world clinical placements [24]. Similarly, in allied health professions such as occupational therapy, AI tools have been integrated into curricula to enhance students’ understanding of evidence-based practice and patient communication [25]. These applications highlight the versatility of AI in medical education, but physiotherapy’s unique emphasis on manual therapy and patient rapport presents distinct challenges and opportunities for AI integration.
In clinical practice, AI has been increasingly adopted to support decision-making and optimize patient outcomes across healthcare fields. In radiology, AI algorithms assist in image interpretation, achieving diagnostic accuracy comparable to or surpassing human experts in specific tasks [26]. In cardiology, predictive models powered by AI analyze patient data to forecast adverse events, enabling early interventions [27]. In physiotherapy, AI-driven tools such as wearable sensors and motion capture systems have been used to monitor patient progress and tailor rehabilitation programs [4]. However, the use of LLMs in clinical physiotherapy remains limited, with most studies focusing on their ability to provide general medical knowledge rather than domain-specific expertise [2]. The current study addresses this gap by evaluating LLMs in the context of physiotherapy-specific clinical scenarios, comparing their performance to that of final-year students.
Physiotherapy relies heavily on evidence-based clinical practice guidelines (CPGs) to manage conditions such as low back pain, multiple sclerosis, frozen shoulder, and knee osteoarthritis [28]. Recent studies indicate that ChatGPT aligns with CPGs in approximately 80% of musculoskeletal queries, although it struggles with context-specific scenarios, such as lumbosacral radicular pain [2,14]. Concerns about AI “hallucinations” (fabricated or incorrect responses), data privacy, and ethical integration remain significant barriers to widespread adoption [29]. In other health professions, LLMs have demonstrated variable performance. For instance, in orthopedics, ChatGPT achieved 45–73.6% accuracy on examination questions, with notable limitations in nuanced or context-dependent scenarios [18]. Domain-specific models, such as BioMedLM, have shown superior performance in medical education by leveraging specialized training datasets [17]. These findings suggest that while LLMs hold significant potential to augment healthcare delivery, they cannot fully replace human judgment, empathy, or hands-on skills, particularly in fields such as physiotherapy.
The emergence of AI voice assistants and AI characters introduces new possibilities for clinical practice and education. In medical education, AI voice assistants have been used to simulate patient interactions, allowing students to practice communication skills in a controlled environment [30]. For example, a study in medical training found that voice-activated AI systems improved students’ ability to elicit patient histories by providing real-time feedback [31]. In physiotherapy, voice assistants guided patients through home-based exercises, offering real-time corrections and motivational prompts to enhance adherence [21]. AI characters, meanwhile, have been explored in mental health and nursing education to simulate patient interactions, helping to foster empathy and clinical reasoning [22,23]. These technologies could be adapted for physiotherapy to create virtual patients with diverse clinical presentations, enabling students to practice assessment and treatment planning in a low-risk setting. However, their effectiveness in physiotherapy-specific contexts remains largely untested, highlighting the need for further research.
This study is among the first to compare ChatGPT and DeepSeek against 90 final-year physiotherapy students in answering 60 clinical questions across four rehabilitation domains: low back pain, multiple sclerosis, frozen shoulder, and knee osteoarthritis (15 questions per domain covering basic knowledge, diagnosis, alternative treatments, and rehabilitation practices). These domains were selected due to their prevalence and diverse clinical demands [10,11,12,13]. Low back pain, affecting up to 80% of adults, requires precise assessment and management [10]. Multiple sclerosis involves progressive neurological deficits, necessitating tailored interventions [14]. Frozen shoulder and knee osteoarthritis demand biomechanical expertise and pain management [12,13]. The study aimed to accomplish the following:
1. Evaluate LLMs and student responses for quality and conceptual understanding.
2. Assess LLMs’ potential as educational and clinical tools in physiotherapy.
3. Explore whether GenAI could support physiotherapy, including the role of AI voice assistants and characters, with a focus on its potential to augment core clinical practices rather than disrupt them.
The question of whether GenAI signals the “end” of physiotherapy is both timely and complex. While LLMs excel in knowledge-based tasks, physiotherapy’s reliance on manual skills, patient rapport, and individualized care suggests a complementary rather than substitutive role for AI [32]. In this article, we present the methodology, results, and discussion, comparing the findings with the related work in medical education and clinical practice and exploring the future applications of AI in physiotherapy.

2. Methodology

2.1. Study Design

This cross-sectional observational study compared the performance of ChatGPT (GPT-4), DeepSeek, and 90 final-year physiotherapy students in Greece. The study involved 60 clinical questions across four rehabilitation domains: low back pain, multiple sclerosis, frozen shoulder, and knee osteoarthritis (15 questions per domain), enabling a robust comparison in an education-focused setting. Figure 1 illustrates the methodological architecture.

2.2. Participants

Ninety students in their final year of a 4-year Bachelor of Science in Physiotherapy program at two Greek universities (University of Patras, n = 60; Metropolitan College, a private university in Patras, n = 30) participated. Their advanced training ensured familiarity with the targeted domains, making them a proxy for entry-level professionals. Participants were recruited voluntarily and provided informed consent. No exclusion criteria were applied beyond program enrollment. The study was approved by the University of Patras’ ethics committee, adhering to ethical guidelines.

2.3. Question Development

Sixty clinical questions were developed by three experienced physiotherapists specializing in musculoskeletal and neurological rehabilitation. Each domain included 15 questions across four subcategories:
  • Basic knowledge (4–5 questions): these questions covered etiology, pathophysiology, and epidemiology (e.g., “How is knee osteoarthritis diagnosed clinically and radiographically (e.g., X-ray, MRI)?”).
  • Diagnosis (3–4 questions): these questions focused on assessment techniques and diagnostic criteria (e.g., “Which standardized scales (e.g., EDSS, MSIS-29) do you use to quantify disability in MS patients?”).
  • Alternative treatments (3–4 questions): these questions addressed complementary therapies, such as acupuncture or hydrotherapy (e.g., “What alternative treatments benefit frozen shoulder?”).
  • Rehabilitation practices (3–4 questions): these questions emphasized evidence-based interventions, such as exercise or manual therapy (e.g., “Can Low Back Pain be prevented through lifestyle modifications or exercise?”).
The questions reflected real-world scenarios, requiring integration of theoretical knowledge, clinical reasoning, and CPGs [10]. They were pilot-tested with five practicing physiotherapists for clarity and relevance. The full listings, including examples and CPG alignments, are provided in Supplementary Materials S1.
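The structure of the question bank described above can be sketched as a simple data model. This is an illustrative sketch only: the domain and subcategory labels and counts come from the text, while the `QuestionItem` class and the example entries are hypothetical, not the authors’ actual instrument.

```python
# Illustrative sketch of the 60-question bank: 4 domains x 15 questions,
# each tagged with one of four subcategories. The class and example items
# are hypothetical; the labels are taken from the study description.
from dataclasses import dataclass

DOMAINS = ["low back pain", "multiple sclerosis", "frozen shoulder", "knee osteoarthritis"]
SUBCATEGORIES = ["basic knowledge", "diagnosis", "alternative treatments", "rehabilitation practices"]

@dataclass
class QuestionItem:
    domain: str        # one of DOMAINS
    subcategory: str   # one of SUBCATEGORIES
    text: str          # the clinical question as posed to students and LLMs

# Two example entries drawn from the questions quoted in the text.
bank = [
    QuestionItem("knee osteoarthritis", "basic knowledge",
                 "How is knee osteoarthritis diagnosed clinically and radiographically (e.g., X-ray, MRI)?"),
    QuestionItem("multiple sclerosis", "diagnosis",
                 "Which standardized scales (e.g., EDSS, MSIS-29) do you use to quantify disability in MS patients?"),
]
```

A flat tagged list like this makes the per-domain and per-subcategory aggregation used in the Results straightforward to reproduce.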

2.4. Data Collection

Data collection occurred between March and May 2025. Students answered the 60 questions independently in a controlled, proctored setting during a 90 min session simulating final exams; no external resources (e.g., textbooks, internet) were permitted, and collaboration was prohibited. This contrasts with LLMs’ instantaneous, corpus-based generation, further highlighting the asymmetrical comparison [as discussed in the Limitations]. The same questions were input into ChatGPT (GPT-4) and DeepSeek using the default settings, without fine-tuning. The inputs were standardized for consistency, and the responses were anonymized to prevent bias during evaluation.

2.5. LLM Query Protocol

ChatGPT (GPT-4; version dated October 2024) and DeepSeek (V2, released January 2025) were accessed via official APIs (OpenAI Playground for GPT-4o; DeepSeek platform) on 15–20 April 2025. The prompts were standardized as single-turn, neutral queries: “As a physiotherapy expert, provide a comprehensive, evidence-based answer to: [full question text]. Base responses on current clinical guidelines.” The temperature was set to 0.7 for balanced creativity, with a maximum of 500 tokens per response. Each question was queried once per model (n = 60 total per LLM), selecting the initial output to simulate real-time use and control costs, although this may underrepresent output variability (SD < 0.1 in pilot multi-queries); multi-run averaging is recommended for high-stakes applications [3]. To mitigate prompt sensitivity, the questions were input as neutral single-round prompts, without iterative refinement or role-playing instructions. No multi-turn interactions occurred; the first output per model was selected as final, ensuring consistency but potentially underestimating optimized performance. The full prompt templates are available in the Supplementary Materials Figure S1.
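The standardized single-turn query protocol above can be sketched in code. This is a minimal sketch under stated assumptions: the request-dictionary shape follows the common chat-completion convention, and the `build_request` helper and model identifier are illustrative, not the authors’ actual scripts.

```python
# Hypothetical sketch of the standardized single-turn query protocol.
# The template text and the temperature/max-token settings are from the
# Methods; the helper name and request shape are assumptions.

PROMPT_TEMPLATE = (
    "As a physiotherapy expert, provide a comprehensive, evidence-based "
    "answer to: {question}. Base responses on current clinical guidelines."
)

def build_request(question: str, model: str = "gpt-4") -> dict:
    """Assemble the fixed, single-turn request used identically for every question."""
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": PROMPT_TEMPLATE.format(question=question)}
        ],
        "temperature": 0.7,  # balanced creativity, per the protocol
        "max_tokens": 500,   # response length cap, per the protocol
        "n": 1,              # one query per question; first output kept as final
    }

request = build_request(
    "How is knee osteoarthritis diagnosed clinically and radiographically?"
)
```

Fixing the template, sampling parameters, and single-turn structure in one place is what makes the 60 queries per model directly comparable.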

2.6. Evaluation

Two independent raters, who were physiotherapists with over 10 years of clinical experience, evaluated the responses using a 5-point Likert scale (1 = poor, 5 = excellent) across five criteria: relevance, accuracy, clarity, completeness, and consistency with CPGs. CPG consistency (alignment with guidelines [28]) was scored separately but contributed equally to the global quality average. A composite “global quality” score was calculated by averaging the criteria scores with equal weighting (1:1:1:1:1), reflecting their collective contribution to the response utility in educational assessments [5]. This approach prioritizes simplicity and inter-rater consistency; however, in clinical practice, accuracy and CPG alignment may hold higher value than clarity, a nuance for future weighted models. The conceptual understanding was assessed separately on a 5-point scale, evaluating the explanation depth and correctness. The raters were blinded to the response origins (student vs. LLM) via anonymized IDs and shuffled presentation to mask stylistic cues (e.g., verbosity). The inter-rater reliability was high (Cohen’s κ = 0.82, 95% CI: 0.78–0.86 across criteria), supporting the score validity; discrepancies were resolved via consensus. ICC (2, 1) = 0.85 for global quality. The scoring criteria (relevance, accuracy, clarity, completeness, and consistency with CPGs) were selected to assess the response quality in a standardized manner. However, these criteria may inherently favor the structured and verbose nature of LLM outputs over the concise, exam-constrained responses of students, potentially introducing a framework bias that disadvantages the latter group.
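The two scoring computations described above — the equal-weight global quality composite and inter-rater agreement via Cohen’s κ — can be sketched as follows. This is an illustrative stdlib implementation (unweighted κ), not the authors’ analysis code; the criterion key names are assumptions.

```python
# Sketch of the evaluation arithmetic: equal-weight composite over the five
# criteria, and unweighted Cohen's kappa for two-rater agreement.
from collections import Counter

CRITERIA = ["relevance", "accuracy", "clarity", "completeness", "cpg_consistency"]

def global_quality(scores: dict) -> float:
    """Equal-weight (1:1:1:1:1) average of the five criterion scores."""
    return sum(scores[c] for c in CRITERIA) / len(CRITERIA)

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal category proportions.
    expected = sum(counts_a[k] * counts_b[k] for k in set(counts_a) | set(counts_b)) / (n * n)
    return (observed - expected) / (1 - expected)
```

Note that the reported κ = 0.82 on ordinal Likert data would often use a weighted variant; the unweighted form above is the simplest agreement measure consistent with the description.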

2.7. Statistical Analysis

Descriptive statistics (mean, standard deviation) were computed for each group (students, ChatGPT, DeepSeek) across the five criteria and global quality for each domain and subcategory. The unit of analysis was the individual response (n = 180 per group post-aggregation), with per-student/question averages for group scores. Independence was assumed at the raters’ level (robust to nesting via Kruskal–Wallis); while mixed-effects models could account for student clustering, aggregation minimized this (ICC < 0.1), aligning with similar studies [5]. One-way ANOVA or non-parametric tests (e.g., Kruskal–Wallis) compared group performance, with post hoc tests (e.g., Tukey’s HSD) used to identify differences. Conceptual understanding was analyzed similarly. Significance was set at p < 0.05, with effect sizes (Cohen’s d) calculated. Analyses were conducted using Jamovi 2.6.44 statistical software. The analyses assumed independence at the response level; however, the LLM responses were generated once per question and replicated across evaluations, introducing potential non-independence. To address this, we treated each rater’s score as an independent observation (n = 120 per group) and used robust non-parametric tests (Kruskal–Wallis) where normality assumptions were violated, minimizing clustering effects.
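The two key quantities in this analysis plan — Cohen’s d and the Kruskal–Wallis H statistic — can be sketched with the standard library alone. In practice one would use statistical software (the study used Jamovi) or a library such as SciPy; this stdlib sketch, with no tie correction for H, is illustrative only.

```python
# Sketch of the effect-size and non-parametric test computations.
# Pure-stdlib; a real analysis would use Jamovi or scipy.stats.
import statistics

def cohens_d(group1: list, group2: list) -> float:
    """Cohen's d using the pooled sample standard deviation."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = statistics.fmean(group1), statistics.fmean(group2)
    v1, v2 = statistics.variance(group1), statistics.variance(group2)
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (m1 - m2) / pooled_sd

def kruskal_wallis_h(*groups) -> float:
    """Kruskal–Wallis H statistic for k independent groups (no tie correction)."""
    pooled = sorted((x, gi) for gi, g in enumerate(groups) for x in g)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n:
        # Assign average ranks to runs of tied values.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of 1-based ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        i = j
    return (12 / (n * (n + 1))) * sum(
        rs * rs / len(g) for rs, g in zip(rank_sums, groups)
    ) - 3 * (n + 1)
```

The H statistic is then compared against a chi-squared distribution with k − 1 degrees of freedom to obtain the p-value.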

3. Results

The study generated data from students and LLMs on 60 questions across four domains: low back pain, multiple sclerosis, frozen shoulder, and knee osteoarthritis. Table 2 presents the mean scores for the five evaluation criteria (relevance, accuracy, clarity, completeness, and global quality), and Table 3 summarizes the conceptual understanding scores. Post hoc error analysis (n = 6 questions/group/domain) revealed that the LLMs’ excellence in low back pain/knee osteoarthritis stemmed from comprehensive CPG citations (e.g., 90% coverage vs. students’ 60%), while students erred less in frozen shoulder diagnosis (e.g., accurate biomechanical tests in 80% vs. LLMs’ 70%), likely due to curriculum emphasis [23]. A post hoc ablation on 20 questions compared non-fine-tuned models to BioMedLM (fine-tuned on biomedical corpora [17]): the fine-tuned variants improved the CPG consistency by 15% (M = 4.6 vs. 4.1), particularly in multiple sclerosis (Cohen’s d = 0.4), suggesting targeted tuning enhances the relevance.
The post hoc adjustments confirmed consistency (e.g., alternative treatments, p < 0.001 across domains), matching the method-specified tests (no violations of normality according to the Shapiro–Wilk test). ANOVA showed significant differences in global quality across groups (F(2, 177) = 68.4, p < 0.001), with LLMs achieving higher written response quality than students in all domains except frozen shoulder diagnosis. Post hoc Tukey’s HSD tests revealed that ChatGPT excelled in low back pain (M = 4.65, SD = 0.4) and knee osteoarthritis (M = 4.70, SD = 0.3), while DeepSeek led in multiple sclerosis (M = 4.70, SD = 0.3). For frozen shoulder diagnosis, no significant differences emerged (p = 0.12–0.18, Table 4): the students’ mean global quality score (M = 3.90, SD = 0.5) was comparable to ChatGPT’s (M = 4.45, SD = 0.4, p = 0.15) and DeepSeek’s (M = 4.35, SD = 0.4, p = 0.18), and students achieved comparable accuracy (M = 4.1 ± 0.6), underscoring their practical training advantages in this subcategory, potentially reflecting their training in practical diagnostic skills. The subcategory analyses (Table 4) showed significant LLM superiority in basic knowledge, alternative treatments, and rehabilitation practices across all domains (p < 0.001–0.003), with the largest differences in alternative treatments (Cohen’s d: 1.3 for low back pain, 1.5 for multiple sclerosis, 1.2 for frozen shoulder, and 1.4 for knee osteoarthritis).
As shown in Figure 2, ChatGPT achieved the highest global quality scores in low back pain (M = 4.65, SD = 0.4) and knee osteoarthritis (M = 4.70, SD = 0.3), while DeepSeek led in multiple sclerosis (M = 4.70, SD = 0.3). The students performed comparably in frozen shoulder diagnosis (p = 0.12). The histograms in Figure 2 visually confirm that the LLMs outperformed students in clarity (ChatGPT: M = 4.6–4.8; DeepSeek: M = 4.5–4.8; students: M = 3.4–3.8) and completeness (ChatGPT: M = 4.4–4.7; DeepSeek: M = 4.4–4.7; students: M = 3.3–3.7) across all domains, likely due to their structured and comprehensive responses [3]. DeepSeek showed particular strength in multiple sclerosis, possibly due to its advanced architecture [33]. The students excelled in diagnosis for musculoskeletal conditions, particularly frozen shoulder (M = 4.1 ± 0.6 for accuracy), reflecting their practical training [4]. For the alternative treatment questions, LLMs provided more comprehensive responses, with large effect sizes (Cohen’s d = 1.2–1.5 across domains).

4. Discussion

4.1. Evidence-Based Findings

The superior performance of untrained versions of ChatGPT and DeepSeek in terms of their global quality and conceptual understanding of written responses underscores their potential as transformative tools in physiotherapy education and practice, although this does not equate to clinical expertise in areas such as manual therapy or patient rapport. These findings align with Wang et al. (2025) [14], who reported ChatGPT’s 80% adherence to CPGs in musculoskeletal rehabilitation, although it struggled with context-specific cases such as lumbosacral radicular pain [2,25]. DeepSeek’s strength in multiple sclerosis may be attributed to its enhanced context window and architecture, which allow for better handling of complex, domain-specific queries [14]. LLMs’ strengths in consistency (e.g., low SDs in Table 2 and Table 3) align with robustness techniques such as PSSCL’s progressive sample selection with contrastive loss [34] and UCRT’s uniform consistency selection for noisy training [35,36], which mitigate the variability in label-noisy scenarios. These frameworks contextualize our results, suggesting fine-tuned LLMs could further enhance reliability in physiotherapy queries. The students’ strong performance in diagnosis questions, particularly for musculoskeletal conditions such as frozen shoulder, reflects the practical hands-on training embedded in physiotherapy curricula, which emphasizes clinical assessment and patient interaction [4]. The lack of significant differences in frozen shoulder diagnosis is particularly noteworthy, likely reflecting students’ hands-on exposure to biomechanical assessments [12], where LLMs’ textual synthesis yields to experiential judgment. This suggests targeted LLM augmentation (e.g., for knowledge recall) alongside human strengths in diagnostics. 
The observed differences likely stem from LLMs’ optimization for textual completeness rather than students’ holistic skills, rendering the comparison exploratory rather than definitive [3,14]. This bias underscores the need for hybrid evaluations incorporating practical elements. LLMs excelled in knowledge-heavy subcategories (e.g., alternative treatments, Cohen’s d = 1.2–1.5) via vast training data enabling holistic synthesis [3], but they faltered in contextual diagnostics. The students’ comparability in frozen shoulder diagnosis reflects hands-on training in musculoskeletal assessment [4], where experiential intuition trumps textual recall—error patterns showed LLMs ‘hallucinating’ rare contraindications (5% rate) that were absent in student responses. Aligning with the 2025 findings [37,38,39], LLMs supplement clinical knowledge; however, our student benchmark reveals diagnostic parity gaps.
Comparisons with other health professions reveal both similarities and differences in AI’s role. In medical education, LLMs have been integrated into case-based learning, achieving accuracy rates of 70–90% on clinical vignettes, often surpassing third-year medical students in knowledge-based tasks [5]. For example, a study by Plackett et al. (2025) found that ChatGPT outperformed medical students in answering multiple-choice questions on pharmacology and pathology, though it struggled with open-ended clinical reasoning tasks [31]. In nursing education, AI-driven virtual patients have been used to simulate complex scenarios, improving students’ diagnostic and communication skills [28]. These findings parallel the current study’s results, where LLMs excelled in structured knowledge-based responses but lacked the nuanced clinical judgment developed through practical experience.
In allied health fields, such as occupational therapy and speech therapy, AI has been used to support evidence-based practice and patient education. For instance, AI tools in occupational therapy have been employed to generate personalized home exercise programs, improving patient adherence and outcomes [37]. In speech therapy, LLMs have been used to develop conversational agents that assist patients with language rehabilitation, offering real-time feedback and personalized exercises [34,38]. These applications suggest that LLMs could similarly enhance physiotherapy by automating routine tasks, such as generating patient education materials or documenting treatment plans, thereby allowing clinicians to focus on hands-on care.
In clinical practice, AI’s role extends beyond education to direct patient care. In orthopedics, LLMs have achieved 55–93% accuracy on examination questions, with performance varying based on question complexity and context [35]. For example, a physiotherapy-specific LLM trained on the CPGs and case studies could provide tailored recommendations for managing conditions such as low back pain or knee osteoarthritis, improving alignment with evidence-based practice.
While these findings offer insights into LLM augmentation in physiotherapy education, the student cohort—final-year BSc students from two Greek institutions—may not fully represent broader contexts, such as international curricula or practicing clinicians. Implications for global education should thus be interpreted cautiously, pending multi-site validation.

4.2. Implications for AI-Augmented Learning

These results inform Vygotsky’s zone of proximal development [39], where LLMs scaffold knowledge recall (e.g., superior in basic/alternative subcategories, Table 5), freeing students for advanced reasoning—evidenced by diagnostic parity (Table 4) and error modes like LLM overconfidence in alternatives (5% hallucination rate). This generates hypotheses on failure modes (e.g., LLMs misleading in context-dependent rehab planning without experiential cues) and student strengths (e.g., in musculoskeletal diagnostics, p > 0.05), offering educators data for hybrid curricula: integrate LLMs for textual augmentation but prioritize hands-on modules for judgment. While exploratory, these insights address underexplored gaps in physiotherapy education [17], seeding equitable designs like RAG-constrained LLMs for balanced comparisons [40].

4.3. Future Implications

Emerging extensions, such as AI voice assistants and characters, while promising for interactive training [5,6,7,8,9], lie beyond this study’s textual scope and warrant dedicated empirical investigation (see Section 5).
The question of whether GenAI threatens physiotherapy is premature and oversimplified; instead, our findings suggest it could transform practice through augmentation, without supplanting essential skills such as manual therapy or patient rapport. Claims of professional disruption are not supported by this study, which did not evaluate hands-on or interactive elements. As McComiskie (2023) argues, physiotherapy’s core strengths—manual therapy, patient rapport, and individualized care—remain inherently human-centric [32]. LLMs and other AI tools can augment practice by streamlining administrative tasks, generating evidence-based recommendations, and supporting patient education [4]. For example, AI could automate the creation of home exercise programs, reducing clinician workload while improving patient engagement.
Beyond generation, LLMs face deployment hurdles, such as long-term stability (e.g., model drift from updates) and adaptability to variables (e.g., patient comorbidities). ‘Calibration’ techniques—aligning AI confidence with accuracy—are vital [19]. Drawing from structural health monitoring, Liu et al.’s (2025) temperature compensation via ultrasonic waves [41] inspires analogous mechanisms for physiotherapy AI, such as real-time adjustments for disease progression or environmental factors, enhancing safety in dynamic clinical contexts.
To mitigate hallucinations (e.g., fabricated CPGs in 5% of LLM responses per the error analysis), clinicians and educators can employ prompt engineering—e.g., “Respond only with evidence from [specific guideline]”—or cross-verify with tools such as PubMed [21]. For data privacy, adherence to the GDPR can be achieved through anonymized queries and on-device models; practical guidance includes AI literacy training in curricula to foster critical evaluation and ensure human oversight in patient-facing applications [29,42]. RAG frameworks also merit exploration for balanced comparisons that make uncertainty explicit in safety-critical decisions [40].
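The guideline-constraining prompt quoted above can be packaged as a small helper. The sketch below is a minimal illustration under assumed wording; the guideline name and question are hypothetical placeholders, not endorsements of a specific source.

```python
# Sketch of a guideline-constrained prompt, as suggested in the text.
# Guideline name and question are illustrative placeholders.
def constrained_prompt(question: str, guideline: str) -> str:
    """Wrap a clinical question so the model must cite a named guideline
    and explicitly refuse rather than guess when evidence is absent."""
    return (
        f"Respond only with evidence from {guideline}. "
        f"If the guideline does not address the question, reply "
        f"'Not covered by the cited guideline.'\n\n"
        f"Question: {question}"
    )

prompt = constrained_prompt(
    "What exercise dosage is recommended for chronic low back pain?",
    "the NICE NG59 low back pain guideline",  # hypothetical choice
)
print(prompt)
```

The explicit fallback instruction is the key design choice: it gives the model a sanctioned way to abstain, which is preferable to a fabricated citation in safety-critical use.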

4.4. Practical Safe Use Guidelines

To leverage LLMs safely, educators should integrate them as knowledge aids (e.g., for query drafting), with students verifying outputs against CPGs [28]; clinicians might use them for initial planning but should always confirm via hands-on assessment. A prudent approach is to start with low-risk tasks, monitor for hallucinations (e.g., via dual review), and prioritize AI literacy training that sustains human oversight—enhancing, not replacing, core skills [29,32]. For instance, basic knowledge drills are well suited to LLMs (where they excel, d = 1.2–1.5), whereas diagnostics should be deferred to experiential practice, guarding against misleadingly polished text in nuanced scenarios.
The findings’ generalizability is constrained by the participant pool: final-year students from Greek institutions whose training may differ from international curricula or from practicing clinicians’ experience. For instance, students’ exam-style responses may not reflect the adaptive decision-making of professionals in diverse settings, and cultural factors (e.g., patient interaction norms) or curricular variances (e.g., a greater neurology focus elsewhere) could modulate outcomes; international replications are therefore essential.

4.5. Limitations

The study’s sample size (n = 90) and focus on students from two institutions in one country limit generalizability to practicing physiotherapists and global contexts; variations in educational systems, cultural factors, and clinical exposure could alter comparative performance, so broader multi-site studies with multi-level learners or practicing professionals are warranted to reveal progression effects. The design compares non-equivalent entities (textual AI vs. constrained learners) and excludes vital physiotherapy elements such as patient interaction, manual skills, physical assessment, real-world decision-making, and clinician perspectives; patient outcomes were not measured, which limits the evidence for clinical claims. The 60 questions, while comprehensive, exclude physical assessment skills critical to physiotherapy practice, and the LLMs were not fine-tuned, potentially affecting performance in nuanced scenarios. Subjective evaluation criteria may introduce bias despite reliability measures such as Cohen’s kappa: the reliance on two raters’ Likert-scale assessments, without a gold-standard reference (e.g., pre-validated expert responses) or objective metrics (e.g., BLEU/ROUGE for completeness), introduces subjectivity risks, and future work should incorporate automated NLP benchmarks for enhanced objectivity. Additionally, the framework’s emphasis on clarity and completeness may bias results toward LLMs, which generate expansive, polished text, while the students worked under time-constrained written exam conditions without access to resources; this could underestimate student performance in real-world scenarios where they leverage practical experience or references. Future studies should incorporate balanced criteria, such as efficiency and adaptability, to mitigate this.
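For readers unfamiliar with the reliability statistic mentioned above, the sketch below computes Cohen's kappa for two raters from first principles. The ratings are hypothetical and merely illustrate the calculation, not the study's actual data.

```python
# Sketch: Cohen's kappa for two raters' Likert scores.
# Ratings below are hypothetical illustrations only.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement beyond chance for two equal-length rating lists."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items rated identically
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence of the two raters
    ca, cb = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum((ca[lab] / n) * (cb[lab] / n) for lab in labels)
    return (observed - expected) / (1 - expected)

a = [5, 4, 4, 3, 5, 4, 2, 3]  # hypothetical rater A
b = [5, 4, 3, 3, 5, 4, 2, 4]  # hypothetical rater B
print(round(cohens_kappa(a, b), 3))  # → 0.652
```

Values above roughly 0.6 are conventionally read as substantial agreement, though kappa remains sensitive to the marginal distribution of ratings, another reason the text's call for objective benchmarks is warranted.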
Fundamentally, the LLM–student comparison is asymmetrical and potentially unfair: LLMs benefit from pretrained knowledge and polished output without human constraints such as fatigue or time pressure. This favors criteria such as clarity and completeness, potentially inflating the LLM scores; redesigns matching conditions (e.g., LLM ‘fatigue’ simulations) are essential for validity. A further limitation is the non-independence of LLM responses, as each was generated once per question rather than independently for each ‘participant’; this may inflate effect sizes and overstate significance, and future studies should employ bootstrapping or mixed-effects models to account for the dependency. Finally, LLM responses were generated via basic prompts, which may not capture peak robustness; sensitivity to phrasing [3] could alter the results, warranting sensitivity analyses in future work.
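A question-level bootstrap of the kind suggested above could be sketched as follows. The per-question score differences are hypothetical, and the percentile interval is one of several defensible bootstrap CI constructions.

```python
# Sketch: percentile bootstrap CI for a mean paired score difference,
# resampling at the question level to respect non-independence.
# The difference values below are hypothetical.
import random

def bootstrap_ci(diffs, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile confidence interval for the mean of paired differences."""
    rng = random.Random(seed)  # seeded for reproducibility
    n = len(diffs)
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical per-question differences (LLM score minus mean student score)
diffs = [0.9, 1.1, 0.7, 1.0, 0.8, 1.2, 0.6, 1.0, 0.9, 1.1]
lo, hi = bootstrap_ci(diffs)
print(f"95% CI for mean difference: [{lo:.2f}, {hi:.2f}]")
```

If the interval excludes zero after question-level resampling, the advantage is less likely to be an artifact of treating one generated response as many independent observations.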

4.6. Future Directions

Future studies should test LLMs in clinical settings, incorporating physical assessments and real-world patient interactions. Fine-tuning LLMs with physiotherapy-specific data could enhance their accuracy and relevance [19], and full ablation studies with physiotherapy-fine-tuned LLMs (e.g., on CPG datasets) could quantify the gains over base models. RAG frameworks merit exploration for balanced comparisons that make uncertainty explicit in safety-critical decisions [40]. Longitudinal studies are needed to assess AI’s impact on patient outcomes, such as recovery rates and quality of life; cross-cultural studies could assess LLM utility across diverse educational systems, alongside ‘compensation’ protocols [15,41] for LLM stability in variable rehabilitation scenarios. Exploring AI’s role in inter-professional collaboration, such as coordinating care between physiotherapists, physicians, and occupational therapists, could further enhance its utility. Although not evaluated here, AI voice assistants (e.g., for real-time exercise guidance) and AI characters (e.g., as virtual patients for training) hold promise based on related fields [15,16] and should be evaluated for their ability to improve patient adherence and student training outcomes [6,8]. Future research should explore several key areas:
  • AI Voice Assistants: evaluate their effectiveness in delivering real-time rehabilitation guidance, particularly for home-based programs, and their impact on patient adherence and outcomes.
  • AI Characters: investigate their use as virtual patients in physiotherapy training, assessing their impact on clinical reasoning, empathy, and student confidence.
  • Clinical Integration: test LLMs in real-world physiotherapy settings, incorporating patient-specific factors such as comorbidities or psychosocial barriers.
  • Fine-Tuning: develop physiotherapy-specific LLMs using CPGs, clinical case studies, and real-world data to enhance accuracy and relevance.
  • Long-Term Impact: assess AI’s effects on patient outcomes, such as recovery rates, functional improvements, and patient satisfaction.
These emerging modalities warrant physiotherapy-specific trials to quantify their impacts on adherence and clinical reasoning, distinct from the knowledge-focused LLM evaluation in this study [6,7,8,9,16].

5. Conclusions

ChatGPT and DeepSeek achieved higher written response quality than students in knowledge-based clinical question-answering tasks, highlighting GenAI’s potential in physiotherapy education and practice as an adjunct to, rather than a substitute for, human clinical judgment. The key contributions include demonstrating LLMs’ superior written knowledge synthesis in rehabilitation queries, offering a foundation for educational tools, while affirming students’ diagnostic strengths, thus guiding ethical AI integration beyond general clinical support. AI voice assistants and characters offer speculative avenues for enhancing accessibility and training. However, without assessment of clinical performance, patient outcomes, or clinician input, these findings do not support disruption narratives. Physiotherapy’s human core—manual skills, empathy, and individualized care—remains irreplaceable, positioning GenAI as an enhancement rather than a threat. Ethical integration and domain-specific AI development are crucial to maximizing the benefits while addressing risks such as hallucinations and data privacy. By leveraging AI as a complementary tool, physiotherapy can evolve to meet the demands of modern healthcare while preserving its human-centric foundation.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app16031165/s1, S1: Declaration of consent and questionnaire, Figure S1: Histograms of mean global quality scores (1–5 scale, with SD error bars) across domains, post-sample expansion.

Author Contributions

Conceptualization, C.K. and D.C.; methodology, S.M.; validation, A.F., S.S. and I.M.; formal analysis, S.M.; investigation, I.M. and M.T.; resources, D.C.; writing—original draft preparation, I.M.; writing—review and editing, A.F. and M.T.; visualization, D.C.; supervision, C.K. All authors have read and agreed to the published version of the manuscript.

Funding

No funding was received for this research.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Calderone, A.; Perin, P.; Orsenigo, C.; Turolla, A. The impact of artificial intelligence on diagnosis and treatment of neurological disorders. Biomedicines 2024, 12, 2415. [Google Scholar] [CrossRef]
  2. Safran, E.; Yildirim, S. A cross-sectional study on ChatGPT’s alignment with clinical practice guidelines in musculoskeletal rehabilitation. BMC Musculoskelet. Disord. 2025, 26, 411. [Google Scholar] [CrossRef]
  3. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
  4. Davids, J.; Lidströmer, N.; Ashrafian, H. Artificial Intelligence for Physiotherapy and Rehabilitation; Springer eBooks; Springer: Berlin/Heidelberg, Germany, 2021; pp. 1–19. Available online: https://link.springer.com/rwe/10.1007/978-3-030-58080-3_339-1 (accessed on 15 October 2025).
  5. Mavrych, V.; Yousef, E.M.; Yaqinuddin, A.; Bolgova, O. Large language models in medical education: A comparative cross-platform evaluation in answering histological questions. Med. Educ. Online 2025, 30, 2534065. [Google Scholar] [CrossRef] [PubMed]
  6. Salam, M.A.; Imtiaz, S.; Lucy, I.B. Artificial Intelligence in Medical Education: Opportunities and Challenges. Bangladesh J. Infect. Dis. 2025, 12, 189–194. [Google Scholar] [CrossRef]
  7. Lowe, S.W. The role of artificial intelligence in Physical Therapy education. Bull. Fac. Phys. Ther. 2024, 29, 13. [Google Scholar] [CrossRef]
  8. Gürses, Ö.A.; Özüdoğru, A.; Tuncay, F.; Kararti, C. The Role of Artificial Intelligence Large Language Models in Personalized Rehabilitation Programs for Knee Osteoarthritis: An Observational Study. J. Med. Syst. 2025, 49, 73. [Google Scholar] [CrossRef]
  9. Bitterman, J.; D’Angelo, A.; Holachek, A.; Eubanks, J.E. Advancements in large language model accuracy for answering physical medicine and rehabilitation board review questions. PM R 2025, 17, 1091–1096. [Google Scholar] [CrossRef]
  10. Koes, B.W.; van Tulder, M.; Thomas, S. Diagnosis and treatment of low back pain. BMJ 2006, 332, 1430–1434. [Google Scholar] [CrossRef]
  11. Compston, A.; Coles, A. Multiple sclerosis. Lancet 2008, 372, 1502–1517. [Google Scholar] [CrossRef]
  12. Kelley, B.J.; Rodriguez, M. Frozen shoulder: Evidence and a proposed model guiding rehabilitation. J. Orthop. Sports Phys. Ther. 2009, 39, 135–148. [Google Scholar] [CrossRef]
  13. McAlindon, T.E.; Bannuru, R.R.; Sullivan, M.C. OARSI guidelines for the non-surgical management of knee osteoarthritis. Osteoarthr. Cartil. 2014, 22, 363–388. [Google Scholar] [CrossRef] [PubMed]
  14. Wang, S.; Wang, Y.; Jiang, L.; Chang, Y.; Zhang, S.; Zhao, K.; Chen, L.; Gao, C. Assessing the clinical support capabilities of ChatGPT 4o and ChatGPT 4o mini in managing lumbar disc herniation. Eur. J. Med. Res. 2025, 30, 45. [Google Scholar] [CrossRef] [PubMed]
  15. Arbel, Y.; Gimmon, Y.; Shmueli, L. Evaluating the Potential of Large Language Models for Vestibular Rehabilitation Education: A Comparison of ChatGPT, Google Gemini, and Clinicians. Phys. Ther. 2025, 105, pzaf010. [Google Scholar] [CrossRef] [PubMed]
  16. Lai, X.; Chen, J.; Lai, Y.; Huang, S.; Cai, Y.; Sun, Z.; Wang, X.; Pan, K.; Gao, Q.; Huang, C. Using Large Language Models to Enhance Exercise Recommendations and Physical Activity in Clinical and Healthy Populations: Scoping Review. JMIR Med. Inform. 2025, 13, e59309. [Google Scholar] [CrossRef]
  17. Hao, J.; Yao, Z.; Tang, Y.; Remis, A.; Wu, K.; Yu, X. Artificial Intelligence in Physical Therapy: Evaluating ChatGPT’s Role in Clinical Decision Support for Musculoskeletal Care. Ann. Biomed. Eng. 2025, 53, 9–13. [Google Scholar] [CrossRef]
  18. Zhang, C.; Liu, S.; Zhou, X.; Zhou, S.; Tian, Y.; Wang, S.; Xu, N.; Li, W. Examining the Role of Large Language Models in Orthopedics: Systematic Review. J. Med. Internet Res. 2024, 26, e59607. [Google Scholar] [CrossRef]
  19. Ermolina, A.; Tiberius, V. Voice-Controlled Intelligent Personal Assistants in Health Care: International Delphi Study. J. Med. Internet Res. 2021, 23, e25312. [Google Scholar] [CrossRef]
  20. Khalid, U.B.; Naeem, M.; Stasolla, F.; Syed, M.H.; Abbas, M.; Coronato, A. Impact of AI-Powered Solutions in Rehabilitation Process: Recent Improvements and Future Trends. Int. J. Gen. Med. 2024, 17, 943–969. [Google Scholar] [CrossRef]
  21. Hatem, R.; Simmons, B.; Thornton, J.E. A call to address AI “hallucinations” and how healthcare professionals can mitigate their risks. Cureus 2023, 15, e44720. [Google Scholar] [CrossRef]
  22. Zidoun, Y.; Mardi, A.E. Artificial Intelligence (AI)-Based simulators versus simulated patients in undergraduate programs: A protocol for a randomized controlled trial. BMC Med. Educ. 2024, 24, 1260. [Google Scholar] [CrossRef]
  23. O’Connor, S. Virtual Reality and Avatars in Health care. Clin. Nurs. Res. 2019, 28, 523–528. [Google Scholar] [CrossRef] [PubMed]
  24. Foronda, C.L.; Fernandez-Burgos, M.; Nadeau, C.; Kelley, C.N.; Henry, M.N. Virtual Simulation in Nursing Education: A Systematic Review Spanning 1996 to 2018. Simul. Healthc. 2020, 15, 46–54. [Google Scholar] [CrossRef] [PubMed]
  25. Buch, V.H.; Ahmed, I.; Maruthappu, M. Artificial intelligence in medicine: Current trends and future possibilities. Br. J. Gen. Pract. 2018, 68, 143–144. [Google Scholar] [CrossRef] [PubMed]
  26. Rajpurkar, P.; Irvin, J.; Zhu, K.; Yang, B.; Mehta, H.; Duan, T.; Duan, T.; Ding, D.; Bagul, A.; Langlotz, C.P.; et al. Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS Med. 2018, 15, e1002686. [Google Scholar] [CrossRef]
  27. Attia, Z.I.; Noseworthy, P.A.; Lopez-Jimenez, F.; Asirvatham, S.J.; Deshmukh, A.J.; Gersh, B.J.; Carter, R.E.; Yao, X.; Rabinstein, A.A.; Erickson, B.J.; et al. An artificial intelligence-enabled ECG algorithm for the identification of patients with atrial fibrillation during sinus rhythm: A retrospective analysis of outcome prediction. Lancet 2019, 394, 861–867. [Google Scholar] [CrossRef]
  28. Plater, J.C.; Baxter, G.D.; Wood, L.C.; Mueller, J.; Fisher, T. Development of evidence-based standards for inpatient physiotherapy services: A systematic review and content analysis of clinical practice guidelines. BMJ Open 2024, 14, e088692. [Google Scholar] [CrossRef]
  29. Topol, E.J. Deep Medicine: How Artificial Intelligence Can Make Healthcare Human Again; Basic Books: New York, NY, USA, 2023. [Google Scholar]
  30. Laranjo, L.; Dunn, A.G.; Tong, H.L.; Kocaballi, A.B.; Chen, J.; Bashir, R.; Surian, D.; Gallego, B.; Magrabi, F.; Lau, A.Y.S.; et al. Conversational agents in healthcare: A systematic review. J. Am. Med. Inform. Assoc. 2018, 25, 1248–1258. [Google Scholar] [CrossRef]
  31. Plackett, R.; Kassianos, A.P.; Mylan, S.; Kambouri, M.; Raine, R.; Sheringham, J. The effectiveness of using virtual patient educational tools to improve medical students’ clinical reasoning skills: A systematic review. BMC Med. Educ. 2022, 22, 365. [Google Scholar] [CrossRef]
  32. McComiskie, E. AI: The Future of Physio? The Chartered Society of Physiotherapy. 2023. Available online: https://www.csp.org.uk/frontline/article/ai-future-physio (accessed on 20 October 2025).
  33. Singh, S.; Bansal, S.; Saddik, A.; Saini, M. From ChatGPT to DeepSeek AI: A Comprehensive Analysis of Evolution, Deviation, and Future Implications in AI-Language Models. arXiv 2025, arXiv:2504.03219. [Google Scholar] [CrossRef]
  34. Green, J. Artificial intelligence in communication sciences and disorders: Introduction to the forum. J. Speech Lang. Hear. Res. 2024, 67, 3093–3097. [Google Scholar] [CrossRef]
  35. Zhang, Q.; Zhu, Y.; Cordeiro, F.; Chen, Q. PSSCL: A progressive sample selection framework with contrastive loss designed for noisy labels. Pattern Recognit. 2025, 161, 111284. [Google Scholar] [CrossRef]
  36. Zhang, Q.; Chen, Q. A Two-Stage Noisy Label Learning Framework with Uniform Consistency Selection and Robust Training. Appl. Intell. 2026, 56, 21. [Google Scholar] [CrossRef]
  37. Bulan, P.M.P.; Kuizon, D.A.Y.; Casaña, R.S.E.; Fuentes, C.G.; Pestaño, N.Y.; Suerte, J.R.O. A Scoping Review on Artificial Intelligence in Occupational Therapy. OTJR 2025. Online ahead of print. [Google Scholar] [CrossRef]
  38. Masters, K. Submitting artificial intelligence in health professions education papers to Medical Teacher. Med. Teach. 2024, 46, 1256–1257. [Google Scholar] [CrossRef]
  39. Vygotsky, L.S. Mind in Society: The Development of Higher Psychological Processes; Harvard University Press: Cambridge, MA, USA, 1978. [Google Scholar]
  40. Meskó, B.; Topol, E.J. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. npj Digit. Med. 2023, 6, 120. [Google Scholar] [CrossRef]
  41. Liu, W.; Hu, J.; Lv, F.; Tang, Z. A new method for long-term temperature compensation of structural health monitoring by ultrasonic guided wave. Measurement 2025, 252, 117310. [Google Scholar] [CrossRef]
  42. European Union. General Data Protection Regulation (GDPR). 2016. Available online: https://gdpr.eu (accessed on 25 November 2025).
Figure 1. Overview of study pipeline, highlighting asymmetrical LLM-student arms.
Figure 2. Histograms of mean global quality scores (1–5 scale, with SD error bars) across domains, post-sample expansion.
Table 1. Key studies on LLMs in rehabilitation and physiotherapy.
Study | Focus | Methods | Key Findings | Distinction from Our Work
Safran and Yildirim (2025) [2] | CPG alignment in musculoskeletal rehabilitation | LLM evaluation on 50 queries | 80% adherence, gaps in radicular pain | Domain-specific (physiotherapy students vs. LLMs in multi-domain open questions)
Wang et al. (2025) [14] | Lumbar disc herniation management | ChatGPT-4o on vignettes | High accuracy in basics, low in personalization | Includes student comparison, broader rehabilitation domains
Mavrych et al. (2025) [5] | Histological Q&A in medical education | Cross-platform LLM evaluation | GPT-4o > students in accuracy | Physiotherapy-specific, conceptual depth focus
Gürses et al. (2025) [8] | Personalized rehabilitation programs for knee OA | Observational study on LLMs | Role in enhancing adherence and personalization | Student–LLM benchmark in education, not just programs
Bitterman et al. (2025) [9] | PM&R question accuracy | LLM vs. board review questions | LLMs 85% accurate in basics | Direct student comparison under exam constraints
Arbel et al. (2025) [15] | LLMs vs. clinicians in vestibular rehabilitation | Comparative response evaluation | LLMs comparable in routine cases | Focus on students as entry-level proxies
Lai et al. (2025) [16] | LLM-enhanced exercise recommendations | Scoping review of prompts | Improved personalization in clinical populations | Exploratory written quality in diverse domains
Hao et al. (2025) [17] | ChatGPT accuracy in PT decision support | Clinical query testing for musculoskeletal care | 78% alignment with guidelines | Multi-LLM and student-inclusive design
Zhang et al. (2024) [18] | LLMs in orthopedics | Exam question evaluation | 55–93% accuracy | Broader rehabilitation and conceptual understanding emphasis
Lowe (2024) [7] | AI in PT education | Review of integration | Potential for case-based learning | Empirical benchmarking vs. review
Table 2. Mean scores (1–5) for response quality across rehabilitation domains (updated for n = 90 students).
Domain | Group | Relevance | Accuracy | Clarity | Completeness | CPG Consistency | Global Quality
Low Back Pain | Students | 3.8 ± 0.6 | 3.7 ± 0.7 | 3.6 ± 0.6 | 3.5 ± 0.7 | 3.6 ± 0.7 | 3.65 ± 0.6
Low Back Pain | ChatGPT | 4.6 ± 0.4 | 4.5 ± 0.5 | 4.8 ± 0.3 | 4.7 ± 0.4 | 4.5 ± 0.5 | 4.65 ± 0.4
Low Back Pain | DeepSeek | 4.4 ± 0.5 | 4.3 ± 0.5 | 4.6 ± 0.4 | 4.5 ± 0.5 | 4.4 ± 0.5 | 4.45 ± 0.4
Multiple Sclerosis | Students | 3.6 ± 0.7 | 3.5 ± 0.8 | 3.4 ± 0.7 | 3.3 ± 0.8 | 3.5 ± 0.8 | 3.45 ± 0.7
Multiple Sclerosis | ChatGPT | 4.3 ± 0.5 | 4.2 ± 0.6 | 4.5 ± 0.4 | 4.4 ± 0.5 | 4.2 ± 0.6 | 4.35 ± 0.5
Multiple Sclerosis | DeepSeek | 4.7 ± 0.3 | 4.6 ± 0.4 | 4.8 ± 0.3 | 4.7 ± 0.3 | 4.6 ± 0.4 | 4.70 ± 0.3
Frozen Shoulder | Students | 4.0 ± 0.5 | 4.1 ± 0.6 | 3.8 ± 0.6 | 3.7 ± 0.6 | 4.0 ± 0.5 | 3.90 ± 0.5
Frozen Shoulder | ChatGPT | 4.4 ± 0.4 | 4.3 ± 0.5 | 4.6 ± 0.4 | 4.5 ± 0.4 | 4.3 ± 0.5 | 4.45 ± 0.4
Frozen Shoulder | DeepSeek | 4.3 ± 0.5 | 4.2 ± 0.5 | 4.5 ± 0.4 | 4.4 ± 0.5 | 4.2 ± 0.5 | 4.35 ± 0.4
Knee Osteoarthritis | Students | 3.7 ± 0.6 | 3.6 ± 0.7 | 3.5 ± 0.6 | 3.4 ± 0.7 | 3.6 ± 0.7 | 3.55 ± 0.6
Knee Osteoarthritis | ChatGPT | 4.7 ± 0.3 | 4.6 ± 0.4 | 4.8 ± 0.3 | 4.7 ± 0.3 | 4.6 ± 0.4 | 4.70 ± 0.3
Knee Osteoarthritis | DeepSeek | 4.5 ± 0.4 | 4.4 ± 0.5 | 4.6 ± 0.4 | 4.5 ± 0.4 | 4.4 ± 0.5 | 4.50 ± 0.4
Table 3. Conceptual understanding scores (1–5) across rehabilitation domains (updated for n = 90).
Domain | Students | ChatGPT | DeepSeek
Low Back Pain | 3.7 ± 0.6 | 4.6 ± 0.4 | 4.4 ± 0.5
Multiple Sclerosis | 3.4 ± 0.7 | 4.3 ± 0.5 | 4.7 ± 0.3
Frozen Shoulder | 3.9 ± 0.5 | 4.4 ± 0.4 | 4.3 ± 0.5
Knee Osteoarthritis | 3.6 ± 0.6 | 4.7 ± 0.3 | 4.5 ± 0.4
Table 4. p-values for subcategory comparisons across rehabilitation domains.
Domain | Subcategory | p-Value (ANOVA/Kruskal–Wallis) | Post Hoc (Students vs. ChatGPT) | Post Hoc (Students vs. DeepSeek)
Low Back Pain | Basic Knowledge | <0.001 | <0.001 | <0.001
Low Back Pain | Diagnosis | 0.002 | 0.003 | 0.005
Low Back Pain | Alternative Treatments | <0.001 | <0.001 | <0.001
Low Back Pain | Rehabilitation Practices | <0.001 | <0.001 | <0.001
Multiple Sclerosis | Basic Knowledge | <0.001 | <0.001 | <0.001
Multiple Sclerosis | Diagnosis | 0.001 | 0.002 | <0.001
Multiple Sclerosis | Alternative Treatments | <0.001 | <0.001 | <0.001
Multiple Sclerosis | Rehabilitation Practices | <0.001 | <0.001 | <0.001
Frozen Shoulder | Basic Knowledge | 0.001 | 0.002 | 0.003
Frozen Shoulder | Diagnosis | 0.12 | 0.15 | 0.18
Frozen Shoulder | Alternative Treatments | <0.001 | <0.001 | <0.001
Frozen Shoulder | Rehabilitation Practices | 0.002 | 0.003 | 0.004
Knee Osteoarthritis | Basic Knowledge | <0.001 | <0.001 | <0.001
Knee Osteoarthritis | Diagnosis | 0.003 | 0.004 | 0.006
Knee Osteoarthritis | Alternative Treatments | <0.001 | <0.001 | <0.001
Knee Osteoarthritis | Rehabilitation Practices | <0.001 | <0.001 | <0.001
Note: p < 0.05 significant; non-parametric where needed.
Table 5. Subcategory-level means (SD) and variability (updated for n = 90).
Domain | Subcategory | Students | ChatGPT | DeepSeek
Low Back Pain | Basic Knowledge | 3.8 (0.6) | 4.6 (0.4) | 4.4 (0.5)
Low Back Pain | Diagnosis | 3.9 (0.6) | 4.5 (0.4) | 4.3 (0.5)
Low Back Pain | Alternative Treatments | 3.4 (0.8) | 4.6 (0.4) | 4.5 (0.5)
Low Back Pain | Rehabilitation Practices | 3.6 (0.7) | 4.7 (0.3) | 4.5 (0.4)
Multiple Sclerosis | Basic Knowledge | 3.6 (0.7) | 4.3 (0.5) | 4.7 (0.3)
Multiple Sclerosis | Diagnosis | 3.7 (0.7) | 4.2 (0.6) | 4.6 (0.4)
Multiple Sclerosis | Alternative Treatments | 3.3 (0.8) | 4.4 (0.5) | 4.7 (0.3)
Multiple Sclerosis | Rehabilitation Practices | 3.5 (0.8) | 4.4 (0.5) | 4.7 (0.3)
Frozen Shoulder | Basic Knowledge | 4.0 (0.5) | 4.4 (0.4) | 4.3 (0.5)
Frozen Shoulder | Diagnosis | 4.1 (0.6) | 4.3 (0.5) | 4.2 (0.5)
Frozen Shoulder | Alternative Treatments | 3.7 (0.6) | 4.5 (0.4) | 4.4 (0.5)
Frozen Shoulder | Rehabilitation Practices | 3.8 (0.6) | 4.5 (0.4) | 4.4 (0.5)
Knee Osteoarthritis | Basic Knowledge | 3.7 (0.6) | 4.7 (0.3) | 4.5 (0.4)
Knee Osteoarthritis | Diagnosis | 3.8 (0.7) | 4.6 (0.4) | 4.4 (0.5)
Knee Osteoarthritis | Alternative Treatments | 3.4 (0.7) | 4.7 (0.3) | 4.5 (0.4)
Knee Osteoarthritis | Rehabilitation Practices | 3.5 (0.7) | 4.7 (0.3) | 4.5 (0.4)
Note: SD in parentheses; based on aggregated per-question scores.
