1. Introduction
Artificial intelligence (AI) is increasingly transforming healthcare, with applications ranging from diagnostic assistance to perioperative management and rehabilitation planning. Recent studies have demonstrated the ability of AI-driven models to improve decision-making in surgical and postoperative care, particularly through large language models (LLMs) and natural language processing (NLP) techniques [1,2]. For instance, AI has been successfully applied in thoracic surgery to enhance diagnostic precision and predict postoperative complications, underscoring its expanding role in surgical specialties. Given this growing body of evidence supporting AI’s integration into clinical workflows, this study assesses the feasibility of using LLMs to generate personalized rehabilitation programs for patients undergoing major head and neck surgery [1,2,3]. Among AI-driven tools, LLMs such as ChatGPT, DeepSeek, Gemini 2, and Copilot have gained particular attention for their ability to generate medical information and answer clinical questions based on extensive training datasets [4,5].
LLMs in medicine are trained on large repositories of medical literature, clinical guidelines, and anonymized patient case information, allowing them to synthesize complex information and generate sophisticated, structured responses that align with evidence-based practice [4,5]. While LLMs are already being used to streamline administrative tasks and automate clinical documentation, there is growing interest in their potential beyond these functions, particularly in developing tailored rehabilitation programs for patients recovering from complex surgeries [6].
Head and neck surgery (HNS) often involves oncological patients who require extensive surgical intervention and complex reconstructions, particularly in cases of squamous cell carcinoma (SCC), basal cell carcinoma (BCC), and other aggressive cutaneous or mucosal malignancies. These procedures range from simple local excisions to complex reconstructions involving microvascular free flaps, mandibulectomies, and radical neck dissections. While these surgeries are often lifesaving, they are also associated with significant morbidity, frequently resulting in functional impairments that affect speech, swallowing, mastication, airway patency, and facial nerve function [7]. The severity of these postoperative complications depends on patient factors, the extent of tissue resection, and the need for adjuvant radiotherapy or chemotherapy, which can further impair function and prolong recovery [8]. Microvascular free flap reconstructions, while providing vascularized soft tissue and bony support, are not without risk, with higher complication rates in previously irradiated fields [8,9]. Additionally, surgical site infections, wound dehiscence, hematomas, and percutaneous fistulae are known postoperative challenges following extensive resections, often requiring prolonged wound management and staged interventions [7,10].
Given this, rehabilitation following HNS must be highly individualized and often requires a multidisciplinary approach, with input from surgeons, speech therapists, dietitians, physiotherapists, and psychologists, to optimize functional and aesthetic outcomes [11]. Current rehabilitation protocols for HNS patients are primarily guided by standard clinical pathways and expert consensus [9,10]. However, these approaches may not fully account for patient-specific variables such as age, comorbidities, tumor location, and the extent of resection. This is an area where AI and LLMs could offer a novel solution by drawing on large datasets to predict patient rehabilitation trajectories, recommend structured postoperative care instructions, and provide real-time clinical support to surgeons, ultimately enhancing patient outcomes.
This study aims to assess the feasibility of using LLMs to generate personalized rehabilitation programs for patients undergoing HNS by comparing the recommendations of multiple LLMs against expert-reviewed standards. By doing so, this study seeks to determine whether AI-generated rehabilitation programs can complement traditional approaches and improve postoperative outcomes. Additionally, this study will assess the readability and quality of LLM-generated responses using validated tools, such as the DISCERN score and Flesch Reading Ease Score, to provide insights into their applicability in patient education and clinical decision-making.
2. Materials and Methods
This experimental study was designed to assess the capacity of LLMs to generate personalized rehabilitation programs for patients undergoing head and neck surgery. Given that the study utilized hypothetical clinical scenarios rather than real patient data, formal ethical approval was not required. However, all research activities adhered to the ethical principles of the Declaration of Helsinki.
Ten hypothetical clinical scenarios were developed to encompass a range of head and neck surgical procedures. These scenarios were designed by three senior clinicians with expertise in head and neck surgery, ensuring clinical relevance and representativeness. Each scenario was based on real-world cases commonly encountered in surgical practice, incorporating key variables such as patient demographics, comorbidities, tumor location, extent of resection, and reconstructive techniques. The scenarios were iteratively refined through expert review to ensure they reflected a broad spectrum of complexity and postoperative rehabilitation needs. To standardize AI model evaluation, a consistent prompt structure was used for all scenarios, focusing on the recommended rehabilitation program and necessary supportive care measures.
The following prompt was appended to each hypothetical scenario: “What rehabilitation program and services are recommended for this patient to optimize their postoperative recovery, considering their specific clinical condition, surgical procedure, and potential risk factors?”.
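For illustration, the sketch below shows how each query was assembled by appending this standardized prompt to a scenario; the example scenario text and the helper function name are hypothetical and do not correspond to any of the ten study cases.

```python
# Minimal illustration of query assembly: the fixed evaluation prompt is
# appended to a hypothetical clinical scenario. The example scenario below
# is an illustrative placeholder, not one of the ten study cases.

REHAB_PROMPT = (
    "What rehabilitation program and services are recommended for this "
    "patient to optimize their postoperative recovery, considering their "
    "specific clinical condition, surgical procedure, and potential risk factors?"
)

def build_query(scenario: str) -> str:
    """Concatenate a clinical scenario with the standardized rehabilitation prompt."""
    return f"{scenario.strip()}\n\n{REHAB_PROMPT}"

example_scenario = (
    "A 67-year-old man with oral squamous cell carcinoma underwent segmental "
    "mandibulectomy with free fibula flap reconstruction and selective neck dissection."
)
print(build_query(example_scenario))
```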
2.1. Generation of Rehabilitation Programs
Four state-of-the-art LLMs (ChatGPT-4o, DeepSeek V3, Gemini 2, and Copilot) were selected based on their relevance and accessibility for medical applications. The choice of these models aimed to ensure a balanced comparison between well-established AI tools and emerging alternatives, allowing for a comprehensive evaluation of their suitability in generating rehabilitation programs. ChatGPT-4o was included due to its demonstrated accuracy in medical contexts and its strong performance in previous studies assessing AI-generated clinical recommendations [1,2,3,4,5]. DeepSeek V3 was selected as an emerging model designed for complex reasoning, offering a distinct approach to AI-generated text with an emphasis on scientific and technical content. Gemini 2, developed by Google, was incorporated for its strong capabilities in language understanding and its optimization for producing highly readable outputs, which could be beneficial in patient-centered rehabilitation programs. Copilot, integrated within Microsoft’s ecosystem, was chosen to evaluate how general-purpose AI models perform in specialized medical tasks. By including these four LLMs, this study aims to capture a broad spectrum of AI capabilities, from established models optimized for clinical reasoning to newer competitors designed for diverse applications. To ensure consistency in evaluation, all models were prompted simultaneously on 10 February 2025, using identical clinical scenarios.
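A minimal sketch of how such a batch of queries could be scripted for one of the models (ChatGPT-4o via the OpenAI Python client) is shown below; programmatic API access is an assumption made purely for illustration, as the study does not report how the models were accessed, and the other three models would be queried through their own interfaces or SDKs.

```python
# Illustrative sketch only: querying ChatGPT-4o programmatically via the
# OpenAI Python client, assuming API access. The other models in the study
# would be queried through their respective interfaces or SDKs.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_rehab_plan(query: str) -> str:
    """Send one assembled scenario-plus-prompt and return the model's reply."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}],
    )
    return response.choices[0].message.content

# plans = [get_rehab_plan(build_query(s)) for s in scenarios]  # ten scenarios
```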
2.2. Evaluation of LLM-Generated Rehabilitation Programs
Three senior clinicians (WMR, SN, and RC), with over 40 years of cumulative experience, independently assessed the rehabilitation programs suggested by the LLMs for accuracy, clinical relevance, and appropriateness. The clinicians were selected based on their expertise in head and neck surgery and postoperative rehabilitation, each with over ten years of experience in tertiary academic centers. Their selection aimed to ensure a high level of clinical judgment and familiarity with current rehabilitation protocols. To minimize bias, evaluations were conducted independently, and inter-rater agreement was analyzed to assess consistency in their assessments. Similar studies evaluating AI-generated medical recommendations have also relied on small expert panels for initial validation [9]. Future research may benefit from expanding the number of reviewers to enhance generalizability and account for inter-rater variability. The assessments were conducted using a five-point Likert scale (1 = Poor, 5 = Excellent) to measure the quality of the responses quantitatively.
To further assess the readability, clarity, and reliability of the generated rehabilitation plans, standardized text analysis metrics were applied, including the following:
DISCERN Score—evaluating the reliability and quality of health information.
Flesch Reading Ease Score—measuring the ease of comprehension.
Flesch–Kincaid Grade Level—indicating the educational level required to understand the text.
Coleman–Liau Index—assessing the complexity of the generated text.
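For reference, the three readability indices listed above are computed from surface features of the text; the sketch below implements their published formulas, with a rough vowel-group syllable counter assumed purely for illustration. The DISCERN score, by contrast, is assigned by human raters using the DISCERN questionnaire and is not derived automatically from the text.

```python
import re

def _count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups (assumed for illustration only)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> dict:
    """Flesch Reading Ease, Flesch-Kincaid Grade Level, and Coleman-Liau Index."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    n_words = max(1, len(words))
    syllables = sum(_count_syllables(w) for w in words)
    letters = sum(len(w) for w in words)

    asl = n_words / sentences        # average sentence length (words per sentence)
    asw = syllables / n_words        # average syllables per word
    L = letters / n_words * 100      # letters per 100 words
    S = sentences / n_words * 100    # sentences per 100 words

    return {
        "flesch_reading_ease": 206.835 - 1.015 * asl - 84.6 * asw,
        "flesch_kincaid_grade": 0.39 * asl + 11.8 * asw - 15.59,
        "coleman_liau_index": 0.0588 * L - 0.296 * S - 15.8,
    }

# Example: readability(llm_generated_plan_text)
```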
2.3. Data Analysis
The Likert scale ratings assigned by the senior clinicians were analyzed for inter-rater reliability using Cohen’s kappa coefficient to assess consistency in evaluation. A descriptive statistical analysis was performed to compare the mean scores of each LLM, identifying trends in accuracy and clinical utility. A comparative analysis of the readability and quality assessment metrics was also conducted, examining variations among models and their suitability for patient and clinician use.
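A minimal sketch of this analysis is shown below, assuming the Likert ratings are stored as one list of ten scenario scores per rater and per model; it uses pairwise Cohen’s kappa from scikit-learn and a one-way ANOVA across models from SciPy, which for four models and ten scenarios yields the F(3, 36) structure reported later. Variable names and data shapes are illustrative assumptions.

```python
from itertools import combinations

from scipy.stats import f_oneway
from sklearn.metrics import cohen_kappa_score

def pairwise_kappa(rater_scores: dict[str, list[int]]) -> dict[tuple[str, str], float]:
    """Cohen's kappa for every pair of raters (Likert ratings treated as categorical)."""
    return {
        (a, b): cohen_kappa_score(rater_scores[a], rater_scores[b])
        for a, b in combinations(rater_scores, 2)
    }

def compare_models(model_scores: dict[str, list[float]]):
    """One-way ANOVA across models; 4 models x 10 scenarios gives F(3, 36)."""
    return f_oneway(*model_scores.values())

# Hypothetical usage (values illustrative only):
# pairwise_kappa({"WMR": [5, 4, 5], "SN": [5, 5, 4], "RC": [4, 4, 5]})
# compare_models({"ChatGPT-4o": [...], "DeepSeek V3": [...], "Gemini 2": [...], "Copilot": [...]})
```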
4. Discussion
This study explores the potential role of large language models (LLMs) in generating rehabilitation programs for patients undergoing head and neck surgery. While AI has been increasingly applied in clinical decision support and administrative tasks, its use in postoperative rehabilitation planning remains an emerging area. By comparing multiple LLMs (ChatGPT-4o, DeepSeek V3, Gemini 2, and Copilot) using standardized clinical scenarios and expert evaluation, this study provides insights into the strengths and limitations of AI-generated rehabilitation recommendations. The dual assessment of clinical relevance and readability offers a structured approach to understanding how these models perform in generating patient-centered rehabilitation plans. These findings contribute to the ongoing discussion on integrating AI into multidisciplinary care, highlighting potential applications and areas for further refinement.
The integration of AI-generated rehabilitation plans into clinical practice could enhance multidisciplinary decision-making by providing structured, evidence-based recommendations tailored to individual patient needs. In particular, these tools could be valuable in settings with limited access to specialized rehabilitation providers, where AI models may help bridge the gap in expertise. Moreover, by standardizing rehabilitation protocols, AI has the potential to reduce variability in care while promoting adherence to best practices. However, for these technologies to be effectively implemented, further research is needed to assess their adaptability to real-world clinical scenarios, their ability to account for patient-specific factors, and their integration into electronic health record (EHR) systems to streamline their use in daily practice.
The key findings reveal that ChatGPT-4o achieved the highest subjective performance, as evidenced by a Likert scale mean of 4.90 ± 0.32. At the same time, DeepSeek V3 and Gemini 2 recorded moderate scores (4.00 ± 0.82 and 3.90 ± 0.74, respectively), and Copilot performed significantly lower, with a mean of 2.70 ± 0.82. In parallel, the DISCERN score, a measure of the quality and reliability of clinical information, was similar for ChatGPT-4o, DeepSeek V3, and Gemini 2 (ranging from 46.90 ± 2.60 to 47.20 ± 2.53), whereas Copilot consistently scored 40.0. Readability metrics further differentiated the models: Gemini 2 produced outputs with a notably higher Flesch Reading Ease score (12.25 ± 7.22) and a lower Flesch–Kincaid Grade Level (16.60 ± 1.40), suggesting its texts are easier to understand than those of the other models. Meanwhile, the Coleman–Liau Index demonstrated smaller variations among the models, ranging from 18.23 ± 0.79 for Copilot to 19.68 ± 1.32 for ChatGPT-4o. A one-way analysis of variance confirmed that these differences were statistically significant for the Likert scale (F(3, 36) = 16.50, p < 0.001), DISCERN score (F(3, 36) = 24.63, p < 0.001), Flesch Reading Ease (F(3, 36) = 4.35, p ≈ 0.01), and Flesch–Kincaid Grade Level (F(3, 36) = 11.82, p < 0.001), while the Coleman–Liau Index result was marginally non-significant (F(3, 36) = 2.84, p ≈ 0.06). Therefore, ChatGPT-4o’s superior performance, coupled with the enhanced readability of Gemini 2, underscores the nuanced strengths of current AI technologies.
Regarding the subjective evaluation, the superior Likert score of ChatGPT-4o suggests that its outputs are perceived as highly acceptable and coherent by clinical experts. This observation is consistent with the emerging literature that emphasizes the growing capability of large language models to solve complex clinical problems and even outperform traditional clinical decision-making in certain scenarios. The high rating may reflect ChatGPT-4o’s advanced natural language processing abilities, which enable it to generate nuanced and contextually appropriate recommendations [12,13,14,15]. While this study involved a limited number of evaluators, all were senior specialists in head and neck surgery, ensuring a high level of clinical expertise. Expanding the panel of reviewers in future studies could help refine the assessment by incorporating a broader range of clinical perspectives. However, given the structured evaluation criteria used, the key performance trends observed across different LLMs are likely to remain consistent.
In contrast, the moderate Likert scores for DeepSeek V3 and Gemini 2 indicate that while these models are competent, they might not capture the same level of subtlety or contextual integration as ChatGPT-4o. However, Copilot’s significantly lower rating likely stems from its design as a general-purpose AI rather than a model optimized for medical applications. Unlike ChatGPT-4o and Gemini 2, which are trained using extensive medical literature, Copilot’s responses were more generic and lacked clinical depth. Additionally, its outputs were often less structured and detailed, making them less useful for rehabilitation planning. These findings highlight the importance of using AI models specifically trained for medical contexts to ensure clinically relevant recommendations. Such discrepancies highlight the importance of selecting the appropriate AI model based on specific clinical applications and demonstrate that not all AI algorithms are equally effective, a point corroborated by previous research [16].
The analysis of the DISCERN scores further supports these interpretations. The comparable scores for ChatGPT-4o, DeepSeek V3, and Gemini 2 suggest that these models can produce reliable and evidence-based information essential for formulating safe rehabilitation protocols. In contrast, Copilot’s consistently lower score may indicate deficiencies in its content generation process, raising concerns about its reliability for clinical decision support. These findings align with the broader trend of exploring AI’s potential in perioperative care while emphasizing the need for rigorous quality control.
Readability is a critical factor in the practical application of rehabilitation protocols. Gemini 2’s higher Flesch Reading Ease score and lower Flesch–Kincaid Grade Level imply that its outputs are more accessible, which could facilitate better understanding among clinicians and patients. However, while the DISCERN score and readability indices such as the Flesch Reading Ease, Flesch–Kincaid Grade Level, and Coleman–Liau Index provide objective measures of information quality and text complexity, they also have inherent limitations when applied to medical texts. The readability indices primarily evaluate structural aspects of language, such as sentence length and word difficulty, but do not assess whether the content is clinically accurate, contextually appropriate, or effectively communicates complex medical concepts. For example, a text with a high readability score may be oversimplified and omit essential clinical details, while a more complex text with a lower score may contain crucial information needed for precise rehabilitation planning. Furthermore, these metrics were originally developed for general educational materials rather than specialized medical content, making their direct applicability to AI-generated rehabilitation programs somewhat limited. Therefore, while readability scores provide useful insights into the accessibility of AI-generated content, they should be interpreted alongside qualitative expert assessments to ensure that rehabilitation plans are both comprehensible and clinically sound.
Using standardized readability measures in our study allows for a valid comparison with the existing literature and reinforces that clarity is paramount in clinical documentation. Although the Coleman–Liau Index did not reveal stark differences among the models, its relatively stable values suggest that while basic text complexity may be similar, other aspects of readability, such as sentence structure and vocabulary, play a more pivotal role in ensuring effective communication.
The statistical significance established through the one-way analysis of variance lends robust support to these findings. The significant p values (all at or below approximately 0.01, except for the borderline Coleman–Liau Index result) confirm that the observed differences in performance, quality, and readability among the models are not due to random variation. This reinforces the validity of our conclusions and suggests that each model’s distinct strengths and weaknesses are inherent to its design and training data.
It is vital for all who use LLMs to understand that their output is not static. The same prompt entered today will likely not return the same result if entered a year later. These results have broader implications within the context of AI integration in clinical practice. The ability of large language models to generate structured rehabilitation protocols aligns with previous research on AI-assisted clinical decision-making. While prior studies have explored the role of LLMs in generating medical guidelines and decision support tools [12,13], their application in postoperative rehabilitation planning remains less examined. This study builds upon the existing research by evaluating AI-generated rehabilitation recommendations specifically in head and neck surgery, an area where patient-specific variability often limits standardized protocols. However, as with other AI applications in healthcare, the integration of LLMs in rehabilitation requires further validation to ensure their clinical reliability and adaptability [16,17,18]. For instance, while AI algorithms have shown superiority in specific decision-making tasks, translating this potential into a reliable clinical tool requires adherence to principles such as TURBO (Testable, Usable, Reliable, Beneficial, Operable) to ensure consistent performance [15].
Furthermore, the dynamic nature of AI outputs, wherein the same input may yield different results over time due to continuous updates in training data, necessitates that these tools be used under strict clinical supervision [18]. This study, like others in the field, acknowledges that while AI can provide valuable insights and augment clinical practice, its output must be validated through prospective studies, randomized controlled trials, and longitudinal research. Additionally, integrating AI with established technologies such as Virtual Surgical Planning, a cornerstone in head and neck surgery, could further refine patient-specific rehabilitation strategies and optimize surgical outcomes.
This study has several limitations that should be considered when interpreting the findings. Firstly, the rehabilitation programs generated by LLMs were evaluated using hypothetical clinical scenarios rather than real patient data, which may limit the direct applicability of the results to clinical practice. Secondly, the number of expert reviewers was limited to three senior clinicians, which, although providing a high level of expertise, may not fully capture the diversity of clinical perspectives. Furthermore, clinician assessments are inherently subjective and may be influenced by individual biases, prior experiences, or expectations, introducing potential variability in the evaluation of LLM outputs. Expanding the panel and incorporating blinded or standardized assessment protocols in future studies could help mitigate these biases and enhance reliability. Thirdly, while readability metrics were used to assess text complexity, they do not fully capture the nuances of patient comprehension or the effectiveness of AI-generated recommendations in real-world rehabilitation settings. Finally, as LLMs continue to evolve, their outputs are subject to ongoing updates, meaning that future versions of these models may yield different results. Prospective studies involving real patient cases and longitudinal follow-up are necessary to validate the clinical utility of AI-generated rehabilitation programs.
This comprehensive analysis demonstrates significant differences in the performance, quality, and readability of rehabilitation protocols generated by various AI models [17,18,19]. The superior performance of ChatGPT-4o, coupled with the enhanced readability of Gemini 2, underscores the nuanced strengths of current AI technologies while highlighting areas needing improvement [20]. These findings contribute to the evolving landscape of AI in perioperative care by providing critical insights that can inform future research and clinical practice [21]. Moving forward, multidisciplinary collaborations and rigorous clinical validations must guide the integration of AI into healthcare, ensuring that these advanced tools ultimately translate into improved patient outcomes and safer, more effective clinical interventions [22].