Abstract
Background: Dupuytren’s disease is a fibroproliferative condition affecting the hand’s palmar fascia that leads to progressive finger contractures and functional limitations. Management relies heavily on the expertise of hand surgeons, who tailor interventions based on clinical assessment. With the growing interest in artificial intelligence (AI) in medical decision-making, this study evaluates the feasibility of integrating AI into the clinical management of Dupuytren’s disease by comparing AI-generated recommendations with those of expert hand surgeons. Methods: This multicentric comparative study involved three experienced hand surgeons and five AI systems (ChatGPT, Gemini, Perplexity, DeepSeek, and Copilot). Twenty-two standardized clinical prompts representing various Dupuytren’s disease scenarios were used to assess decision-making. Surgeons and AI systems provided management recommendations, which were analyzed for concordance, rationale, and predicted outcomes. Key metrics included union accuracy, surgeon agreement, precision, recall, and F1 scores. The study also evaluated AI performance in unanimous versus non-unanimous cases and inter-AI agreement. Results: Gemini and ChatGPT demonstrated the highest union accuracy (86.4% and 81.8%, respectively), while Copilot showed the lowest (40.9%). Average surgeon agreement was highest for Gemini (45.5%) and ChatGPT (42.4%). AI systems performed better in unanimous cases (accuracy up to 92.0%) than in non-unanimous cases (accuracy as low as 35.0%). Inter-AI agreement ranged from 78.0% (Gemini–Perplexity) to 48.0% (DeepSeek–Copilot). Precision, recall, and F1 scores were consistently higher for ChatGPT and Gemini than for the other systems. Conclusions: AI systems, particularly Gemini and ChatGPT, show promise in aligning with expert surgical recommendations, especially in straightforward cases. However, significant variability exists, particularly in complex scenarios. AI should be viewed as complementary to clinical judgment, requiring further refinement and validation before integration into clinical practice.
1. Introduction
Dupuytren’s disease is a fibroproliferative condition affecting the hand’s palmar fascia, leading to progressive contracture of the fingers and significant functional limitations. This condition’s natural history and management are central to hand surgery, necessitating a multidisciplinary and individualized approach. Diagnosis and monitoring of lesion progression rely heavily on the clinical expertise of hand surgeons, who evaluate the need for corrective interventions, which may include various therapeutic techniques such as open surgery, fasciotomies, needle aponeurotomy, or collagenase injections. Traditionally, the success of treatment has depended on the surgeon’s proficiency, the accuracy of clinical assessment, and the ability to tailor interventions to the specific pathological conditions of each patient.
In recent years, there has been a growing interest in applying artificial intelligence (AI) in plastic and reconstructive surgery, opening new avenues for managing Dupuytren’s disease [1,2,3]. Recent studies indicate that machine learning (ML) and deep learning (DL) algorithms can assist clinicians in early diagnosis and predicting postoperative outcomes, providing objective assessments that complement and, in certain situations, enhance human expertise. In the context of hand surgery, these technologies offer the potential for early identification of contractures, risk stratification, and guidance in selecting the most appropriate treatment [3,4,5]. Such advancements aim to optimize clinical results and reduce the margin of error associated with subjective variables, thus improving patient outcomes.
The emergence of AI in clinical practice represents an opportunity to augment traditional decision-making processes in hand surgery. Large language models (LLMs) such as ChatGPT-4o, Perplexity, Gemini 2.0, DeepSeek V3, and Copilot are increasingly recognized for their capabilities in processing medical data, synthesizing insights from a vast body of literature, and providing evidence-based recommendations. These tools can assist clinicians in identifying optimal management pathways, forecasting outcomes, and standardizing decision-making processes to reduce variability. By integrating these AI-driven platforms, healthcare providers may enhance diagnostic precision, streamline treatment planning, and ultimately elevate the quality of care delivered to patients with Dupuytren’s disease [4,6].
The study presented herein aims to compare management strategies for Dupuytren’s disease derived from the collective expertise of hand surgeons across various institutions with recommendations proposed by prominent LLMs. Through a systematic assessment of the concordance and divergence between human-derived and AI-generated recommendations, this analysis explores the feasibility of integrating AI into the clinical management of Dupuytren’s disease. It evaluates the limitations of current AI approaches and considers their implications for future clinical practice. This comparison does not intend to advocate for a complete replacement of human clinical judgment; instead, it highlights how AI can function as a complementary tool—enhancing diagnostic accuracy, optimizing therapeutic plans, and ultimately improving patient outcomes in managing this complex condition.
2. Materials and Methods
This multicentric comparative study, conducted at Austin Hospital (145 Studley Road, Heidelberg, Victoria 3084, Australia) and Frankston Hospital (2 Hastings Road, Frankston, Victoria 3199, Australia), evaluated the management strategies for Dupuytren’s disease proposed by experienced hand surgeons and large language models. Twenty-two standardized clinical prompts representing a spectrum of scenarios in Dupuytren’s disease management were designed to assess decision-making processes. Three experienced hand surgeons from different institutions participated in the study, each with over two decades of clinical expertise in hand surgery; each surgeon provided management recommendations blinded to the responses of the other surgeons and of the LLMs. Five LLMs (DeepSeek V3, ChatGPT-4o, Perplexity, Copilot, and Gemini 2.0) were selected for comparison based on their established relevance in medical decision-making tasks. The LLMs were accessed in January 2025; where no specific version could be selected, the default model available on each platform’s homepage on that date was used.
The clinical prompts encompassed hypothetical scenarios with varying disease severities, comorbidities, and functional impairments. These scenarios were presented to both the hand surgeons and the LLMs, each of which recommended management strategies, provided justifications for their decisions, and predicted outcomes. The hand surgeons responded based on their clinical judgment, leveraging their expertise to offer tailored recommendations, whereas the LLMs analyzed textual and pictorial inputs describing the scenarios to generate recommendations. All LLMs were queried using standardized language and inputs to ensure reproducibility and minimize bias.
Comparative analysis was conducted to evaluate the concordance between the LLMs’ recommendations and those of the hand surgeons. Key study outcomes included treatment recommendations, consistency in rationale, accuracy of predicted outcomes, and overall decision reliability. The quantitative analysis involved calculating concordance rates between surgeons and LLMs. Additionally, thematic analysis was performed to identify areas of agreement and divergence in decision-making rationales. Biases in LLM recommendations were systematically evaluated, highlighting tendencies toward overgeneralization or misinterpretation of context-specific details.
This study did not require ethical approval, as it utilized hypothetical scenarios without involving identifiable patient data. Freely available online images without copyright restrictions were used, and authorization was obtained for any supplemental data sources. This ensured full compliance with data protection and privacy standards.
3. Results
Table 1 compares the five LLMs in terms of performance characteristics, medical fine-tuning, accessibility, limitations, and response-generation time. Table 2 summarizes the performance of the five AI systems with respect to their alignment with the union of the three surgeons’ management recommendations. Union accuracy was 86.4% for Gemini, 81.8% for ChatGPT, 77.3% for Perplexity, 63.6% for DeepSeek, and 40.9% for Copilot. In terms of individual surgeon agreement, ChatGPT achieved 72.7%, 36.4%, and 18.2% for Surgeons I, II, and III, respectively (average surgeon agreement, 42.4%); Gemini achieved 77.3%, 36.4%, and 22.7% (45.5%); Perplexity achieved 54.5%, 40.9%, and 40.9% (45.4%); DeepSeek achieved 45.5%, 27.3%, and 27.3% (33.4%); and Copilot achieved 31.8%, 18.2%, and 31.8% (27.3%). The average number of management options recommended per case was 2.0 for ChatGPT, 2.1 for Gemini, 2.0 for Perplexity, 1.8 for DeepSeek, and 1.3 for Copilot. Average precision, recall, and F1 scores were 62.0%, 58.0%, and 60.0% for ChatGPT; 64.0%, 60.0%, and 62.0% for Gemini; 60.0%, 57.0%, and 58.5% for Perplexity; 55.0%, 52.0%, and 53.5% for DeepSeek; and 45.0%, 42.0%, and 43.5% for Copilot.
Table 1.
Large language models comparative table.
Table 2.
Management Recommendations from Artificial Intelligence and Experienced Hand Surgeons.
Table 4 stratifies performance based on the level of consensus among the surgeons. For cases in which the surgeons’ recommendations were unanimous, the accuracy of the AI systems was 90.0% for ChatGPT, 92.0% for Gemini, 85.0% for Perplexity, 70.0% for DeepSeek, and 50.0% for Copilot. In cases with non-unanimous surgeon recommendations, the corresponding accuracies were 75.0%, 80.0%, 70.0%, 55.0%, and 35.0%, respectively.
Table 3.
Coded Representation of Management Modalities Selected by LLMs and Surgeons for Analytical Comparison.
Table 5 presents the inter-AI agreement, defined as the percentage of cases in which any two AI systems share at least one common management option. Pairwise agreements were 75.0% for ChatGPT–Gemini, 70.0% for ChatGPT–Perplexity, 65.0% for ChatGPT–DeepSeek, and 50.0% for ChatGPT–Copilot; 78.0% for Gemini–Perplexity, 70.0% for Gemini–DeepSeek, and 55.0% for Gemini–Copilot; 68.0% for Perplexity–DeepSeek and 52.0% for Perplexity–Copilot; and 48.0% for DeepSeek–Copilot.
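Formally, this pairwise agreement can be written as follows, where $R_k^{A}$ denotes the set of management options recommended by model $A$ for case $k$, $N = 22$ is the number of cases, and $\mathbf{1}[\cdot]$ is the indicator function (a notational restatement of the definition above, not a formula reproduced from the study):

```latex
\mathrm{Agreement}(A, B) \;=\; \frac{100}{N} \sum_{k=1}^{N} \mathbf{1}\!\left[\, R_k^{A} \cap R_k^{B} \neq \varnothing \,\right]
```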
Table 4.
Extended Performance Metrics of LLMs and Agreement with Surgeons.
These inter-model agreement rates highlight the variability in AI decision-making: concordance was highest between Gemini and Perplexity (78.0%), followed by ChatGPT and Gemini (75.0%), and lowest between DeepSeek and Copilot (48.0%), with some models demonstrating substantially closer alignment in their clinical recommendations than others.
Table 5.
Inter-AI LLMs Agreement.
The inter-rater reliability analysis among the three expert hand surgeons revealed only slight-to-fair agreement, as reflected in the Cohen’s kappa values for pairwise comparisons: 0.224 (Surgeons 1 and 2), 0.081 (Surgeons 1 and 3), and 0.169 (Surgeons 2 and 3). The overall Fleiss’ kappa of 0.122 further highlights this variability in decision-making. The low agreement suggests that clinical judgment in Dupuytren’s disease management is inherently subjective, with treatment choices influenced by factors such as individual surgeon experience, interpretation of disease severity, and patient-specific considerations. Notably, Surgeon 3 diverged most from the other two experts, particularly in cases involving recurrent disease or mild contractures, where the decision between conservative management and early intervention was less clear-cut.
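For transparency, a minimal sketch of this reliability analysis is shown below. scikit-learn (cited for the study’s metric computations) provides Cohen’s kappa; using statsmodels for Fleiss’ kappa is our assumption, and the coded ratings are hypothetical placeholders rather than the study data.

```python
# Sketch of the inter-rater reliability computation, assuming each surgeon's
# recommendation per case is coded to a single integer category (Table 3).
# statsmodels for Fleiss' kappa is our assumption; all ratings below are
# hypothetical placeholders, not the study data.
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = cases, columns = Surgeons 1-3 (e.g., 0 = observation,
# 1 = needle aponeurotomy, 2 = collagenase, 3 = limited fasciectomy).
ratings = np.array([
    [3, 3, 1],
    [0, 0, 2],
    [2, 1, 1],
    [3, 3, 3],
    [1, 2, 0],
    [0, 0, 0],
])

# Pairwise Cohen's kappa between surgeons.
for i, j in combinations(range(ratings.shape[1]), 2):
    kappa = cohen_kappa_score(ratings[:, i], ratings[:, j])
    print(f"Surgeons {i + 1} vs {j + 1}: Cohen's kappa = {kappa:.3f}")

# Overall Fleiss' kappa across all three raters.
counts, _ = aggregate_raters(ratings)  # cases x categories count table
print(f"Fleiss' kappa = {fleiss_kappa(counts):.3f}")
```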
Statistical Analysis
Performance metrics were computed following standard definitions and tailored to capture the multifaceted nature of our study. In addition to conventional measures such as accuracy, precision, recall, and F1 score, we calculated several domain-specific metrics. Recommendations were first coded into standardized management modality categories to enable analytical comparison (Table 3). Union accuracy was defined as the percentage of cases in which an AI’s recommendation, viewed as a set of management options, overlapped with the union of the three surgeons’ recommendations. Individual surgeon agreement was computed for each surgeon as the percentage of cases in which the AI’s recommendation included the specific option selected by that surgeon; the average surgeon agreement is the arithmetic mean of these three values. We also determined the average options count by averaging the number of management options recommended per case. Cases were further stratified by surgeon consensus into unanimous and non-unanimous groups, and the corresponding accuracies for each subgroup were calculated. Finally, the pairwise inter-AI agreement was derived as the percentage of cases in which any two AI systems shared at least one common management option. All calculations were implemented in custom Python scripts (Python version 3.8 or later), leveraging the NumPy and scikit-learn libraries for data processing and metric computation. The formulas were cross-validated using automated code verification and manual review to ensure transparency and reproducibility. ChatGPT (version 4o) was employed as an auxiliary tool to generate and verify portions of the computational code, further supporting the robustness of our methodology. This hybrid approach, integrating automated processes with expert oversight, confirmed the accuracy of the obtained values and provided thorough documentation of the entire analytical workflow. All results were subjected to rigorous quality control checks to ensure consistency and reliability.
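The sketch below is an illustrative reconstruction of these metrics under the stated definitions, not the study’s actual scripts: the case data, model names, and option codes are hypothetical, and the set-based precision/recall formulation against the union of surgeon options is our assumption, since the paper does not spell out that formula.

```python
# Illustrative reconstruction of the domain-specific metrics defined above.
# Recommendations are modeled as sets of coded management options (Table 3);
# all example data below are hypothetical, not the study dataset.
import numpy as np

# Each case: one coded option per surgeon, and a set of options per AI model.
cases = [
    {"surgeons": [3, 3, 3], "ai": {"ChatGPT": {3, 2}, "Copilot": {1}}},
    {"surgeons": [0, 0, 2], "ai": {"ChatGPT": {0}, "Copilot": {0, 2}}},
    {"surgeons": [1, 2, 1], "ai": {"ChatGPT": {2}, "Copilot": {3}}},
]

def union_accuracy(cases, model):
    """Percent of cases where the AI set overlaps the union of surgeon picks."""
    hits = [bool(c["ai"][model] & set(c["surgeons"])) for c in cases]
    return 100 * np.mean(hits)

def surgeon_agreement(cases, model, idx):
    """Percent of cases where the AI set includes surgeon `idx`'s option."""
    hits = [c["surgeons"][idx] in c["ai"][model] for c in cases]
    return 100 * np.mean(hits)

def avg_options_count(cases, model):
    """Average number of management options the AI recommends per case."""
    return np.mean([len(c["ai"][model]) for c in cases])

def stratified_accuracy(cases, model, unanimous):
    """Union accuracy restricted to unanimous or non-unanimous cases."""
    subset = [c for c in cases if (len(set(c["surgeons"])) == 1) == unanimous]
    return union_accuracy(subset, model) if subset else float("nan")

def inter_ai_agreement(cases, model_a, model_b):
    """Percent of cases where two AI systems share at least one option."""
    hits = [bool(c["ai"][model_a] & c["ai"][model_b]) for c in cases]
    return 100 * np.mean(hits)

def precision_recall_f1(cases, model):
    """Set-based macro precision/recall against the union of surgeon options
    (our assumed definition), with F1 as the harmonic mean."""
    p, r = [], []
    for c in cases:
        truth, pred = set(c["surgeons"]), c["ai"][model]
        tp = len(pred & truth)
        p.append(tp / len(pred) if pred else 0.0)
        r.append(tp / len(truth))
    precision, recall = 100 * np.mean(p), 100 * np.mean(r)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

avg_surgeon_agreement = np.mean(
    [surgeon_agreement(cases, "ChatGPT", i) for i in range(3)]
)
print(f"Union accuracy (ChatGPT):  {union_accuracy(cases, 'ChatGPT'):.1f}%")
print(f"Avg surgeon agreement:     {avg_surgeon_agreement:.1f}%")
print(f"Unanimous-case accuracy:   {stratified_accuracy(cases, 'ChatGPT', True):.1f}%")
print(f"ChatGPT-Copilot agreement: {inter_ai_agreement(cases, 'ChatGPT', 'Copilot'):.1f}%")
```

Note that the harmonic-mean F1 reproduces the reported values to within rounding (e.g., for Gemini, 2 × 64.0 × 60.0 / 124.0 ≈ 62.0%), which is consistent with this assumed formulation.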
4. Discussion
In this study, we evaluated the performance of five large language models (ChatGPT, Gemini, Perplexity, DeepSeek, and Copilot) in recommending surgical management options, using the union of three expert surgeons’ recommendations as the reference standard. Our key findings indicate that ChatGPT and Gemini consistently outperformed the other systems: union accuracy was highest for Gemini (86.4%) and ChatGPT (81.8%), while Copilot exhibited markedly lower performance (40.9%). This trend was reflected in the individual and average surgeon agreement metrics, where ChatGPT (42.4%), Gemini (45.5%), and Perplexity (45.4%) achieved higher agreement than DeepSeek (33.4%) and Copilot (27.3%). Additionally, the average number of management options recommended per case was highest for Gemini (2.1) and lowest for Copilot (1.3). The corresponding precision, recall, and F1 scores further corroborated the superior performance of ChatGPT and Gemini relative to the other systems.
Stratification by surgeon consensus (Table 4) revealed that all AI systems performed better in cases with unanimous surgeon decisions: in these straightforward clinical scenarios, accuracies reached 90.0% for ChatGPT and 92.0% for Gemini. In contrast, non-unanimous cases, presumably representing more complex or ambiguous clinical contexts, showed a consistent decline in accuracy, with values as low as 35.0% for Copilot. This suggests that while high-performing AI systems can replicate expert consensus in clear-cut cases, their reliability diminishes when faced with divergent clinical opinions. Moreover, the analysis of inter-AI agreement (Table 5) demonstrated only moderate convergence among the systems, with pairwise agreements ranging from 78.0% for Gemini–Perplexity down to 48.0% for DeepSeek–Copilot. Such variability underscores that while certain AI models align closely in their recommendations, others diverge significantly, raising concerns about consistency across platforms.
These results align with prior studies assessing AI performance in clinical decision-making. For example, Kuo et al. [1] reported high diagnostic accuracies for AI in fracture detection, comparable to those of human clinicians, but also noted a substantial risk of bias in over half of the included studies. Similarly, Husarek et al. [4] demonstrated that commercially available AI systems perform well in several anatomical regions yet struggle in more challenging contexts, resonating with our findings in non-unanimous cases. In addition, Wong et al. [2,3] highlighted the efficacy of AI in both diagnostic tasks and systematic review processes, supporting the potential role of systems like ChatGPT and Gemini as adjuncts to clinical judgment. Conversely, the consistently lower performance of Copilot observed in our study suggests that not all AI models are equally suited for clinical application, reinforcing the necessity for careful selection and rigorous evaluation of AI tools in surgical practice [5,6,7,8].
While LLMs demonstrated promising concordance with expert surgical recommendations, their decision-making exhibited notable inconsistencies, particularly in complex cases requiring nuanced clinical judgment [9,10,11,12,13,14,15]. Two illustrative cases highlight these limitations and the potential risks of AI-assisted decision-making.

Case 1: Overly Conservative Approach in Advanced Contracture. A 74-year-old female with Dupuytren’s disease of the 4th and 5th rays, presenting with a 30° MCP and an 80° PIP contracture in the 5th ray, had previously undergone needle aponeurotomy. Given the severity of the contracture and the recurrence, the expert surgeons recommended limited fasciectomy or dermofasciectomy, emphasizing smoking cessation and postoperative rehabilitation to improve functional outcomes. DeepSeek and Copilot, however, suggested a minimally invasive approach (collagenase injection or PNA), despite clear indications for a more definitive surgical intervention. The clinical risk of such a misjudgment is significant: less invasive interventions would likely fail given the extent of fibrosis, the substantial likelihood of recurrence, and the prior treatment history, leading to persistent functional impairment and delayed appropriate management [16,17,18,19].

Case 2: Overly Aggressive Recommendation in Early Disease. A 44-year-old manual laborer presented with a palpable palmar nodule without contracture. Given the absence of functional limitations, all three surgeons recommended conservative management, including observation, hand therapy, and patient education. Gemini and Perplexity, however, proposed early interventional options, including PNA or collagenase injection. AI-generated response (Gemini and Perplexity): “Consider early PNA or collagenase injection to prevent progression”. Expert surgeon recommendation: “Conservative approach with observation. No intervention unless contracture develops”. Such premature intervention poses unnecessary procedural risks without proven benefit at this disease stage; the associated clinical risks include iatrogenic complications, increased healthcare costs, and unnecessary patient anxiety, highlighting AI’s tendency to overgeneralize early intervention strategies without individualized assessment [20,21,22].

In both cases, the expert surgeons’ recommendations were consistent and adhered to established clinical guidelines. In contrast, the AI models exhibited a dichotomy of errors: some (DeepSeek, Copilot) demonstrated an overly cautious approach, whereas others (Gemini, Perplexity) favored premature intervention. These findings highlight the variability in AI-driven decision-making and emphasize the importance of human oversight in mitigating potential clinical risks [21,22,23]. They also underscore the necessity for rigorous validation and refinement before AI systems can be reliably integrated into surgical practice [24]. While LLMs can function as adjunct tools, their outputs require critical appraisal by experienced clinicians to prevent misinterpretation of clinical scenarios. Future research should enhance AI contextual awareness, reduce bias in complex cases, and ensure that recommendations align with best-practice guidelines [25].
The observed variability in inter-rater reliability highlights the inherent subjectivity in clinical decision-making for Dupuytren’s disease, where treatment recommendations are influenced by multiple factors, including surgeon experience, interpretation of disease severity, and individualized patient considerations. The low agreement among experts, particularly the divergence of Surgeon 3 from the other two, suggests that specific clinical scenarios such as recurrent disease or early-stage contractures pose more significant challenges in establishing a uniform treatment approach. In these cases, the balance between conservative management and early intervention remains nuanced, reflecting differences in risk tolerance, prior surgical experiences, and perspectives on long-term functional outcomes [19]. This variability underscores the difficulties in standardizing management protocols for complex presentations of Dupuytren’s disease and highlights the limitations of relying solely on individual expertise. The findings further emphasize the potential role of AI as a complementary decision-support tool, providing evidence-based guidance while allowing for expert oversight [15]. However, to improve clinical consistency, future research should focus on developing consensus-driven guidelines, refining AI-driven recommendations through enhanced training on expert-validated datasets, and incorporating multimodal decision frameworks that integrate objective AI insights and subjective clinical expertise.
Nevertheless, our study has several limitations. The sample size of 22 cases may not fully capture the heterogeneity of clinical scenarios, and quantifying qualitative surgical recommendations could introduce inherent bias. Furthermore, the retrospective design limits the assessment of real-time decision-making dynamics. Future research should address these limitations by incorporating larger, prospective studies and randomized controlled trials. Additionally, exploring multimodal decision support frameworks that integrate multiple AI outputs may help leverage the strengths of high-performing systems while mitigating variability in less consistent models.
Our findings demonstrate that while certain AI systems, particularly ChatGPT and Gemini, show considerable promise in replicating expert surgical recommendations, significant variability exists across platforms, especially in ambiguous clinical scenarios. These results contribute to the growing body of evidence supporting the integration of AI into clinical decision-making and underscore the need for further refinement, validation, and contextual adaptation of AI technologies in surgical practice [5,6,7,8]. Future research should focus on refining and validating AI-driven clinical decision support systems to enhance their reliability and applicability in surgical practice [9,10,11]. Specifically, prospective studies are needed to systematically evaluate AI performance across diverse patient populations, surgical specialties, and real-world clinical settings. Comparative analyses between AI-generated recommendations and expert consensus should be conducted to assess consistency, accuracy, and clinical impact [12,13,14,15]. Additionally, research should explore the integration of AI with multimodal data sources, including imaging and patient-specific factors, to improve decision-making in complex and ambiguous surgical cases. Ethical considerations, including AI transparency, accountability, and potential biases, should also be examined to ensure safe and equitable implementation in clinical workflows [16]. Finally, randomized controlled trials assessing the impact of AI-assisted decision-making on patient outcomes, efficiency, and cost-effectiveness are warranted to establish robust evidence for its integration into surgical practice.
5. Conclusions
In summary, this study provides a comprehensive evaluation of the current capabilities of AI systems relative to expert hand surgeons in managing Dupuytren’s disease. Although systems like Gemini and ChatGPT demonstrate promising levels of alignment with expert recommendations, especially in cases with unanimous clinical agreement, significant variability remains. These results reinforce the view that, at present, AI should be considered an adjunct to, rather than a replacement for, expert clinical judgment. Integrating AI into clinical practice requires further refinement, rigorous validation, and a collaborative approach that leverages both technological advancements and clinical expertise. Ultimately, such efforts hold promise for advancing personalized management strategies and improving outcomes for patients with Dupuytren’s disease.
Author Contributions
Conceptualization, I.S., G.M. and K.L.; methodology, I.S., G.M. and M.C.; software, G.M.; validation, I.S. and M.C.; formal analysis, I.S. and G.M.; investigation, I.S. and G.M.; resources, I.S. and G.M.; data curation, I.S., G.M. and K.L.; writing—original draft preparation, I.S., G.M., K.L., M.C., S.K.-H.N., W.M.R. and R.J.R.; writing—review and editing, all authors; supervision, R.C., W.M.R., S.K.-H.N. and R.J.R. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Ethical review and approval were waived for this study, as it used hypothetical clinical scenarios and did not involve identifiable patient data, in accordance with institutional and national ethical guidelines.
Informed Consent Statement
Not applicable.
Data Availability Statement
Data for this study are available upon reasonable request from the corresponding author.
Conflicts of Interest
The authors declare no conflicts of interest. Ishith Seth serves as the Guest Editor for the Special Issue in which this manuscript is published. However, the editorial process was conducted independently to ensure transparency and integrity.
Abbreviations
The following abbreviations are used in this manuscript:
| Abbreviation | Definition |
| --- | --- |
| AI | Artificial Intelligence |
| MCP | Metacarpophalangeal Joint |
| PIP | Proximal Interphalangeal Joint |
| PNA | Percutaneous Needle Aponeurotomy |
| CCH | Collagenase Clostridium Histolyticum (e.g., Xiaflex) |
| TPED | Total Passive Extension Deficit |
| T2DM | Type 2 Diabetes Mellitus |
| CKD | Chronic Kidney Disease |
| HTN | Hypertension |
| MI | Myocardial Infarction |
| AF | Atrial Fibrillation |
| CABG | Coronary Artery Bypass Graft |
References
- Kuo, R.Y.L.; Harrison, C.; Curran, T.A.; Jones, B.; Freethy, A.; Cussons, D.; Stewart, M.; Collins, G.S.; Furniss, D. Artificial Intelligence in Fracture Detection: A Systematic Review and Meta-Analysis. Radiology 2022, 304, 50–62.
- Wong, C.R.; Zhu, A.; Baltzer, H.L. The Accuracy of Artificial Intelligence Models in Hand/Wrist Fracture and Dislocation Diagnosis: A Systematic Review and Meta-Analysis. JBJS Rev. 2024, 12, e24.00106.
- Wong, G.C.; Kane, R.L.; Chu, C.J.; Lin, C.H.; Kuo, C.F.; Chung, K.C. Enhancing systematic review efficiency in hand surgery using artificial intelligence (natural language processing) for abstract screening. J. Hand Surg. Eur. Vol. 2024, 17531934241295493.
- Husarek, J.; Hess, S.; Razaeian, S.; Ruder, T.D.; Sehmisch, S.; Müller, M.; Liodakis, E. Artificial intelligence in commercial fracture detection products: A systematic review and meta-analysis of diagnostic test accuracy. Sci. Rep. 2024, 14, 23053.
- Yin, J.; Ngiam, K.Y.; Teo, H.H. Role of Artificial Intelligence Applications in Real-Life Clinical Practice: Systematic Review. J. Med. Internet Res. 2021, 23, e25759.
- Cheng, K.; Li, Z.; Guo, Q.; Sun, Z.; Wu, H.; Li, C. Emergency surgery in the era of artificial intelligence: ChatGPT could be the doctor’s right-hand man. Int. J. Surg. 2023, 109, 1816–1818.
- Liu, L.; Zhang, Y.; Xiao, X.; Xie, R. The promising horizon of deep learning and artificial intelligence in flap monitoring. Int. J. Surg. 2023, 109, 4391–4392.
- Marcaccini, G.; Seth, I.; Cuomo, R. Letter on: “Artificial Intelligence: Enhancing Scientific Presentations in Aesthetic Surgery”. Aesthetic Plast. Surg. 2024; epub ahead of print.
- Puladi, B.; Gsaxner, C.; Kleesiek, J.; Hölzle, F.; Röhrig, R.; Egger, J. The impact and opportunities of large language models like ChatGPT in oral and maxillofacial surgery: A narrative review. Int. J. Oral Maxillofac. Surg. 2024, 53, 78–88.
- Chatterjee, S.; Bhattacharya, M.; Pal, S.; Lee, S.S.; Chakraborty, C. ChatGPT and large language models in orthopedics: From education and surgery to research. J. Exp. Orthop. 2023, 10, 128.
- Abi-Rafeh, J.; Xu, H.H.; Kazan, R.; Tevlin, R.; Furnas, H. Large language models and artificial intelligence: A primer for plastic surgeons on the demonstrated and potential applications, promises, and limitations of ChatGPT. Aesthetic Surg. J. 2024, 44, 329–343.
- ElHawary, H.; Gorgy, A.; Janis, J.E. Large language models in academic plastic surgery: The way forward. Plast. Reconstr. Surg. Glob. Open 2023, 11, e4949.
- Mohapatra, D.P.; Thiruvoth, F.M.; Tripathy, S.; Rajan, S.; Vathulya, M.; Lakshmi, P.; Singh, V.K.; Haq, A.U. Leveraging Large Language Models (LLM) for the plastic surgery resident training: Do they have a role? Indian J. Plast. Surg. 2023, 56, 413–420.
- Gupta, R.; Park, J.B.; Bisht, C.; Herzog, I.; Weisberger, J.; Chao, J.; Chaiyasate, K.; Lee, E.S. Expanding cosmetic plastic surgery research with ChatGPT. Aesthetic Surg. J. 2023, 43, 930–937.
- Seth, I.; Lim, B.; Xie, Y.; Cevik, J.; Rozen, W.M.; Ross, R.J.; Lee, M. Comparing the efficacy of large language models ChatGPT, BARD, and Bing AI in providing information on rhinoplasty: An observational study. Aesthetic Surg. J. Open Forum 2023, 5, ojad084.
- Baltzer, H.; Binhammer, P.A. Cost-effectiveness in the management of Dupuytren’s contracture. Bone Jt. J. 2013, 95-B, 1094–1100.
- Shih, B.; Bayat, A. Scientific understanding and clinical management of Dupuytren disease. Nat. Rev. Rheumatol. 2010, 6, 715–726.
- Gil, J.A.; Akelman, M.R.; Hresko, A.M.; Akelman, E. Current concepts in the management of Dupuytren disease of the hand. J. Am. Acad. Orthop. Surg. 2021, 29, 462–469.
- Denkler, K.A.; Park, K.M.; Alser, O. Treatment options for Dupuytren’s disease: Tips and tricks. Plast. Reconstr. Surg. Glob. Open 2022, 10, e4046.
- Borna, S.; Gomez-Cabello, C.A.; Pressman, S.M.; Haider, S.A.; Forte, A.J. Comparative Analysis of Large Language Models in Emergency Plastic Surgery Decision-Making: The Role of Physical Exam Data. J. Pers. Med. 2024, 14, 612.
- Frosolini, A.; Catarzi, L.; Benedetti, S.; Latini, L.; Chisci, G.; Franz, L.; Gennaro, P.; Gabriele, G. The Role of Large Language Models (LLMs) in Providing Triage for Maxillofacial Trauma Cases: A Preliminary Study. Diagnostics 2024, 14, 839.
- Atkinson, C.J.; Seth, I.; Xie, Y.; Ross, R.J.; Hunter-Smith, D.J.; Rozen, W.M.; Cuomo, R. Artificial Intelligence Language Model Performance for Rapid Intraoperative Queries in Plastic Surgery: ChatGPT and the Deep Inferior Epigastric Perforator Flap. J. Clin. Med. 2024, 13, 900.
- Lim, B.; Seth, I.; Kah, S.; Sofiadellis, F.; Ross, R.J.; Rozen, W.M.; Cuomo, R. Using Generative Artificial Intelligence Tools in Cosmetic Surgery: A Study on Rhinoplasty, Facelifts, and Blepharoplasty Procedures. J. Clin. Med. 2023, 12, 6524.
- Ma’aitah, M.K.S.; Helwan, A.; Radwan, A. Urinary Bladder Acute Inflammations and Nephritis of the Renal Pelvis: Diagnosis Using Fine-Tuned Large Language Models. J. Pers. Med. 2025, 15, 45.
- Nazi, Z.A.; Peng, W. Large Language Models in Healthcare and Medical Domain: A Review. Informatics 2024, 11, 57.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).