Current Applications of Chatbots Powered by Large Language Models in Oral and Maxillofacial Surgery: A Systematic Review
Abstract
1. Introduction
2. Materials and Methods
2.1. Eligibility Criteria
- (P) Human subjects including patients, dental professionals, or laypeople;
- (I) LLM-based chatbots, such as ChatGPT (OpenAI), Gemini (Google), and Copilot (Microsoft), implemented in OMFS domains, encompassing both traditional oral surgery procedures and more extensive surgical interventions in the maxillofacial region;
- (C) Conventional approaches to finding medical information (e.g., brochures or human experts), or comparisons between LLM-based chatbots;
- (O) Evaluation of the answers provided by AI-based chatbots in the field of OMFS, in terms of accuracy, precision, readability, and response processing speed;
- (S) Primary research articles published in English in peer-reviewed academic journals between 2023 and 8 April 2025 were included.
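Among the outcomes listed above, readability is commonly quantified with standard formulas such as Flesch Reading Ease. As an illustration only (the syllable-counting heuristic and the sample sentences below are our own, not drawn from the included studies), a minimal sketch:

```python
import re

def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic; adequate for illustration only."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    # Crude silent-'e' adjustment.
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease:
    206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

# Hypothetical patient-facing vs. jargon-heavy phrasing:
simple = "The tooth was removed. Healing was quick."
complex_text = ("Postoperative alveolar osteitis necessitated "
                "supplementary analgesic administration.")
score_simple = flesch_reading_ease(simple)
score_complex = flesch_reading_ease(complex_text)
```

Higher scores indicate easier text; patient-education materials are often targeted at scores of roughly 60 or above.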
2.2. Search Strategy
2.3. Study Selection and Data Extraction
- (1) Title and abstract screening: Two independent reviewers (U.C. and S.S.) evaluated the titles and abstracts of all records against the predefined eligibility criteria based on the PICOS framework. Records clearly not meeting the inclusion criteria were excluded at this stage.
- (2) Full-text assessment: For all studies that met the inclusion criteria, or where relevance remained uncertain, the full texts were retrieved and independently reviewed in detail by the same two reviewers. Studies not meeting the eligibility criteria upon full-text review were excluded, with reasons documented.
2.4. Quality Assessment
3. Results
3.1. Study Characteristics
- Unit of analysis: Two studies assessed chatbot responses to layperson queries (patient-level communication), while the other two analyzed responses to specialized, case-based, or procedural questions (clinician-level interaction).
- Methodological approaches: These ranged from structured question sets rated by expert panels (e.g., five-point or three-point Likert scales, Global Quality Scale) to repeated simulations using large batches of identical prompts (e.g., 30 × 30 input permutations).
- LLM platforms and versions: Tools evaluated included GPT-3.5, GPT-4 (OpenAI), Bing (Microsoft), Bard (Google), and Claude Instant (Anthropic), with performance varying significantly across models and use cases.
- Sample sizes and study contexts: The number of expert raters ranged from 2 to 10, and the number of evaluated chatbot responses ranged from 20 to 900 across studies.
3.2. Main Findings
3.2.1. Chatbots Powered by LLMs and Patients
3.2.2. Chatbots Powered by LLMs and Sector Specialists or General Dentistry
3.3. Comparative Analysis of Results
3.4. Quality Assessment and Risk of Bias
4. Discussion
- The number of eligible primary studies was relatively small, mainly due to the recent introduction of large language model (LLM)-based conversational agents in medical and dental contexts. This limited sample may reduce the precision of the results and the generalizability of the conclusions.
- Significant heterogeneity was observed among the included studies regarding study design, evaluation metrics, and clinical settings, which limited the comparability of outcomes and prevented meta-analytical synthesis.
- The review was restricted to studies published in English and did not include gray literature, potentially introducing language and publication biases.
- The review did not incorporate expert opinions or clinical perspectives, which could have provided valuable contextual insight into the findings.
- Considering the rapid pace of technological advancement and the continuous emergence of new evidence, it is essential that this review be regularly updated to remain current, relevant, and aligned with the evolving landscape of AI applications in OMFS.
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition |
---|---|
AI | Artificial Intelligence |
LLMs | Large Language Models |
NLP | Natural Language Processing |
PECO | Population, Exposure, Comparator, Outcomes |
PRISMA | Preferred Reporting Items for Systematic Reviews and Meta-Analyses |
ROBINS-I | Risk Of Bias in Non-Randomized Studies of Interventions |
USMLE | United States Medical Licensing Examination |
NRSIs | Non-Randomized Intervention Studies |
LS | Likert Scale |
GQS | Global Quality Scale |
OMFS | Oral and Maxillofacial Surgery |
References
- Aggarwal, A.; Tam, C.C.; Wu, D.; Li, X.; Qiao, S. Artificial Intelligence-Based Chatbots for Promoting Health Behavioral Changes: Systematic Review. J. Med. Internet Res. 2023, 25, e40789. [Google Scholar] [CrossRef] [PubMed]
- Abbate, G.M.; Caria, M.P.; Montanari, P.; Mannu, C.; Orrù, G.; Caprioglio, A.; Levrini, L. Periodontal health in teenagers treated with removable aligners and fixed orthodontic appliances. J. Orofac. Orthop. 2015, 76, 240–250. [Google Scholar] [CrossRef] [PubMed]
- Eggmann, F.; Weiger, R.; Zitzmann, N.U.; Blatz, M.B. Implications of large language models such as ChatGPT for dental medicine. J. Esthet. Restor. Dent. 2023, 35, 1098–1102. [Google Scholar] [CrossRef] [PubMed]
- Weizenbaum, J. ELIZA—A computer program for the study of natural language communication between man and machine. Commun. ACM 1966, 9, 36–45. [Google Scholar] [CrossRef]
- Van den Broeck, E.; Zarouali, B.; Poels, K. Chatbot advertising effectiveness: When does the message get through? Comput. Hum. Behav. 2019, 98, 150–157. [Google Scholar] [CrossRef]
- Safi, Z.; Abd-Alrazaq, A.; Khalifa, M.; Househ, M. Technical Aspects of Developing Chatbots for Medical Applications: Scoping Review. J. Med. Internet Res. 2020, 22, e19127. [Google Scholar] [CrossRef]
- Gnewuch, U.; Morana, S.; Maedche, A. Towards Designing Cooperative and Social Conversational Agents for Customer Service. In Proceedings of the ICIS, Seoul, Republic of Korea, 10–13 December 2017; pp. 1–13. [Google Scholar]
- Makrygiannakis, M.A.; Giannakopoulos, K.; Kaklamanos, E.G. Evidence-based potential of generative artificial intelligence large language models in orthodontics: A comparative study of ChatGPT, Google Bard, and Microsoft Bing. Eur. J. Orthod. 2024, 46, cjae017. [Google Scholar] [CrossRef]
- Xue, J.; Zhang, B.; Zhao, Y.; Zhang, Q.; Zheng, C.; Jiang, J.; Li, H.; Liu, N.; Li, Z.; Fu, W.; et al. Evaluation of the Current State of Chatbots for Digital Health: Scoping Review. J. Med. Internet Res. 2023, 25, e47217. [Google Scholar] [CrossRef]
- Mariño, R.J.; Zaror, C. Legal issues in digital oral health: A scoping review. BMC Health Serv. Res. 2024, 24, 6. [Google Scholar] [CrossRef]
- Alsayed, A.A.; Aldajani, M.B.; Aljohani, M.H.; Alamri, H.; Alwadi, M.A.; Alshammari, B.Z.; Alshammari, F.R. Assessing the quality of AI information from ChatGPT regarding oral surgery, preventive dentistry, and oral cancer: An exploration study. Saudi Dent. J. 2024, 36, 1483–1489. [Google Scholar] [CrossRef]
- Ennenga, G.R. Artificial evolution. Artif. Life 1997, 3, 51–61. [Google Scholar] [CrossRef] [PubMed]
- Choi, R.Y.; Coyner, A.S.; Kalpathy-Cramer, J.; Chiang, M.F.; Campbell, J.P. Introduction to Machine Learning, Neural Networks, and Deep Learning. Transl. Vis. Sci. Technol. 2020, 9, 14. [Google Scholar] [PubMed]
- Adamopoulou, E.; Moussiades, L. An overview of chatbot technology. In Proceedings of the IFIP International Conference on Artificial Intelligence Applications and Innovations, Neos Marmaras, Greece, 5–7 June 2020; pp. 373–383. [Google Scholar]
- Battaglia, J. L’Intelligenza Artificiale Generativa per la Valutazione Formativa: Lo Sviluppo di ChatPED, un Tutor Intelligente per Supportare Insegnanti e Studenti. Master’s Thesis, Department of Philosophy, Sociology, Education and Applied Psychology, University of Padova, Padova, Italy, 2023. [Google Scholar]
- Eggmann, F.; Blatz, M.B. ChatGPT: Chances and Challenges for Dentistry. Compend. Contin. Educ. Dent. 2023, 44, 220. [Google Scholar]
- Gilson, A.; Safranek, C.W.; Huang, T.; Socrates, V.; Chi, L.; Taylor, R.A.; Chartash, D. How does ChatGPT perform on the United States Medical Licensing Examination (USMLE)? The implications of large language models for medical education and knowledge assessment. JMIR Med. Educ. 2023, 9, e45312. [Google Scholar] [CrossRef]
- Kung, T.H.; Cheatham, M.; Medenilla, A.; Sillos, C.; De Leon, L.; Elepaño, C.; Madriaga, M.; Aggabao, R.; Diaz-Candido, G.; Maningo, J. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLoS Digit. Health 2023, 2, e0000198. [Google Scholar] [CrossRef]
- Kılınç, D.D.; Mansız, D. Examination of the reliability and readability of Chatbot Generative Pretrained Transformer’s (ChatGPT) responses to questions about orthodontics and the evolution of these responses in an updated version. Am. J. Orthod. Dentofac. Orthop. 2024, 165, 546–555. [Google Scholar] [CrossRef]
- Mohammad-Rahimi, H.; Khoury, Z.H.; Alamdari, M.I.; Rokhshad, R.; Motie, P.; Parsa, A.; Tavares, T.; Sciubba, J.J.; Price, J.B.; Sultan, A.S. Performance of AI chatbots on controversial topics in oral medicine, pathology, and radiology. Oral. Surg. Oral. Med. Oral. Pathol. Oral. Radiol. 2024, 137, 508–514. [Google Scholar] [CrossRef]
- Acar, A.H. Can natural language processing serve as a consultant in oral surgery? Stomatol. Maxillofac. Surg. 2024, 125, 101724. [Google Scholar] [CrossRef]
- Suárez, A.; Jiménez, J.; Llorente de Pedro, M.; Andreu-Vázquez, C.; Díaz-Flores García, V.; Gómez Sánchez, M.; Freire, Y. Beyond the Scalpel: Assessing ChatGPT’s potential as an auxiliary intelligent virtual assistant in oral surgery. Comput. Struct. Biotechnol. J. 2024, 24, 46–52. [Google Scholar] [CrossRef]
- Sinclair, P.; Kable, A.; Levett-Jones, T. The effectiveness of internet-based e-learning on clinician behavior and patient outcomes: A systematic review protocol. JBI Database Syst. Rev. Implement. Rep. 2015, 13, 52–64. [Google Scholar] [CrossRef]
- Cai, Y.; Zhao, R.; Zhao, H.; Li, Y.; Gou, L. Exploring the use of ChatGPT/GPT-4 for patient follow-up after oral surgeries. Int. J. Oral. Maxillofac. Surg. 2024, 53, 867–872. [Google Scholar] [CrossRef] [PubMed]
- Azadi, A.; Gorjinejad, F.; Mohammad-Rahimi, H.; Tabrizi, R.; Alam, M.; Golkar, M. Evaluation of AI-generated responses by different artificial intelligence chatbots to the clinical decision-making case-based questions in oral and maxillofacial surgery. Oral. Surg. Oral. Med. Oral. Pathol. Oral. Radiol. 2024, 137, 587–593. [Google Scholar] [CrossRef] [PubMed]
- Karobari, M.I.; Suryawanshi, H.; Patil, S.R. Revolutionizing oral and maxillofacial surgery: ChatGPT’s impact on decision support, patient communication, and continuing education. Int. J. Surg. 2024, 110, 3143–3145. [Google Scholar] [CrossRef]
- Santonocito, S.; Palazzo, G.; Indelicato, F.; Chaurasia, A.; Isola, G. Effects induced by periodontal disease on overall quality of life and self-esteem. Mediterr. J. Clin. Psychol. 2022, 10, 1. [Google Scholar]
- Balel, Y. Can ChatGPT be used in oral and maxillofacial surgery? J. Stomatol. Oral Maxillofac. Surg. 2023, 124, 101471. [Google Scholar] [CrossRef]
- Arif, T.B.; Munaf, U.; Ul-Haque, I. The future of medical education and research: Is ChatGPT a blessing or blight in disguise? Med. Educ. Online 2023, 28, 2181052. [Google Scholar] [CrossRef]
- Santonocito, S.; Cicciù, M.; Ronsivalle, V. Evaluation of the impact of AI-based chatbot on orthodontic patient education: A preliminary randomised controlled trial. Clin. Oral Investig. 2025, 29, 278. [Google Scholar] [CrossRef]
- Chew, H.S.J. The Use of Artificial Intelligence-Based Conversational Agents (Chatbots) for Weight Loss: Scoping Review and Practical Recommendations. JMIR Med. Inf. 2022, 10, e32578. [Google Scholar] [CrossRef]
- Danesh, A.; Pazouki, H.; Danesh, F.; Danesh, A.; Vardar-Sengul, S. Artificial intelligence in dental education: ChatGPT’s performance on the periodontic in-service examination. J. Periodontol. 2024, 95, 682–687. [Google Scholar] [CrossRef]
Database | Search String |
---|---|
PubMed | ((“Oral Surgical Procedures” [MeSH Terms] OR “oral surgery” [All Fields] OR “dental surgery” [All Fields] OR “maxillofacial surgery” [All Fields]) AND (“large language models” [All Fields] OR “LLM” [All Fields] OR “ChatGPT” [All Fields] OR “GPT” [All Fields] OR “generative pre-trained transformer” [All Fields] OR “pretraining language model” [All Fields] OR “AI language model” [All Fields] OR “chatbot” [All Fields] OR “natural language processing” [MeSH Terms] OR “natural language processing” [All Fields] OR “NLP” [All Fields])) |
Scopus | TITLE-ABS-KEY ((“oral surgery” OR “dental surgery” OR “maxillofacial surgery”) AND (“large language model” OR LLM OR ChatGPT OR GPT OR “generative pre-trained transformer” OR “pretraining language model” OR “AI language model” OR chatbot OR “natural language processing” OR NLP)) |
Web of Science | TS = (“oral surgery” OR “dental surgery” OR “maxillofacial surgery”) AND TS = (“large language model” OR LLM OR ChatGPT OR GPT OR “generative pre-trained transformer” OR “pretraining language model” OR “AI language model” OR chatbot OR “natural language processing” OR NLP) |
Google Scholar | “oral surgery” OR “dental surgery” OR “maxillofacial surgery” AND “large language model” OR LLM OR ChatGPT OR GPT OR “generative pre-trained transformer” OR “AI language model” OR chatbot OR “natural language processing” OR NLP |
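Searches like the PubMed string above can also be reproduced programmatically through NCBI's E-utilities `esearch` endpoint. The sketch below is ours, not part of the review's methods: the endpoint and parameter names are standard E-utilities conventions, and the query is abridged here for readability (the full Boolean string from the table can be passed unchanged):

```python
from urllib.parse import urlencode

# Abridged stand-in for the full PubMed search string above.
QUERY = (
    '("oral surgery"[All Fields] OR "maxillofacial surgery"[All Fields]) '
    'AND ("large language models"[All Fields] OR "ChatGPT"[All Fields])'
)

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(term: str, mindate: str = "2023",
                maxdate: str = "2025/04/08", retmax: int = 200) -> str:
    """Build an NCBI E-utilities esearch URL for a PubMed query,
    restricted by publication date (matching the review's window)."""
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",   # filter on publication date
        "mindate": mindate,
        "maxdate": maxdate,
        "retmax": retmax,
        "retmode": "json",
    }
    return BASE + "?" + urlencode(params)

url = esearch_url(QUERY)
```

No request is made here; fetching the URL (e.g., with `urllib.request`) returns JSON containing the matching PMIDs.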
Author | Year | Country | Title | Application Area | LLM Tool(s) | Sample Size | Methods | Assessment Scale | Key Results | Conclusion |
---|---|---|---|---|---|---|---|---|---|---|
Acar [21] | 2024 | Turkey | Can natural language processing serve as a consultant in oral surgery? | Clinical Q&A in oral surgery | ChatGPT 3.5 (OpenAI), Bing (Microsoft), Bard (Google) | 10 experts, 20 questions | Experts submitted 20 clinical questions to 3 chatbots | Five-point Likert Scale (accuracy/completeness of responses); Global Quality Scale (clarity of responses) | ChatGPT outperformed Bing and Bard in both accuracy and clarity, with statistically significant differences (p < 0.001). | ChatGPT provided more accurate and clear responses. |
Suárez et al. [22] | 2024 | Spain | Beyond the Scalpel: Assessing ChatGPT’s potential as an auxiliary intelligent virtual assistant in oral surgery | Clinical decision-making support in oral surgery | ChatGPT-4 (OpenAI) | 2 experts, 30 × 30 = 900 replies | Two oral surgeons evaluated ChatGPT-4’s accuracy and reliability as a virtual assistant for clinical decision-making in oral surgery. | Three-point Likert Scale (accuracy) | Accuracy reached 71.7%, with expert judgment consistency ranging from moderate to nearly perfect. | ChatGPT-4 could have potential as a virtual assistant in oral surgery. |
Cai et al. [24] | 2024 | China | Exploring the use of ChatGPT/GPT-4 for patient follow-up after oral surgeries | Patient follow-up and reassurance after oral surgery | ChatGPT-3.5 (OpenAI)/GPT-4 (OpenAI) | 3 experts, 30 questions | Three oral and maxillofacial surgeons tested chatbot replies to common follow-up questions | Score 0–10 (accuracy and advice quality) | ChatGPT-3.5/GPT-4 scored perfectly for medical accuracy and recommendation rationality, also effectively sensing and reassuring patient emotions. | ChatGPT/GPT-4 could be used for patient follow-up after oral surgeries. |
Azadi et al. [25] | 2024 | Iran | Evaluation of AI-generated responses by different artificial intelligence chatbots to the clinical decision-making case-based questions in oral and maxillofacial surgery | Clinical decision-making with case-based questions in OMFS | GPT-3.5 (OpenAI), GPT-4 (OpenAI), Claude-Instant (Anthropic), Bing (Microsoft), Bard (Google) | 3 experts, 50 case-based questions in multiple-choice (MCQ) and open-ended (OQ) formats | A group of 3 board-certified oral and maxillofacial surgeons evaluated answers to MCQs and open-ended questions | Modified Global Quality Scale (GQS) | Bard, GPT-3.5, GPT-4, Claude-Instant, and Bing answered 34%, 36%, 38%, 38%, and 26% of questions correctly, respectively. GPT-4 had the highest scores (“4” or “5”) on open-ended questions, while Bing had the most low scores (“1” or “2”). | LLM-based chatbots are not yet reliable consultants for clinical decision-making. |
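Suárez et al. [22] describe expert judgment consistency as ranging from "moderate to nearly perfect," wording that conventionally maps to the Landis–Koch bands for Cohen's kappa (0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1.00 almost perfect). A minimal sketch of unweighted Cohen's kappa for two raters, using hypothetical three-point Likert ratings rather than data from any included study:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b) -> float:
    """Unweighted Cohen's kappa for two raters over the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed proportion of agreement.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance-expected agreement from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical 3-point Likert ratings from two reviewers:
a = [3, 3, 2, 3, 1, 2, 3, 3, 2, 1]
b = [3, 3, 2, 2, 1, 2, 3, 3, 1, 1]
kappa = cohens_kappa(a, b)  # ≈ 0.69, the "substantial" band
```

With 8/10 raw agreements and 0.35 chance-expected agreement, kappa here is (0.80 − 0.35)/(1 − 0.35) ≈ 0.69.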
Aspect | Documented Advantages | Highlighted Limitations | Main Sources |
---|---|---|---|
Clinical Decision Support | Quick responses to common clinical questions; assistance in treatment planning | Accuracy not guaranteed in complex or non-standardized cases | Acar 2024 [21], Azadi 2024 [25] |
Operational Efficiency | Reduction in workload; automation of repetitive tasks | Dependence on prompt quality and risk of poorly contextualized information | Suárez 2024 [22], Cai 2024 [24] |
Intraoperative Assistance | Potential real-time support in managing intraoperative complications | Lack of expert clinical judgment and inability to react to unforeseen variables | Suárez 2024 [22] |
Patient Management (Pre/Post-op) | Standardized communication regarding surgical instructions and follow-up | Lack of empathy and inability to provide human reassurance | Cai 2024 [24], Suárez 2024 [22] |
Response Accuracy | High for frequently asked questions and standard cases | Low consistency in complex scenarios; potential contextual errors | Azadi 2024 [25] |
Patient Acceptability | Constant accessibility and availability | Concerns about the lack of human contact and personalization | — |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Ronsivalle, V.; Santonocito, S.; Cammarata, U.; Lo Muzio, E.; Cicciù, M. Current Applications of Chatbots Powered by Large Language Models in Oral and Maxillofacial Surgery: A Systematic Review. Dent. J. 2025, 13, 261. https://doi.org/10.3390/dj13060261