1. Introduction
Dental implant treatment has evolved over the years from an experimental approach into a reliable, highly predictable option for replacing missing teeth with implant-supported prostheses. Today, it is widely used in daily clinical practice for both complete and partial edentulism. Contemporary implant therapy offers significant functional and biological advantages over traditional fixed or removable prostheses for many patients [1].
Advancements and innovations in dental implant systems have enabled successful outcomes. However, the success of dental implants does not depend solely on the properties of the implant material; a detailed surgical plan, tailored to the structure of the existing bone and the type of prosthetic rehabilitation to be performed, is a critical factor for success [2]. Comprehensive preoperative planning, both surgical and prosthetic, enables accurate implant placement and restorations that meet functional and aesthetic expectations [2,3].
During diagnosis and treatment planning, the clinician must carefully evaluate both prosthetic and anatomical limitations when selecting an alveolar bone region of sufficient quality for the safe and ideal placement of implants [4]. Implant planning requires the collaboration of many specialists, including radiologists, oral and maxillofacial surgeons, periodontists, and prosthodontists. This multidisciplinary approach is essential for optimal planning; however, it is often time-consuming and does not always yield a single correct approach [5,6].
Recently, artificial intelligence (AI), particularly large language models (LLMs) such as ChatGPT (OpenAI, San Francisco, CA, USA), has attracted attention for its potential to enhance diagnostic decision-making and risk classification [7]. As in other healthcare fields, AI is becoming part of pre-surgical planning for dental implants [8,9]. Machine learning (ML), deep learning (DL), and artificial neural networks (ANNs), which are subfields of AI, offer a wide range of applications, from the analysis of dental images to treatment planning, optimization of implant design, and outcome prediction [10]. AI-supported systems contribute significantly to challenging clinical tasks such as identifying anatomical structures in cone-beam computed tomography (CBCT) images [11], implant brand identification [12], bone volume evaluation, and support of surgical protocols [13]. Furthermore, advanced applications such as predicting implant failure risks [14] and the development of robot-assisted surgical systems [15] demonstrate the potential of this technology. These developments enable personalized and data-driven approaches in implantology, making clinical decision-making processes more reliable and effective [16]. Although studies evaluating artificial intelligence in implant planning have recently appeared, they have mainly focused on true–false questions [17] or literature-based assessments [18]; the present study differs by comparing AI responses to real clinical scenarios with participants' treatment-planning decisions, providing a more practice-oriented evaluation.
This study aimed to compare implant treatment planning decisions made by prosthodontic and oral and maxillofacial surgery residents with those generated by ChatGPT 4.0. In this context, the consistency of approaches between the two disciplines, the differences in decision-making between ChatGPT and the clinician groups, the clinical criteria prioritized by the groups, and the proposed treatment plans were analyzed. The null hypothesis was that the clinical decision-making tendencies of clinicians from different disciplines and the clinical decision-making capabilities of AI would not differ from each other.
4. Discussion
Clinical decision-making in implant planning requires consideration of multiple variables, making it unlikely for clinicians to agree on a single treatment option. With the increasing integration of artificial intelligence (AI) into various fields, including healthcare, the potential for AI-based language models such as ChatGPT 4.0 to guide clinical decision-making in implant-supported treatment planning warrants exploration.
The null hypothesis of this study proposed that the clinical decision-making tendencies of artificial intelligence and clinicians from different disciplines would not differ significantly from each other. The findings led to the rejection of this hypothesis: although no significant differences were observed between the two clinician groups, statistically significant differences were noted between AI-generated and human decisions in most scenarios.
The survey scenarios were designed to reflect commonly encountered clinical cases that allow multiple treatment plans. Panoramic radiographs were provided to facilitate the decision-making process, while tomographic data and relevant clinical details were summarized in written form to minimize data load and standardize the evaluation process.
To reflect the complexity of clinical decision-making while ensuring a structured comparison, responses were limited to four predetermined options for each case, and participants were asked to select the option most closely aligned with their own preferred plan. The primary aim was not to identify “right vs. wrong” answers but to compare patterns of decision-making. Specifically, we aimed to assess whether ChatGPT aligns with human clinicians in prioritizing surgical vs. prosthetic considerations when multiple valid treatment pathways exist.
The diversity in surgical and prosthetic considerations across questions enabled the analysis of prioritization patterns between AI and human decision-making. For example, in posterior cases with limited bone support, surgical parameters often took precedence, whereas anterior cases required prioritization of prosthetic and aesthetic considerations. In certain questions, alternative options were provided to reflect either conservative or more invasive approaches within surgical planning. Similarly, alternatives regarding abutment type and material selection were presented, aiming to assess whether clinicians and the AI model would consider potential long-term complications of the prosthesis in their decision-making. This diversity within the scenarios may be beneficial for analyzing which factors are prioritized by both clinicians and the AI model. Moreover, such variability allows for a clear evaluation of the reliability of AI responses when managing complex clinical situations.
We observed that ChatGPT sometimes gave different answers to the same case when queried at different times. This occurs because language models generate responses by sampling from probability distributions rather than by applying fixed rules. While such variability can be useful in general conversation, it is problematic for clinical work, where reliable and repeatable decisions are essential. For this reason, further studies are needed to make AI responses more stable and consistent before they can be trusted in implant planning.
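To illustrate this reproducibility concern, the sketch below shows one way repeated querying could be used to quantify answer variability. It is a minimal sketch assuming the OpenAI Python SDK (openai>=1.0); the model name, prompt text, and temperature values are illustrative and were not the protocol used in this study.

```python
# A minimal sketch (assumed workflow, not the study protocol) of quantifying
# answer variability by repeating the same query. Model name and prompt are
# illustrative.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CASE_PROMPT = (
    "Partially edentulous maxilla with limited posterior bone height "
    "(hypothetical case details would go here). Choose exactly one option "
    "(A, B, C, or D); reply with the letter only."
)

def ask_once(temperature: float) -> str:
    """Send the case prompt once and return the model's selected option."""
    response = client.chat.completions.create(
        model="gpt-4o",           # illustrative; the study used ChatGPT 4.0
        temperature=temperature,  # lower values make sampling more deterministic
        messages=[{"role": "user", "content": CASE_PROMPT}],
    )
    return response.choices[0].message.content.strip()

# Ask the same question ten times and tally the answers.
answers = Counter(ask_once(temperature=1.0) for _ in range(10))
print(answers)  # e.g., Counter({'C': 7, 'A': 3}) -- sampling-driven variability
```

Setting the temperature near zero reduces, but does not formally guarantee, run-to-run consistency, which is why protocol-level repetition checks of this kind are useful.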
Since implant treatment planning requires both surgical and prosthetic competence, this study evaluated the two disciplines separately. No significant difference was found between them, which may be because both groups were trained within the same faculty. Similar studies conducted in different dental faculties might reveal interdisciplinary differences in clinical decision-making; further studies are needed to determine this.
In 11 of the 14 questions, a significant difference was found between human and artificial intelligence responses. This difference may be related to differences in how knowledge is acquired and to the varying degree of difficulty across the questions.
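The per-question comparisons referred to above can be illustrated with a hedged sketch. The study's exact statistical test is not restated here, so the example assumes a chi-square test of independence on a group-by-option contingency table; all counts are hypothetical.

```python
# A hedged sketch of a per-question significance test on the group-by-option
# distribution. A chi-square test of independence is assumed; counts are
# hypothetical, not the study data.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: prosthodontics residents, OMFS residents, repeated ChatGPT queries;
# columns: options A-D for one case.
table = np.array([
    [2, 3, 10, 5],   # prosthodontics residents
    [1, 4,  9, 6],   # oral and maxillofacial surgery residents
    [0, 0, 20, 0],   # ChatGPT, concentrated on a single option
])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
# A small p-value indicates the option distributions differ across groups.
```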
In clinical scenarios where no statistically significant differences were found, both human and artificial intelligence tended to concentrate on the same choices. On examination, these scenarios involved less complicated partial edentulism cases. Such cases, which clinicians frequently encounter, may also be easier for artificial intelligence to process and may be better represented in the literature.
According to Cohen's kappa analysis, agreement between the AI and the prosthodontics group was fair (κ = 0.270), as was agreement between the AI and the oral and maxillofacial surgery group (κ = 0.213). In contrast, substantial agreement was observed between the prosthodontics and oral and maxillofacial surgery groups (κ = 0.618). These findings suggest that while human evaluators from different specialties demonstrated a relatively high level of consistency in their assessments, the AI system showed limited alignment with either expert group. This discrepancy may highlight the need for further refinement of AI algorithms to better reflect expert clinical judgment, especially when applied across different dental specialties.
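For readers unfamiliar with the metric, Cohen's kappa corrects raw agreement for agreement expected by chance, κ = (p_o − p_e)/(1 − p_e). A minimal sketch of the pairwise computation is shown below, assuming one selected option per case for each rater; the rating vectors are hypothetical, not the study data.

```python
# A minimal sketch of pairwise Cohen's kappa, assuming one selected option
# per case per rater. Vectors are hypothetical, not the study data.
# kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e
# is the agreement expected by chance.
from sklearn.metrics import cohen_kappa_score

# One option ("A"-"D") per case (14 cases) for each rater.
prostho = ["D", "C", "B", "A", "C", "C", "D", "B", "C", "A", "C", "C", "B", "A"]
omfs    = ["D", "C", "B", "A", "C", "D", "D", "B", "C", "A", "C", "C", "A", "A"]
chatgpt = ["C", "C", "A", "A", "C", "C", "C", "C", "A", "A", "C", "C", "A", "C"]

print("prostho vs omfs:   ", round(cohen_kappa_score(prostho, omfs), 3))
print("chatgpt vs prostho:", round(cohen_kappa_score(chatgpt, prostho), 3))
print("chatgpt vs omfs:   ", round(cohen_kappa_score(chatgpt, omfs), 3))
```

In the study itself, ratings were presumably aggregated across multiple residents per group; the single-vector form above is a simplification for illustration.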
In Case 1, AI predominantly chose option C, while human participants chose option D. AI's preference for option C, which involved a closed sinus lift and non-splinted crowns, may indicate that it approaches the case differently from clinicians from both surgical and prosthetic perspectives. AI, being literature-focused, may have leaned toward minimally invasive surgical procedures and designs closer to the natural dentition. In contrast, human intelligence, based on experience, may have prioritized providing a safer surgical field through an open sinus lift and improving biomechanical load distribution by splinting the crowns.
In Case 2, although option C was the most preferred across all three groups, there was a statistically significant difference in the distribution between human and artificial intelligence. AI consistently chose option C, while human participants also selected other options. This may be due to the extensive evidence in the literature suggesting that increasing the number of implants enhances treatment success. Additionally, fixed provisional prostheses may have been considered more advantageous than removable dentures in terms of aesthetics and function.
In Case 3, while AI chose option A, planning tilted implants with a cantilever-supported fixed prosthesis, the residents chose option B, planning posterior support using grafts and zygomatic implants. It can be inferred that AI tended to avoid complex surgical procedures, favoring minimally invasive methods instead. Additionally, given the limited data available in the literature regarding zygomatic implants, AI may have opted for more commonly documented solutions. Human intelligence, on the other hand, may have leaned toward more aggressive yet contemporary surgical methods in certain cases, based on clinical experience.
In Case 4, while AI and human intelligence preferred the same planning for the maxilla, they differed in their approaches for the mandible. AI preferred a more conservative approach by preserving some of the lower teeth, whereas human intelligence, possibly using clinical foresight to prevent potential future complications, opted for a more radical approach, planning for full-arch implant-supported prostheses after extracting all lower teeth. Additionally, AI may have preferred using short implants to avoid cantilevers, while clinicians tended to place distally tilted implants to design prostheses with cantilevers instead.
In Case 7, although option C was the most preferred in all groups, a significant distribution difference was found between human and artificial intelligence. This could be because clinicians varied their abutment choices based on individual experience with clinical variables such as interocclusal space and retention, whereas AI may have responded solely on the basis of the available data. While AI may have considered non-splinted crowns advantageous for their natural anatomy in Case 1, in Case 7 it may have preferred splinted crowns in the posterior region to optimize load distribution. This indicates that AI can provide different answers for similar clinical cases, raising questions about its reliability in clinical decision-making.
In Case 8, AI predominantly chose option C, likely opting for a less invasive and faster solution by avoiding additional surgical procedures and using short implants instead of sinus lifts. Clinicians, on the other hand, concentrated on recommending standard implants with sinus lifts, considering them more predictable than short implants on the basis of case safety and clinical experience.
In Case 10, although all three groups predominantly recommended immediate loading (option A), some clinicians also preferred delayed loading (option B), which led to a statistically significant distribution difference. In this case, AI may have recommended immediate loading based on primary stability criteria in the literature, while clinicians considered the risks of biological complications and opted for more traditional loading protocols. This reflects the differing priorities among groups in the clinical decision-making process.
In Case 11, which involved anterior esthetics and immediate implantation, clinicians predominantly chose options C and D, while AI chose option A. Clinicians may have aimed to avoid the risk of placing implants across the entire area in an anterior region with a history of lesions, or, guided by clinical intuition, may have leaned toward treatments with fewer implants or toward plans such as option D, which involve waiting for osseointegration to reduce complications. AI, on the other hand, may have leaned toward option A, which aligns with the current literature emphasizing the success of immediate implantation with immediate provisional prostheses in the esthetic zone, aiming to place as many implants as possible.
In Case 12, although option C was the most preferred across all groups, a significant distribution difference was found between human and artificial intelligence. In this question evaluating abutment types in single-implant restorations, clinicians, based on varied clinical experiences, opted for different choices, whereas AI consistently preferred single crown restorations on Ti-base abutments, which are frequently highlighted in the recent literature.
In Case 13, involving multiple evaluations regarding implant numbers, cantilever lengths, and prosthetic material selection for full-arch fixed prostheses, AI preferred an option with fewer implants and shorter cantilevers using a Ti-bar, while human intelligence opted for restorations with more implants and longer cantilevers. This indicates potential differences in how human and artificial intelligence prioritize factors during surgical and prosthetic planning. The fact that AI, which in earlier cases tended to increase implant numbers, here opted for fewer implants with reduced cantilevers again raises questions about the consistency of AI in clinical decision-making.
In Case 14, AI recommended placing six implants in each arch and fabricating a single-piece monolithic zirconia prosthesis on multi-unit abutments. Clinicians, however, preferred a three-piece prosthesis, considering factors such as long-term use in fully edentulous patients, the feasibility of segmental approaches during localized surgical or prosthetic interventions, and the reduced technical sensitivity during laboratory procedures compared to screw-retained structures. This difference reveals the contrast between AI systems' optimization approach, which focuses on production and durability, and clinical practice, which prioritizes patient-centered, feasible, and sustainable solutions.
It should be noted that the AI model operated on text-based decision-making and lacked the capability to directly assess patient behavior, examination findings, and clinical intuition. While the AI model's data access was limited to online information, the clinicians had access not only to online data but also to printed sources, training materials, clinical experience, and various academic publications, which may have influenced their decision-making. The observed tendency of ChatGPT toward conservative treatment options reflects this fundamental divergence from clinicians' clinical intuition and individualized decision-making. Because large language models such as ChatGPT are trained on accessible literature and similar sources, in which conservative approaches are often prioritized, the model is likely to rate options that minimize potential complications as more optimal and to adopt this cautious stance as a reference point. This inclination may be balanced in the future through richer integration of patient-specific data and the development of more context-sensitive algorithms.
This study aimed to evaluate decision-making processes in implant planning but has several limitations. First, scenarios were limited to written case descriptions and panoramic images; complementary clinical data such as 3D radiographic images, clinical photographs, and intraoral examination findings were not included, which may have influenced participants’ planning differently than in real clinical settings. It should be noted that panoramic radiographs alone do not fully reflect real-world practice, where CBCT imaging and comprehensive clinical examinations are essential for accurate treatment planning.
Elgarba et al. [9] reported that CBCT-based AI implant planning achieved quality comparable to that of human experts while being substantially faster and more consistent. In contrast, in the present study, which relied on panoramic radiographs and written case summaries, ChatGPT produced more conservative and significantly different decisions in most cases, likely owing to the limited input data. These findings suggest that task-specific models trained on three-dimensional datasets can approach human performance, whereas large language models remain more cautious and literature-driven. Moreover, this study evaluated only the responses of the ChatGPT 4.0 model. Future studies incorporating different AI models in comparative settings could more comprehensively evaluate AI's contribution to decision-making processes in implant planning.
Additionally, due to the limited number of studies in the literature directly comparing AI and human expert preferences in dental implant planning, the findings of this study should be considered exploratory, and larger-scale studies are needed in the future.