1. Introduction
Artificial intelligence (AI) is now widely used in healthcare, medical and dental education, and many aspects of everyday life [1, 2]. Recent large language models (LLMs), such as ChatGPT, have expanded AI’s role beyond image-based diagnostics to include natural language processing, knowledge synthesis, and educational support [3, 4]. As AI becomes more common in educational settings, it is increasingly important to reconsider how learning and assessment are approached in healthcare professions [5].
In dentistry, artificial intelligence has been increasingly applied to diagnostic imaging, treatment planning, and digital prosthetic workflows [6, 7, 8]. More recently, researchers have examined how ChatGPT may support dental education, particularly in undergraduate courses and assessments [9, 10]. One study reported that although students found AI helpful and believed it improved productivity, traditional evidence-based learning still yielded better academic outcomes [10].
Evidence from medical and dental education suggests that LLMs may perform at a level comparable to, or in some cases exceeding, that of students when assessed using structured formats, such as multiple-choice questions and progress tests [11, 12, 13, 14, 15, 16, 17].
Previous studies have shown that advanced AI models can achieve high levels of accuracy in major examinations. For instance, ChatGPT reportedly passed a European certification examination in implant dentistry and, in some cases, outperformed licensed dentists, although performance varied depending on the model version used [11]. Nevertheless, important concerns remain regarding the reliability of AI-generated responses, as these systems may present incorrect information with high confidence, underscoring the ongoing need for human oversight in AI-assisted education [2].
The postgraduate education of a prosthodontist requires the integration of advanced theoretical knowledge, evidence-based thinking, and sound clinical judgement. As prosthodontic residents come from diverse educational and training backgrounds, variations in their knowledge levels may be observed. Such variation can be addressed through standardized assessments, and large language models may assist in this process by supporting more consistent responses to structured questions [3, 14].
Interest in AI-enabled learning has increased; however, few studies have directly compared the performance of prosthodontic residents and large language models using the same assessment instruments. Most dental studies have focused on undergraduate education or evaluated a single AI-enabled tool, thereby limiting the generalizability of findings to postgraduate specialties [9, 10]. Consequently, evidence remains limited regarding how large language models perform relative to learners at the postgraduate specialty-training level, particularly in prosthodontics, and further investigation is needed to understand how AI can be integrated into prosthodontic residency training. This comparison is educationally relevant because it may help contextualize the strengths and limitations of these systems within a discipline-specific postgraduate training environment. The novelty of the present study lies in benchmarking multiple contemporary large language models against postgraduate prosthodontic trainees across predefined knowledge areas within a single standardized assessment framework.
The aim of this study was to compare the knowledge of prosthodontic residents and large language models by evaluating their performance on multiple-choice questions in prosthodontics. In addition, this study aimed to investigate how artificial intelligence could be used as a supportive tool in postgraduate prosthodontic education.
Research Questions
This study was designed to address the following research questions:
Is there a difference in response accuracy between prosthodontic residents and large language models when answering standardized multiple-choice questions related to prosthodontics?
Do prosthodontic residents and large language models differ in their performance across different question categories, including basic knowledge, general dentistry, and advanced-level prosthodontic specialty questions?
Is there an association between the duration of prosthodontic specialty training and the response accuracy of residents?
Null Hypothesis (H0)
H0. There is no statistically significant difference in response accuracy between prosthodontic residents and large language models when answering basic knowledge, general dentistry, and advanced-level prosthodontic specialty questions in a standardized multiple-choice exam.
2. Materials and Methods
Ethical approval for this study was obtained from the Çankırı Karatekin University Health Sciences Ethics Committee (Çankırı, Türkiye) (Meeting No: 17, dated 4 December 2024). All study procedures were conducted in accordance with the principles of the Declaration of Helsinki. Participation was voluntary, and written informed consent was obtained from all participants prior to data collection.
This study was designed as a cross-sectional comparative analysis to evaluate performance differences between prosthodontic residents and large language model-based systems using a standardized examination framework. Two comparison groups were included. The human group consisted of prosthodontic residents (N = 32) undergoing specialty training in the Departments of Prosthodontics at the faculties of dentistry. The AI group comprised seven large language model-based chatbots: ChatGPT-4o, ChatGPT-o1, ChatGPT-o3-mini, Claude Sonnet 3.7, Gemini 2.5 Pro, Microsoft Copilot (web interface, https://copilot.microsoft.com/, accessed in August 2025), and DeepSeek V3.
Inclusion criteria were active enrollment in prosthodontic residency training at the time of the study and voluntary participation in the examination. Exclusion criteria included non-resident status and incomplete participation in the examination. All participants had completed undergraduate dental education and were undergoing postgraduate specialty training in prosthodontics. The year of specialty training was recorded to assess whether performance differed according to training stage.
A 30-item multiple-choice test was administered to the participants. The test consisted of four demographic questions (age, gender, year of residency training, and university) and 26 knowledge-based questions related to prosthodontics.
The knowledge-based questions were divided into three content areas: general dentistry (10 questions), basic prosthodontics (8 questions), and advanced prosthodontic specialty knowledge (8 questions). The number of questions in each area was determined using domain sampling principles to ensure content validity. The general dentistry section was aligned with the emphasis on clinical knowledge commonly assessed in national postgraduate entrance examinations, where basic clinical competencies constitute a major focus. For the basic and advanced prosthodontic sections, the questions were designed to reflect the structure of thesis defenses and oral examinations used in postgraduate dental education in Türkiye [18]. Following thesis submission, prosthodontic residents are evaluated by a jury of three or four faculty members, each of whom usually poses multiple questions during the defense. Accordingly, including eight questions for each prosthodontic section was intended to reflect the level of challenge, examiner diversity, and depth of evaluation characteristic of this national assessment format. This approach was adopted to enhance the representativeness and content validity of the assessment across different levels of prosthodontic expertise [18, 19, 20].
The examination items were developed and content validated by three independent Associate Professors of Prosthodontics with extensive academic and clinical experience. Item development followed a structured, multi-expert protocol, consistent with established recommendations for generating validity evidence based on test content.
The questionnaire was designed primarily to assess theoretical knowledge in a structured single-best-answer format. Although some items were clinically oriented in content, the examination was not intended to simulate authentic patient-based clinical decision-making, procedural performance, or real-time treatment planning.
The development process consisted of three main stages. First, during individual item drafting, each expert independently generated an initial pool of questions based on accredited prosthodontic curricula and standard reference textbooks. Second, all draft items underwent peer review by the expert panel to evaluate scientific accuracy, clinical relevance, alignment with predefined content domains, and appropriateness of difficulty level. Finally, item selection was completed during a structured consensus meeting. Only items that achieved unanimous agreement (100%) among the experts regarding accuracy, clarity, and domain alignment were retained, resulting in a final examination comprising 26 knowledge-based questions [21].
All items were presented in a single-best-response format, allowing objective scoring and direct comparison between human participants and large language models (LLMs). The final 26-item test was administered to seven LLMs in August 2025 using a standardized zero-shot prompting procedure. To minimize contextual bias and carry-over effects, each question was manually entered as an independent query in a newly initiated chat session, so that no prior conversational context could influence subsequent responses. No examples, follow-up prompts, reformulations, or additional contextual guidance were provided, and each item was treated as a standalone input. AI-generated responses were recorded verbatim and scored dichotomously using a predefined answer key (1 = correct, 0 = incorrect), with no penalty for incorrect responses. This standardized approach was intended to ensure that model performance reflected baseline response behavior under minimal prompting conditions rather than effects related to prompt engineering or conversational memory. Because each model was queried once per item during the study period, the findings should be interpreted as a time-specific snapshot of performance under those access conditions.
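The dichotomous scoring rule described above can be illustrated with a minimal Python sketch. The item identifiers, domain assignments, and answer options shown here are hypothetical placeholders and do not correspond to the actual examination items or answer key.

```python
from collections import defaultdict

# Hypothetical answer key: item id -> (domain, correct option).
# These entries are illustrative only, not the study's real key.
ANSWER_KEY = {
    "Q1": ("general", "B"),
    "Q2": ("general", "D"),
    "Q3": ("basic", "A"),
    "Q4": ("advanced", "C"),
}


def score_responses(responses, answer_key=ANSWER_KEY):
    """Score one respondent dichotomously (1 = correct, 0 = incorrect).

    Wrong or missing answers score 0; there is no negative marking.
    Returns the total score and a dict of per-domain scores.
    """
    per_domain = defaultdict(int)
    for item, (domain, correct) in answer_key.items():
        given = responses.get(item, "").strip().upper()
        per_domain[domain] += int(given == correct)
    return sum(per_domain.values()), dict(per_domain)


# The same function applies to a resident's answer sheet or to the
# options recorded verbatim from one LLM's responses.
total, by_domain = score_responses({"Q1": "b", "Q2": "A", "Q3": "A", "Q4": "C"})
```

Applying one scoring function to both groups mirrors the use of a single predefined answer key for residents and LLMs.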
To maximize comparability, the same 26 questions, answer options, and scoring key were used for both residents and LLMs. Nevertheless, the testing environments were not identical: residents completed the assessment under supervised and time-limited examination conditions, whereas LLMs were queried under non-human, session-based interface conditions without fatigue, test anxiety, or conventional time pressure.
All prosthodontic residents completed the examination simultaneously in a supervised classroom setting with a 50-min time limit. This approach ensured consistent testing conditions, minimized the risk of information sharing among participants, and enhanced the comparability and reliability of the collected responses. Time-limited, group-based multiple-choice examinations have been recommended in medical and dental education to support reliable assessment and reduce measurement bias [21].
Statistical Analysis
The required sample size was determined a priori using G*Power version 3.1 (Heinrich Heine University, Düsseldorf, Germany). The analysis was conducted to assess the adequacy of the human participant sample size for this examination-based comparative study, assuming a two-tailed design, a significance level (α) of 0.05, a statistical power (1 − β) of 0.80, and a moderate effect size (Cohen’s d = 0.50). Based on these parameters, a minimum total sample size of 30 participants was required. Accordingly, the inclusion of 32 prosthodontic residents met the calculated power requirement [22].
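To show how the stated parameters (α, power, and Cohen’s d) combine into a sample size, the following sketch uses the textbook normal approximation for a one-sample, two-tailed comparison. The choice of test family is an assumption, since it is not specified in the text, and the function name is hypothetical. The sketch is not intended to reproduce the G*Power output reported above; exact calculations based on the noncentral t distribution generally return somewhat different values.

```python
import math
from statistics import NormalDist


def n_normal_approx(d, alpha=0.05, power=0.80):
    """Approximate n for a one-sample, two-tailed test (assumed design).

    Uses n = ((z_{1 - alpha/2} + z_{power}) / d) ** 2, rounded up.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / d) ** 2)


# Parameters taken from the text: alpha = 0.05, power = 0.80, d = 0.50.
n_required = n_normal_approx(0.50)
```

Under this approximation, the required n grows with the inverse square of the effect size, so halving d roughly quadruples the sample size.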
To avoid pseudoreplication, the primary unit of analysis for the human group was defined as the individual resident rather than each item response. For each resident, total test score (0–26) and domain-specific scores were calculated for basic prosthodontic knowledge (0–8), advanced prosthodontic specialty knowledge (0–8), and general dentistry (0–10). Resident-level performance was summarized using the mean, standard deviation, median, interquartile range, and 95% confidence intervals.
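A minimal sketch of the resident-level summary is given below, using only the Python standard library. The scores are hypothetical and serve only to illustrate the calculation; the quartile method shown is Python’s default (`exclusive`), which is an assumption and may differ from the percentile definition applied by the statistical software used in the study.

```python
import statistics


def summarize(scores):
    """Summarize one score per resident (the unit of analysis)."""
    q1, median, q3 = statistics.quantiles(scores, n=4)
    return {
        "n": len(scores),
        "mean": statistics.mean(scores),
        "sd": statistics.stdev(scores),  # sample SD (n - 1 denominator)
        "median": median,
        "iqr": (q1, q3),
    }


# Hypothetical total scores (0-26) for eight residents.
scores = [12, 15, 17, 13, 16, 14, 18, 11]
summary = summarize(scores)
```

Because each resident contributes a single total score, the 26 item responses of one person are never treated as independent observations.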
Because each large language model generated a single set of responses, LLM performance was reported descriptively as the number and percentage of correct answers overall and by domain. Differences between the resident mean score and each LLM score were presented descriptively with 95% confidence intervals. Associations between residents’ scores and year of specialty training were assessed at the resident level. Normality was evaluated using the Shapiro–Wilk test, and correlations were examined using Kendall’s Tau-b. Statistical analyses were performed using IBM SPSS Statistics version 27.0, and p < 0.05 was considered statistically significant.
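For readers unfamiliar with Kendall’s Tau-b, the following sketch implements the coefficient directly from its definition, including the correction for ties that makes it suitable for a coarse variable such as year of training. The function name and the example data are hypothetical; the analyses in this study were performed in SPSS, not with this code.

```python
import math
from collections import Counter


def kendall_tau_b(x, y):
    """Kendall's tau-b: (C - D) / sqrt((n0 - n1) * (n0 - n2)).

    C and D count concordant and discordant pairs, n0 = n(n - 1) / 2,
    and n1, n2 count the pairs tied on x and on y, respectively.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    n0 = n * (n - 1) // 2
    n1 = sum(t * (t - 1) // 2 for t in Counter(x).values())
    n2 = sum(t * (t - 1) // 2 for t in Counter(y).values())
    denominator = math.sqrt((n0 - n1) * (n0 - n2))
    if denominator == 0:
        raise ValueError("tau-b is undefined when a variable is constant")
    return (concordant - discordant) / denominator


# Hypothetical data: year of training (with ties) and total score.
tau = kendall_tau_b([1, 1, 2, 2], [1, 2, 3, 4])
```

Pairs tied on either variable are counted as neither concordant nor discordant, and the denominator shrinks accordingly.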
3. Results
Thirty-two prosthodontic residents participated in the study. The demographic characteristics of the participants, including gender and year of specialty training, are presented in Table 1.
Overall, most participants were female. With respect to the year of specialty training, residents were distributed across different training stages, with the largest proportion in the second year of residency, whereas residents at the 2.5-year stage constituted the smallest group.
3.1. Resident-Level Performance
At the resident level, the mean total score was 14.50 ± 2.88 out of 26 (median: 15; IQR: 13–17; 95% CI: 13.46–15.54). Domain-specific mean scores were 4.84 ± 1.44 for basic prosthodontic knowledge, 2.91 ± 1.28 for advanced prosthodontic specialty knowledge, and 6.75 ± 1.46 for general dentistry (Table 2).
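The reported confidence interval for the mean total score can be checked from the summary statistics alone. The sketch below assumes a t-based interval with n − 1 = 31 degrees of freedom; the critical value is hard-coded as a constant because the Python standard library does not provide a Student’s t quantile function, and the function name is hypothetical.

```python
import math

# Two-sided 95% critical value of Student's t with 31 degrees of freedom.
T_CRIT_DF31 = 2.0395


def mean_ci(mean, sd, n, t_crit):
    """t-based confidence interval for a mean from summary statistics."""
    margin = t_crit * sd / math.sqrt(n)
    return mean - margin, mean + margin


# Reported values: mean 14.50, SD 2.88, n = 32 residents.
low, high = mean_ci(14.50, 2.88, 32, T_CRIT_DF31)
```

Rounded to two decimals, the bounds agree with the interval of 13.46–15.54 given in the text.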
3.2. LLM Benchmark Performance
LLM total scores ranged from 17/26 to 21/26. In the advanced prosthodontic specialty domain, all LLMs scored higher than the resident mean, with model scores ranging from 6/8 to 7/8 compared with a resident mean of 2.91/8. In contrast, differences were smaller in the basic prosthodontic knowledge and general dentistry domains. As each LLM generated a single set of responses, the results were interpreted descriptively as benchmark scores rather than as participant-level observations (Table 2).
3.3. Comparison of Response Accuracy Across Study Groups
The distribution of correct and incorrect responses across study groups is presented in Table 3.
A statistically significant association was observed between study groups and response accuracy for advanced prosthodontic specialty questions (χ2 = 27.175, p < 0.001). The human group demonstrated a lower correct response rate (41.5%) compared with all LLMs. The highest accuracy rates (87.5%) were observed for ChatGPT-o3-mini, Claude Sonnet 3.7, and Microsoft Copilot, whereas ChatGPT-4o, ChatGPT-o1, Gemini, and DeepSeek achieved correct response rates of 75%.
No statistically significant differences were observed between the human and artificial intelligence groups in responses to basic prosthodontic knowledge, general dentistry, or total questions (p > 0.05) (Table 3).
3.4. Effect Size Analysis
Effect size analysis using Cramér’s V is presented in Table 4.
A moderate effect size was observed for advanced prosthodontic specialty questions (V = 0.316, p < 0.001), indicating a meaningful association between study groups and response accuracy.
In contrast, small and non-significant effect sizes were observed for basic prosthodontic knowledge (V = 0.166, p = 0.328), general dentistry (V = 0.122, p = 0.582), and total scores (V = 0.109, p = 0.127), suggesting limited practical differences between study groups in these domains.
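As a reminder of how Cramér’s V relates to the chi-square statistic, the sketch below computes both for a contingency table of groups by response accuracy, using V = sqrt(χ2 / (N · min(r − 1, c − 1))). The counts are hypothetical and are not the study data; the function name is likewise illustrative, and no continuity correction is applied.

```python
import math


def chi_square_and_cramers_v(table):
    """Pearson chi-square statistic and Cramer's V for an r x c table.

    `table` is a list of rows of observed counts; all expected counts
    are assumed to be non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return chi2, math.sqrt(chi2 / (n * k))


# Hypothetical counts: rows = two groups, columns = correct / incorrect.
chi2, v = chi_square_and_cramers_v([[10, 20], [20, 10]])
```

Because V rescales χ2 by the total number of responses, it allows the strength of association to be compared across domains with different numbers of items.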
3.5. Association with Year of Training
No statistically significant association was observed between year of specialty training and resident performance across any domain (Kendall’s Tau-b, all p > 0.05) (Table 5).
These findings indicate that progression in specialty training was not associated with improved performance in either basic or advanced prosthodontic knowledge domains.
The difference between the resident mean total score and LLM total scores ranged from −2.50 to −6.50 points. For the advanced prosthodontic specialty domain, the difference between the resident mean and LLM scores ranged from −3.09 to −4.09 points, with all 95% confidence intervals excluding zero.
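The statement that the advanced-domain intervals exclude zero can be illustrated with the reported summary statistics. The sketch below treats each LLM score as a fixed benchmark and places a t-based interval around the resident mean; this is an assumption about how such an interval may be formed, since the exact procedure is not described in the text, and the function name and hard-coded critical value are illustrative.

```python
import math

# Two-sided 95% critical value of Student's t with 31 degrees of freedom.
T_CRIT_DF31 = 2.0395


def diff_ci(resident_mean, resident_sd, n, llm_score, t_crit=T_CRIT_DF31):
    """Difference (resident mean - fixed LLM score) with a t-based CI."""
    diff = resident_mean - llm_score
    margin = t_crit * resident_sd / math.sqrt(n)
    return diff, diff - margin, diff + margin


# Advanced domain: resident mean 2.91, SD 1.28, n = 32; LLM scores 6 and 7.
results = [diff_ci(2.91, 1.28, 32, llm_score) for llm_score in (6, 7)]
```

Under this assumption, the upper bound of each interval remains below zero for both the lowest and the highest LLM score in the advanced domain.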
4. Discussion
Prosthodontic training integrates practical, case-based clinical experience with structured theoretical instruction, thereby supporting the development of both clinical competence and foundational knowledge. In this study, the performance of prosthodontic residents and large language models (LLMs) was compared, and the potential influence of specialty training duration on response accuracy was examined.
With respect to the first research question, LLMs performed comparably to prosthodontic residents in basic knowledge and general dentistry, while achieving higher accuracy on advanced prosthodontic specialty questions. This finding is consistent with previous studies indicating that LLMs perform well on structured assessments that emphasize standardized knowledge and factual recall rather than higher-order clinical reasoning [23].
Regarding the second research question, a statistically significant difference was observed in responses to advanced-level prosthodontic specialty questions, with large language models achieving higher correct-response rates than prosthodontic residents. Advanced theoretical questions often require access to a broad and continuously expanding knowledge base, including current clinical guidelines and specialized literature. Because LLMs are trained on large-scale textual datasets, they are well-suited to retrieving and synthesizing such information efficiently. Similar studies in medical and dental education have reported that artificial intelligence performs well on theoretical examinations that require domain-specific expertise [23]. It is essential to clarify that such proficiency primarily reflects advanced informational retrieval capacity rather than the clinical expertise characteristic of a trained practitioner.
More recently, Salem et al. reported that ChatGPT-3.5 and ChatGPT-4o achieved high accuracy when answering International Team for Implantology (ITI) examination questions, thereby supporting the capacity of large language models to perform well in advanced, specialty-level theoretical assessments [24]. Rather than reflecting simple factual recall, this performance is likely related to LLMs’ ability to integrate information across broad, heterogeneous sources in the medical and dental literature. Prosthodontic residents typically develop their knowledge within structured curricula and predefined clinical rotations, whereas LLMs can rapidly synthesize current clinical guidelines and infrequently encountered case reports from multiple sources, often in near-real time [14, 17, 20, 23].
Despite these findings, it is important to recognize that strong performance in multiple-choice examinations primarily reflects theoretical knowledge rather than overall clinical competence. Although LLMs demonstrated higher accuracy in advanced theoretical questions, this should not be interpreted as evidence of superior clinical competence, as real-world prosthodontic decision-making requires contextual judgment, hands-on skills, and patient-centered considerations that are not captured by MCQ-based assessments.
An additional limitation is the imperfect equivalence of testing conditions between residents and LLMs. Although the same questions, answer options, and scoring key were used, residents were assessed under supervised and time-limited examination conditions, whereas LLMs were evaluated in a session-based zero-shot format. Therefore, the findings should be interpreted as a benchmark comparison under the study conditions rather than as a perfectly equivalent head-to-head examination. Three further limitations should also be considered. First, LLM outputs may vary across sessions, prompt phrasing, platform implementations, and model updates; therefore, the present findings should be interpreted as time-specific estimates rather than fixed model characteristics. Second, although responses were scored against a predefined answer key, LLMs remain susceptible to hallucinations and may generate plausible but incorrect outputs, which limits the reliability of isolated high scores. Third, the study was based on a relatively small set of structured multiple-choice questions and therefore captures only a limited portion of prosthodontic knowledge. This format does not adequately assess procedural skills, clinical reasoning in authentic patient contexts, communication, or patient-centered decision-making.
In addition, the 26-item questionnaire was administered to a single-institution sample of prosthodontic residents, which may restrict the generalizability of the findings. Although the examination provided a standardized written benchmark, it represents only a limited sample of domain knowledge and may not fully reflect broader postgraduate training environments. Therefore, these findings should be interpreted with caution and should not be directly extrapolated to broader educational or real-world clinical settings without further validation. Future studies should include larger, multi-institutional participant samples and incorporate case-based, scenario-based, and performance-based assessment methods to better evaluate the educational role and practical reliability of LLMs in prosthodontic training.
Such written assessments have inherent limitations in capturing key aspects of prosthodontic practice, including individualized technical skills, context-dependent clinical decision-making, and patient-centered care [25]. It should also be noted that the high performance of large language models in written examinations does not necessarily indicate stable or reproducible performance across repeated testing. Previous research has demonstrated that responses generated by these models may vary depending on timing or minor differences in input, representing an important limitation when interpreting their accuracy in educational settings [1, 2, 3, 4, 11, 19, 20].
Prosthodontic training is primarily experiential, requiring hands-on practice, ethical judgment, and clinical reasoning. However, artificial intelligence systems remain limited in supporting experiential learning and the development of clinical intuition. Therefore, human oversight and supervision must be ensured when AI-based applications are incorporated into postgraduate dental education.
Another practical concern is that language models may occasionally generate inaccurate or incomplete information in a fluent and confident manner. In educational settings, particularly for trainees still developing foundational knowledge, such fluent yet potentially inaccurate output may increase the risk of misinterpretation when it is accepted without critical appraisal or appropriate expert supervision [15, 16, 17, 18, 19, 20, 21, 22].
Prosthodontic residents and large language models (LLMs) demonstrated no statistically significant differences in their responses to basic knowledge and general dentistry questions. This finding suggests that both human participants and AI systems exhibit comparable performance in accessing core dental knowledge. Previous studies have reported that LLMs perform well on memory-based and standardized knowledge tasks, highlighting their potential to reinforce foundational concepts in postgraduate dental education [23].
Regarding the third research question, neither comparative nor correlational analyses revealed a statistically significant association between the duration of prosthodontic specialty training and the accuracy of the residents’ responses. This finding suggests that extended time spent in clinical training does not necessarily translate into improved performance on standardized theoretical examinations. Similar observations have been reported in the medical and dental education literature, indicating that progression through training is not always accompanied by higher scores on written assessments [26, 27]. This discrepancy may reflect the inherent limitations of multiple-choice question formats, which cannot fully capture experiential clinical learning and case-based reasoning.
This finding aligns with well-established educational frameworks, such as Miller’s pyramid of clinical competence, which suggests that written examinations primarily assess factual knowledge and applied understanding. In contrast, higher levels of clinical performance and real-world decision-making require direct observation within authentic clinical settings [28]. The lack of a significant association between residents’ training duration and test performance is consistent with the view that prosthodontic specialty education prioritizes the development of clinical reasoning and psychomotor skills over the mere accumulation of theoretical knowledge. Accordingly, multiple-choice examinations are largely confined to evaluating the lower tiers of Miller’s pyramid, namely, “knows” and “knows how”, and remain insufficient for fully capturing actual clinical performance, which corresponds to the “shows how” and “does” levels of the framework [28].
When responses to all questions were evaluated collectively, partial support for the null hypothesis was observed. Specifically, statistically significant differences were identified between prosthodontic residents and large language models in responses to advanced-level prosthodontic specialty questions, whereas no statistically significant differences were observed in responses to basic knowledge or general dentistry questions.
From an educational perspective, the current findings indicate that large language models cannot replace applied clinical training but may serve as supportive and complementary tools in prosthodontic specialty education. Recent studies in dental education have highlighted the potential of artificial intelligence to facilitate self-directed learning, support formative assessment, and improve access to up-to-date information. However, these studies also point to ongoing uncertainties regarding the integration of AI applications into existing curricula, the preparedness of academic staff for AI-assisted teaching, and the effective management of this integration process [29].
In addition, the performance of large language models is influenced by the way questions are phrased and presented. Even minor variations in wording may alter generated responses, thereby limiting direct comparability with human examinees and highlighting the need for carefully designed and standardized evaluation approaches when AI systems are assessed alongside human learners [20, 21, 22, 23, 24, 25, 26, 27].
The combined use of human expertise and artificial intelligence has been recognized as an opportunity to enhance professional performance, particularly by improving the accuracy of clinical diagnosis [30]. In parallel, educational theories emphasize that no single assessment method can fully capture clinical competence, underscoring the need to integrate written examinations with performance-based evaluations, professional experience, and patient-centered assessment approaches [31]. Accordingly, although large language models may be particularly valuable for reinforcing advanced theoretical knowledge, the core of specialist training remains grounded in clinical examination, patient-centered diagnosis, and treatment.
The findings of the present study are consistent with our previous work examining the performance of dental interns and artificial intelligence models in a dental specialty entrance examination, which demonstrated that although AI-based systems and dental interns achieved comparable accuracy on structured multiple-choice assessments, both exhibited clear limitations in clinically interpretable and context-dependent scenarios [20].
Taken together, these findings indicate that artificial intelligence should be viewed as a complement to experiential clinical training in prosthodontic specialty education rather than as a replacement. Accordingly, future prosthodontic curricula could benefit from treating artificial intelligence not as a rival but as an educational resource, and from training residents to formulate queries efficiently and to critically evaluate AI-generated information. Conceptualizing AI as a “co-pilot” may help reduce routine theoretical workload, thereby enabling residents to devote greater attention to complex clinical cases, advanced clinical decision-making, and effective patient communication.