A Comparative Cross-Sectional Study of Prosthodontic Residents and Large Language Models on Standardized Multiple-Choice Questions

Ates, Gül; Bulut, Ali Can

doi:10.3390/app16073296

by Gül Ates * and Ali Can Bulut

Department of Prosthodontics, Faculty of Dentistry, Yıldırım Beyazit University, 06800 Ankara, Turkey

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(7), 3296; https://doi.org/10.3390/app16073296

Submission received: 22 February 2026 / Revised: 26 March 2026 / Accepted: 27 March 2026 / Published: 29 March 2026

Abstract

Recent advances in artificial intelligence have expanded the use of large language models (LLMs) beyond speech-based applications and increased interest in their potential roles in dental education. However, evidence regarding LLM performance in postgraduate dental education, particularly in prosthodontics, remains limited. Therefore, this study aimed to compare the accuracy of responses from prosthodontic residents and LLMs to standardized multiple-choice questions in prosthodontics and to explore the potential role of artificial intelligence in prosthodontic education. Thirty-two prosthodontic residents participated in this cross-sectional study. Participants completed a standardized 30-item multiple-choice test comprising four demographic items and 26 questions assessing basic knowledge, general dentistry, and advanced prosthodontic specialty questions. The same questions were administered to seven large language models (LLMs): ChatGPT-4o, ChatGPT-o1, ChatGPT-o3-mini, Claude Sonnet 3.7, Gemini 2.5 Pro, Microsoft Copilot (web interface, accessed in August 2025), and DeepSeek V3. Response accuracy and consistency were evaluated. Statistical analyses were performed using IBM SPSS Statistics (version 27.0), with statistical significance set at p < 0.05. A statistically significant difference was observed between prosthodontic residents and LLMs in responses to advanced-level prosthodontic specialty questions (p < 0.05), with higher correct response rates recorded for LLMs. No statistically significant differences were identified between the two groups for basic knowledge and general dentistry questions (p > 0.05). In addition, no significant association was found between the duration of prosthodontic residency training and residents’ response accuracy (p > 0.05). LLMs achieved high scores on this structured MCQ-based assessment, particularly in advanced theoretical prosthodontic items. 
However, these findings should be interpreted with caution within the limits of a written examination format and do not represent overall clinical competence or real-world patient care performance. Accordingly, artificial intelligence may be considered a supportive educational tool in postgraduate prosthodontic education rather than a replacement for clinical training.

Keywords: artificial intelligence; large language models; prosthodontic education; postgraduate dental education

1. Introduction

Artificial intelligence (AI) is now widely used in healthcare, medical and dental education, and many aspects of everyday life [1,2]. Recent large language models (LLMs), such as ChatGPT, have expanded AI’s role beyond image-based diagnostics to include natural language processing, knowledge synthesis, and educational support [3,4]. As AI becomes more common in educational settings, it is increasingly important to reconsider how learning and assessment are approached in healthcare professions [5].

In dentistry, artificial intelligence has been increasingly applied to diagnostic imaging, treatment planning, and digital prosthetic workflows [6,7,8]. More recently, researchers have examined how ChatGPT may support dental education, particularly in undergraduate courses and assessments [9,10]. One study reported that although students found AI helpful and believed it improved productivity, traditional evidence-based learning still yielded better academic outcomes [10].

Evidence from medical and dental education suggests that LLMs may perform at a level comparable to, or in some cases exceed, that of students when assessed using structured formats, such as multiple-choice questions and progress tests [11,12,13,14,15,16,17].

Previous studies have shown that advanced AI models can achieve high levels of accuracy in major examinations. For instance, ChatGPT has been reported to have successfully passed a European certification examination in implant dentistry and, in some cases, to outperform licensed dentists, although performance varies depending on the model version used [11]. Nevertheless, important concerns remain regarding the reliability of AI-generated responses, as these systems may present incorrect information with high confidence, underscoring the ongoing need for human oversight in AI-assisted education [2].

The postgraduate education of a prosthodontist requires the integration of advanced theoretical knowledge, evidence-based thinking, and sound clinical judgement. As prosthodontic residents come from diverse educational and training backgrounds, variations in their knowledge levels may be observed. Such variation can be addressed through standardized assessments, and large language models may assist in this process by supporting more consistent responses to structured questions [3,14].

Interest in AI-enabled learning has increased; however, few studies have directly compared the performance of prosthodontic residents and large language models using the same assessment instruments. Most dental studies have focused on undergraduate education or evaluated a single AI-enabled tool, thereby limiting the generalizability of findings to postgraduate specialties [9,10]. Consequently, evidence remains limited regarding how large language models perform relative to learners at the postgraduate specialty-training level, particularly in prosthodontics. Such a comparison is educationally relevant because it may help contextualize the strengths and limitations of these systems within a discipline-specific postgraduate training environment and clarify how AI can be integrated into prosthodontic residency training. The novelty of the present study lies in benchmarking multiple contemporary large language models against postgraduate prosthodontic trainees across predefined knowledge areas within a single standardized assessment framework.

The aim of this study was to compare the knowledge of prosthodontic residents and large language models by evaluating their performance on multiple-choice questions in prosthodontics. In addition, this study aimed to investigate how artificial intelligence could be used as a supportive tool in postgraduate prosthodontic education.

Research Questions

This study was designed to address the following research questions:

(1) Is there a difference in response accuracy between prosthodontic residents and large language models when answering standardized multiple-choice questions related to prosthodontics?
(2) Do prosthodontic residents and large language models differ in their performance across different question categories, including basic knowledge, general dentistry, and advanced-level prosthodontic specialty questions?
(3) Is there an association between the duration of prosthodontic specialty training and the response accuracy of residents?

Null Hypothesis (H₀)

H₀. There is no statistically significant difference in response accuracy between prosthodontic residents and large language models when answering basic knowledge, general dentistry, and advanced-level prosthodontic specialty questions in a standardized multiple-choice exam.

2. Materials and Methods

Ethical approval for this study was obtained from the Çankırı Karatekin University Health Sciences Ethics Committee (Çankırı, Türkiye) (Meeting No: 17, dated 4 December 2024). All study procedures were conducted in accordance with the principles of the Declaration of Helsinki. Participation was voluntary, and written informed consent was obtained from all participants prior to data collection.

This study was designed as a cross-sectional comparative analysis to evaluate performance differences between prosthodontic residents and large language model-based systems using a standardized examination framework. Two comparison groups were included. The human group consisted of prosthodontic residents (N = 32) undergoing specialty training in the Departments of Prosthodontics at the faculties of dentistry. The AI group comprised seven large language model-based chatbots: ChatGPT-4o, ChatGPT-o1, ChatGPT-o3-mini, Claude Sonnet 3.7, Gemini 2.5 Pro, Microsoft Copilot (web interface, https://copilot.microsoft.com/, accessed in August 2025), and DeepSeek V3.

Inclusion criteria were active enrollment in prosthodontic residency training at the time of the study and voluntary participation in the examination. Exclusion criteria included non-resident status and incomplete participation in the examination. All participants had completed undergraduate dental education and were undergoing postgraduate specialty training in prosthodontics. The year of specialty training was recorded to assess whether performance differed according to training stage.

A 30-item multiple-choice test was administered to the participants. The test consisted of four demographic questions (age, gender, year of residency training, and university) and 26 knowledge-based questions related to prosthodontics.

The knowledge-based questions were divided into three content areas: general dentistry (10 questions), basic prosthodontics (8 questions), and advanced prosthodontic specialty knowledge (8 questions). The number of questions in each area was determined using domain sampling principles to ensure content validity. The general dentistry section was aligned with the emphasis on clinical knowledge commonly assessed in national postgraduate entrance examinations, where basic clinical competencies constitute a major focus. For the basic and advanced prosthodontic sections, the questions were designed to reflect the structure of thesis defenses and oral examinations used in postgraduate dental education in Türkiye [18]. Following thesis submission, prosthodontic residents are evaluated by a jury of three or four faculty members, each of whom usually poses multiple questions during the defense. Accordingly, including eight questions for each prosthodontic section was intended to reflect the level of challenge, examiner diversity, and depth of evaluation characteristic of this national assessment format. This approach was adopted to enhance the representativeness and content validity of the assessment across different levels of prosthodontic expertise [18,19,20].

The examination items were developed and content validated by three independent Associate Professors of Prosthodontics with extensive academic and clinical experience. Item development followed a structured, multi-expert protocol, consistent with established recommendations for generating validity evidence based on test content.

The questionnaire was designed primarily to assess theoretical knowledge in a structured single-best-answer format. Although some items were clinically oriented in content, the examination was not intended to simulate authentic patient-based clinical decision-making, procedural performance, or real-time treatment planning.

The development process consisted of three main stages. First, during individual item drafting, each expert independently generated an initial pool of questions based on accredited prosthodontic curricula and standard reference textbooks. Second, all draft items underwent peer review by the expert panel to evaluate scientific accuracy, clinical relevance, alignment with predefined content domains, and appropriateness of difficulty level. Finally, item selection was completed during a structured consensus meeting. Only items that achieved unanimous agreement (100%) among the experts regarding accuracy, clarity, and domain alignment were retained, resulting in a final examination comprising 26 knowledge-based questions [21].

All items were presented in a single-best-response format, allowing objective scoring and direct comparison between human participants and large language models (LLMs). The final 26-item test was administered to seven LLMs in August 2025 using a standardized zero-shot prompting procedure. To minimize contextual bias and carry-over effects, each question was manually entered as an independent query in a newly initiated chat session, so that no prior conversational context could influence subsequent responses. No examples, follow-up prompts, reformulations, or additional contextual guidance were provided, and each item was treated as a standalone input. AI-generated responses were recorded verbatim and scored dichotomously using a predefined answer key (1 = correct, 0 = incorrect), with no penalty for incorrect responses. This standardized approach was intended to ensure that model performance reflected baseline response behavior under minimal prompting conditions rather than effects related to prompt engineering or conversational memory. Because each model was queried once per item during the study period, the findings should be interpreted as a time-specific snapshot of performance under those access conditions.
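The dichotomous scoring rule described above can be illustrated with a minimal Python sketch. The answer key, the item-to-domain map, and the response values are hypothetical placeholders (the actual test items are not reproduced here), and the function name `score_responses` is introduced only for illustration. Normalizing answers to upper case is likewise an assumption rather than a documented step of the study.

```python
from collections import defaultdict

# Hypothetical toy key with 5 items. The real test had 10 general dentistry,
# 8 basic, and 8 advanced items; these letters are illustrative only.
DOMAINS = {1: "general", 2: "general", 3: "basic", 4: "advanced", 5: "advanced"}
ANSWER_KEY = {1: "A", 2: "C", 3: "B", 4: "E", 5: "D"}


def score_responses(responses, answer_key=ANSWER_KEY, domains=DOMAINS):
    """Score one respondent dichotomously: 1 = correct, 0 = incorrect.

    Missing and wrong answers both score 0, so there is no negative marking.
    Returns (total, per_domain), where per_domain maps domain -> score.
    """
    per_domain = defaultdict(int)
    for item, correct in answer_key.items():
        given = responses.get(item, "").strip().upper()
        per_domain[domains[item]] += int(given == correct)
    return sum(per_domain.values()), dict(per_domain)


# Hypothetical responses recorded from one model (item 2 is wrong).
model_answers = {1: "A", 2: "B", 3: "b", 4: "E", 5: "D"}
total, by_domain = score_responses(model_answers)
print(total, by_domain)
```

The same function applies unchanged to a resident's answer sheet, which mirrors the use of one key and one scoring rule for both groups.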

To maximize comparability, the same 26 questions, answer options, and scoring key were used for both residents and LLMs. Nevertheless, the testing environments were not identical: residents completed the assessment under supervised and time-limited examination conditions, whereas LLMs were queried under non-human, session-based interface conditions without fatigue, test anxiety, or conventional time pressure.

All prosthodontic residents completed the examination simultaneously in a supervised classroom setting with a 50-min time limit. This approach ensured consistent testing conditions, minimized the risk of information sharing among participants, and enhanced the comparability and reliability of the collected responses. Time-limited, group-based multiple-choice examinations have been recommended in medical and dental education to support reliable assessment and reduce measurement bias [21].

Statistical Analysis

The required sample size was determined a priori using G*Power version 3.1 (Heinrich Heine University, Düsseldorf, Germany). The analysis was conducted to assess the adequacy of the human participant sample size for this examination-based comparative study, assuming a two-tailed design, a significance level (α) of 0.05, a statistical power (1 − β) of 0.80, and a moderate effect size (Cohen’s d = 0.50). Based on these parameters, a minimum total sample size of 30 participants was required. Accordingly, the inclusion of 32 prosthodontic residents met the calculated power requirement [22].

To avoid pseudoreplication, the primary unit of analysis for the human group was defined as the individual resident rather than each item response. For each resident, total test score (0–26) and domain-specific scores were calculated for basic prosthodontic knowledge (0–8), advanced prosthodontic specialty knowledge (0–8), and general dentistry (0–10). Resident-level performance was summarized using the mean, standard deviation, median, interquartile range, and 95% confidence intervals.

Because each large language model generated a single set of responses, LLM performance was reported descriptively as the number and percentage of correct answers overall and by domain. Differences between the resident mean score and each LLM score were presented descriptively with 95% confidence intervals. Associations between residents’ scores and year of specialty training were assessed at the resident level. Normality was evaluated using the Shapiro–Wilk test, and correlations were examined using Kendall’s Tau-b. Statistical analyses were performed using IBM SPSS Statistics version 27.0, and p < 0.05 was considered statistically significant.
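For readers unfamiliar with Kendall's Tau-b, the following sketch shows how the coefficient is obtained from paired observations when ties are present, as is expected for a coarse variable such as year of training. It is a plain pairwise illustration and not the SPSS routine used in the study; the function name and the example values are hypothetical, and the significance test is omitted.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for paired observations, with tie correction.

    tau_b = (C - D) / sqrt((n0 - n1) * (n0 - n2)), where C and D count
    concordant and discordant pairs, n0 = n(n-1)/2, and n1 and n2 count
    the pairs tied on x and on y, respectively.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    n0 = n * (n - 1) // 2
    denom = sqrt((n0 - ties_x) * (n0 - ties_y))
    if denom == 0:
        raise ValueError("tau-b is undefined when one variable is constant")
    return (concordant - discordant) / denom


# Hypothetical data: year of training and total score for six residents.
years = [1, 1, 2, 2, 3, 4]
scores = [13, 15, 14, 17, 12, 16]
print(round(kendall_tau_b(years, scores), 3))
```

A coefficient near zero, as in the reported results, means that concordant and discordant resident pairs occur about equally often.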

3. Results

Thirty-two prosthodontic residents participated in the study. The demographic characteristics of the participants, including gender and year of specialty training, are presented in Table 1.

Overall, most participants were female. With respect to the year of specialty training, residents were distributed across different training stages, with the largest proportion in the second year of residency, whereas residents at the 2.5-year stage constituted the smallest group.

3.1. Resident-Level Performance

At the resident level, the mean total score was 14.50 ± 2.88 out of 26 (median: 15; IQR: 13–17; 95% CI: 13.46–15.54). Domain-specific mean scores were 4.84 ± 1.44 for basic prosthodontic knowledge, 2.91 ± 1.28 for advanced prosthodontic specialty knowledge, and 6.75 ± 1.46 for general dentistry (Table 2).
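As a plausibility check, the reported interval for the total score is reproduced by a conventional t-based confidence interval for a mean. The study does not state which interval method was applied, so the use of Student's t with n − 1 = 31 degrees of freedom (critical value approximately 2.0395) is an assumption:

```latex
\[
\bar{x} \pm t_{0.975,\,31}\,\frac{s}{\sqrt{n}}
  = 14.50 \pm 2.0395 \times \frac{2.88}{\sqrt{32}}
  \approx 14.50 \pm 1.04
  \;\Longrightarrow\; [13.46,\ 15.54]
\]
```

The three domain means are also internally consistent with the total: 4.84 + 2.91 + 6.75 = 14.50.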

3.2. LLM Benchmark Performance

LLM total scores ranged from 17/26 to 21/26. In the advanced prosthodontic specialty domain, all LLMs scored higher than the resident mean, with model scores ranging from 6/8 to 7/8 compared with a resident mean of 2.91/8. In contrast, differences were smaller in the basic prosthodontic knowledge and general dentistry domains. As each LLM generated a single set of responses, the results were interpreted descriptively as benchmark scores rather than as participant-level observations (Table 2).

3.3. Comparison of Response Accuracy Across Study Groups

The distribution of correct and incorrect responses across study groups is presented in Table 3.

A statistically significant association was observed between study groups and response accuracy for advanced prosthodontic specialty questions (χ² = 27.175, p < 0.001). The human group demonstrated a lower correct response rate (41.5%) than all LLMs. The highest accuracy rates (87.5%) were observed for ChatGPT-o3-mini, Claude Sonnet 3.7, and Microsoft Copilot, whereas ChatGPT-4o, ChatGPT-o1, Gemini 2.5 Pro, and DeepSeek V3 achieved correct response rates of 75%.

No statistically significant differences were observed between the human and artificial intelligence groups in responses to basic prosthodontic knowledge, general dentistry, or total questions (p > 0.05) (Table 3).

3.4. Effect Size Analysis

Effect size analysis using Cramér’s V is presented in Table 4.

A moderate effect size was observed for advanced prosthodontic specialty questions (V = 0.316, p < 0.001), indicating a meaningful association between study groups and response accuracy.

In contrast, small and non-significant effect sizes were observed for basic prosthodontic knowledge (V = 0.166, p = 0.328), general dentistry (V = 0.122, p = 0.582), and total scores (V = 0.109, p = 0.127), suggesting limited practical differences between study groups in these domains.
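For orientation, Cramér's V rescales the chi-square statistic of a group-by-outcome table to the range 0 to 1. The sketch below computes both quantities for a table of counts. It illustrates the standard formula only; the function names are introduced here, and the example counts are hypothetical and are not taken from Table 3 or Table 4.

```python
from math import sqrt


def chi_square(table):
    """Pearson chi-square statistic and grand total for an r x c count table.

    Assumes every row and column total is positive, so that all expected
    counts are non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, n


def cramers_v(table):
    """Cramér's V = sqrt(chi2 / (n * (min(r, c) - 1)))."""
    chi2, n = chi_square(table)
    k = min(len(table), len(table[0]))
    return sqrt(chi2 / (n * (k - 1)))


# Hypothetical counts: rows are groups, columns are (correct, incorrect).
example = [
    [100, 156],  # e.g. pooled resident responses
    [45, 11],    # e.g. pooled LLM responses
]
print(round(cramers_v(example), 3))
```

With two outcome columns, the denominator reduces to the total number of responses, so V depends only on the chi-square value and the number of responses analysed.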

3.5. Association with Year of Training

No statistically significant association was observed between year of specialty training and resident performance across any domain (Kendall’s Tau-b, all p > 0.05) (Table 5).

These findings indicate that progression in specialty training was not associated with improved performance in either basic or advanced prosthodontic knowledge domains.

The difference between the resident mean total score and LLM total scores ranged from −2.50 to −6.50 points. For the advanced prosthodontic specialty domain, the difference between the resident mean and LLM scores ranged from −3.09 to −4.09 points, with all 95% confidence intervals excluding zero.
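The arithmetic behind these differences can be sketched as follows. Each LLM score is treated as a fixed benchmark, so the only sampling uncertainty comes from the resident mean; this treatment, the t-based interval with 31 degrees of freedom, and the function name `benchmark_difference` are assumptions, since the study does not specify how the intervals were constructed.

```python
from math import sqrt

T_CRIT_DF31 = 2.0395  # two-sided 95% critical value of Student's t, df = 31


def benchmark_difference(mean, sd, n, benchmark, t_crit=T_CRIT_DF31):
    """Return (difference, lower, upper) for a sample mean minus a fixed score.

    The interval is the t-based confidence interval of the mean, shifted
    by the benchmark value.
    """
    margin = t_crit * sd / sqrt(n)
    diff = mean - benchmark
    return diff, diff - margin, diff + margin


# Reported resident summary for the advanced domain: 2.91 +/- 1.28, n = 32.
# LLM scores in this domain were either 6/8 or 7/8.
for llm_score in (6, 7):
    diff, lower, upper = benchmark_difference(2.91, 1.28, 32, llm_score)
    print(f"{llm_score}/8: {diff:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Under these assumptions, both upper bounds remain below zero, which agrees with the statement that the intervals exclude zero.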

4. Discussion

Prosthodontic training integrates practical, case-based clinical experience with structured theoretical instruction, thereby supporting the development of both clinical competence and foundational knowledge. In this study, the performance of prosthodontic residents and large language models (LLMs) was compared, and the potential influence of specialty training duration on response accuracy was examined.

With respect to the first research question, large language models (LLMs) demonstrated comparable performance to prosthodontic residents in basic knowledge and general dentistry, while demonstrating higher accuracy in advanced prosthodontic specialty questions. This finding is consistent with previous studies indicating that LLMs perform well on structured assessments that emphasize standardized knowledge and factual recall rather than higher-order clinical reasoning [23].

Regarding the second research question, a statistically significant difference was observed in responses to advanced-level prosthodontic specialty questions, with large language models achieving higher correct-response rates than prosthodontic residents. Advanced theoretical questions often require access to a broad and continuously expanding knowledge base, including current clinical guidelines and specialized literature. Because LLMs are trained on large-scale textual datasets, they are well-suited to retrieving and synthesizing such information efficiently. Similar studies in medical and dental education have reported that artificial intelligence performs well on theoretical examinations that require domain-specific expertise [23]. It is essential to clarify that such proficiency primarily reflects advanced informational retrieval capacity rather than the clinical expertise characteristic of a trained practitioner.

More recently, Salem et al. reported that ChatGPT-3.5 and ChatGPT-4o achieved high accuracy when answering International Team for Implantology (ITI) examination questions, thereby supporting the capacity of large language models to perform well in advanced, specialty-level theoretical assessments [24]. Rather than reflecting simple factual recall, this performance is likely related to LLMs’ ability to integrate information across broad, heterogeneous sources in the medical and dental literature. In contrast, prosthodontic residents typically develop their knowledge within structured curricula and predefined clinical rotations, whereas LLMs can rapidly synthesize current clinical guidelines and infrequently encountered case reports from multiple sources, often in near-real time [14,17,20,23].

Despite these findings, it is important to recognize that strong performance in multiple-choice examinations primarily reflects theoretical knowledge rather than overall clinical competence. Although LLMs demonstrated higher accuracy in advanced theoretical questions, this should not be interpreted as evidence of superior clinical competence, as real-world prosthodontic decision-making requires contextual judgment, hands-on skills, and patient-centered considerations that are not captured by MCQ-based assessments.

An additional limitation is the imperfect equivalence of testing conditions between residents and LLMs. Although the same questions, answer options, and scoring key were used, residents were assessed under supervised and time-limited examination conditions, whereas LLMs were evaluated in a session-based zero-shot format. Therefore, the findings should be interpreted as a benchmark comparison under the study conditions rather than a perfectly equivalent head-to-head examination setting. Three further limitations should also be considered. First, LLM outputs may vary across sessions, prompt phrasing, platform implementations, and model updates; therefore, the present findings should be interpreted as time-specific estimates rather than fixed model characteristics. Second, although responses were scored against a predefined answer key, LLMs remain susceptible to hallucinations and may generate plausible but incorrect outputs, which limits the reliability of isolated high scores. Third, the study was based on a relatively small set of structured multiple-choice questions and therefore captures only a limited portion of prosthodontic knowledge. This format does not adequately assess procedural skills, clinical reasoning in authentic patient contexts, communication, or patient-centered decision-making.

In addition, the study was based on a relatively limited 26-item questionnaire and a single-institution sample of prosthodontic residents, which may restrict the generalizability of the findings. Although the examination provided a standardized written benchmark, it represents only a limited sample of domain knowledge and may not fully reflect broader postgraduate training environments. Therefore, these findings should be interpreted with caution and should not be directly extrapolated to broader educational or real-world clinical settings without further validation. Future studies should include larger, multi-institutional participant samples and incorporate case-based, scenario-based, and performance-based assessment methods to better evaluate the educational role and practical reliability of LLMs in prosthodontic training.

Such written assessments have inherent limitations in capturing key aspects of prosthodontic practice, including individualized technical skills, context-dependent clinical decision-making, and patient-centered care [25]. It should also be noted that the high performance of large language models in written examinations does not necessarily indicate stable or reproducible performance across repeated testing. Previous research has demonstrated that responses generated by these models may vary depending on timing or minor differences in input, representing an important limitation when interpreting their accuracy in educational settings [1,2,3,4,11,19,20].

Prosthodontic training is primarily experiential, requiring hands-on practice, ethical judgment, and clinical reasoning. However, artificial intelligence systems remain limited in supporting experiential learning and the development of clinical intuition. Therefore, human oversight and supervision must be ensured when AI-based applications are incorporated into postgraduate dental education.

Another practical concern is that language models may occasionally generate inaccurate or incomplete information in a fluent and confident manner. In educational settings, particularly for trainees still developing foundational knowledge, this characteristic of fluent yet potentially inaccurate output may increase the risk of misinterpretation when AI-generated outputs are accepted without critical appraisal or appropriate expert supervision [15,16,17,18,19,20,21,22].

Prosthodontic residents and large language models (LLMs) demonstrated no statistically significant differences in their responses to basic knowledge and general dentistry questions. This finding suggests that both human participants and AI systems exhibit comparable performance in accessing core dental knowledge. Previous studies have reported that LLMs perform well on memory-based and standardized knowledge tasks, highlighting their potential to reinforce foundational concepts in postgraduate dental education [23].

Regarding the third research question, neither comparative nor correlational analyses revealed a statistically significant association between the duration of prosthodontic specialty training and the accuracy of the residents’ responses. This finding suggests that extended time spent in clinical training does not necessarily translate into improved performance on standardized theoretical examinations. Similar observations have been reported in the medical and dental education literature, indicating that progression through training is not always accompanied by higher scores on written assessments [26,27]. This discrepancy may reflect the inherent limitations of multiple-choice question formats, which cannot fully capture experiential clinical learning and case-based reasoning.

This finding aligns with well-established educational frameworks, such as Miller’s pyramid of clinical competence, which suggests that written examinations primarily assess factual knowledge and applied understanding. In contrast, higher levels of clinical performance and real-world decision-making require direct observation within authentic clinical settings [28]. The lack of a significant association between residents’ training duration and test performance is consistent with the view that prosthodontic specialty education prioritizes the development of clinical reasoning and psychomotor skills over the mere accumulation of theoretical knowledge. Accordingly, multiple-choice examinations are largely confined to evaluating the lower tiers of Miller’s pyramid, namely, “knows” and “knows how”, and remain insufficient for fully capturing actual clinical performance, which corresponds to the “shows how” and “does” levels of the framework [28].

When responses to all questions were evaluated collectively, partial support for the null hypothesis was observed. Specifically, statistically significant differences were identified between prosthodontic residents and large language models in responses to advanced-level prosthodontic specialty questions, whereas no statistically significant differences were observed in responses to basic knowledge or general dentistry questions.

From an educational perspective, the current findings indicate that large language models cannot replace applied clinical training but may serve as supportive and complementary tools in prosthodontic specialty education. Recent studies in dental education have highlighted the potential of artificial intelligence to facilitate self-directed learning, support formative assessment, and improve access to up-to-date information. However, these studies also point to ongoing uncertainties regarding the integration of AI applications into existing curricula, the preparedness of academic staff for AI-assisted teaching, and the effective management of this integration process [29].

In addition, the performance of large language models is influenced by the way questions are phrased and presented. Even minor variations in wording may alter generated responses, thereby limiting direct comparability with human examinees and highlighting the need for carefully designed and standardized evaluation approaches when AI systems are assessed alongside human learners [20,21,22,23,24,25,26,27].

The combined use of human expertise and artificial intelligence has been recognized as an opportunity to enhance professional performance, particularly by improving the accuracy of clinical diagnosis [30]. In parallel, educational theories emphasize that no single assessment method can fully capture clinical competence, underscoring the need to integrate written examinations with performance-based evaluations, professional experience, and patient-centered assessment approaches [31]. Accordingly, although large language models may be particularly valuable for reinforcing advanced theoretical knowledge, the core of specialist training remains grounded in clinical examination, patient-centered diagnosis, and treatment.

The findings of the present study are consistent with our previous work examining the performance of dental interns and artificial intelligence models in a dental specialty entrance examination, which demonstrated that although AI-based systems and dental interns achieved comparable accuracy on structured multiple-choice assessments, both exhibited clear limitations in clinically interpretable and context-dependent scenarios [20].

Taken together, these findings indicate that artificial intelligence should be viewed as a complement to experiential clinical training in prosthodontic specialty education rather than as a replacement. Accordingly, future prosthodontic curricula could benefit from treating artificial intelligence not as a rival force but as an educational resource, and from training residents to formulate queries efficiently and to critically evaluate AI-generated information. Conceptualizing AI as a “co-pilot” may help reduce routine theoretical workload, thereby enabling residents to devote greater attention to complex clinical cases, advanced clinical decision-making, and effective patient communication.

5. Conclusions

The null hypothesis was rejected for the advanced prosthodontic specialty domain, in which the LLM benchmark scores exceeded the mean resident score under the study conditions. However, performance differences were not consistent across all domains and should be interpreted within the limitations of a structured MCQ-based assessment. LLMs may serve as complementary educational tools in postgraduate prosthodontic training; however, they should not be considered substitutes for supervised clinical education or as direct indicators of clinical competence.

Author Contributions

Conceptualization and methodology: G.A. and A.C.B.; Investigation and data acquisition: A.C.B.; Formal analysis and interpretation of data: G.A. and A.C.B.; Writing—original draft preparation: G.A.; Writing—review and editing: G.A. and A.C.B.; Supervision: G.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This study was conducted in accordance with the principles of the Declaration of Helsinki and was approved by the Çankırı Karatekin University Health Sciences Ethics Committee (Meeting No: 17, Approval Date: 4 December 2024).

Informed Consent Statement

Informed consent was obtained from all participants involved in the study.

Data Availability Statement

The data supporting the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

AI	Artificial Intelligence
LLMs	Large Language Models

References

1. Helm, J.M.; Swiergosz, A.M.; Haeberle, H.S.; Karnuta, J.M.; Schaffer, J.L.; Krebs, V.E.; Spitzer, A.I.; Ramkumar, P.N. Machine learning and artificial intelligence: Definitions, applications, and future directions. Curr. Rev. Musculoskelet. Med. 2020, 13, 69–76.
2. Sallam, M. ChatGPT utility in healthcare education, research, and practice: Systematic review on the promising perspectives and valid concerns. Healthcare 2023, 11, 887.
3. Schwendicke, F.; Samek, W.; Krois, J. Artificial intelligence in dentistry: Chances and challenges. J. Dent. Res. 2020, 99, 769–774.
4. Eggmann, F.; Weiger, R.; Zitzmann, N.U.; Blatz, M.B. Implications of large language models such as ChatGPT for dental medicine. J. Esthet. Restor. Dent. 2023, 35, 1098–1102.
5. Zeng, Z.; Xu, J. AI demands a different approach to education. Nature 2025, 639, 577.
6. Nguyen, T.T.; Larrivée, N.; Lee, A.; Bilaniuk, O.; Durand, R. Use of artificial intelligence in dentistry: Current clinical trends and research advances. J. Can. Dent. Assoc. 2021, 87, l7.
7. Carrillo-Pérez, F.; Pecho, O.E.; Morales, J.C.; Paravina, R.D.; Della Bona, A.; Ghinea, R.; Pulgar, R.; Pérez, M.D.M.; Herrera, L.J. Applications of artificial intelligence in dentistry: A comprehensive review. J. Esthet. Restor. Dent. 2022, 34, 259–280.
8. Thurzo, A.; Strunga, M.; Urban, R.; Surovková, J.; Afrashtehfar, K.I. Impact of artificial intelligence on dental education: A review and guide for curriculum update. Educ. Sci. 2023, 13, 150.
9. Shete, A.; Shete, M.; Chavan, M.; Channe, P.; Sapkal, R.; Buva, K. Evaluation of ChatGPT as a new assessment tool in dental education. J. Indian Acad. Oral Med. Radiol. 2024, 36, 259–263.
10. Saravia-Rojas, M.Á.; Camarena-Fonseca, A.R.; León-Manco, R.; Geng-Vivanco, R. Artificial intelligence: ChatGPT as a disruptive didactic strategy in dental education. J. Dent. Educ. 2024, 88, 872–876.
11. Revilla-León, M.; Barmak, B.A.; Sailer, I.; Kois, J.C.; Att, W. Performance of an artificial intelligence-based chatbot (ChatGPT) answering the European Certification in Implant Dentistry exam. Int. J. Prosthodont. 2024, 37, 221–224.
12. Friederichs, H.; Friederichs, W.J.; März, M. ChatGPT in medical school: How successful is AI in progress testing? Med. Educ. Online 2023, 28, 2220920.
13. Kung, T.H.; Cheatham, M.; Medenilla, A.; Sillos, C.; De Leon, L.; Elepaño, C.; Madriaga, M.; Aggabao, R.; Diaz-Candido, G.; Maningo, J.; et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLoS Digit. Health 2023, 2, e0000198.
14. Schwendicke, F.; Chaurasia, A.; Wiegand, T.; Uribe, S.E.; Fontana, M.; Akota, I.; Tryfonos, O.; Krois, J.; IADR E-Oral Health Network and the ITU/WHO Focus Group AI for Health. Artificial intelligence for oral and dental healthcare: Core education curriculum. J. Dent. 2023, 128, 104363.
15. Gilson, A.; Safranek, C.W.; Huang, T.; Socrates, V.; Chi, L.; Taylor, R.A.; Chartash, D. How does ChatGPT perform on the United States Medical Licensing Examination (USMLE)? The implications of large language models for medical education and knowledge assessment. JMIR Med. Educ. 2023, 9, e45312.
16. Huh, S. Are ChatGPT’s knowledge and interpretation ability comparable to those of medical students in Korea for taking a parasitology examination?: A descriptive study. J. Educ. Eval. Health Prof. 2023, 20, 1.
17. Kasneci, E.; Sessler, K.; Küchemann, S.; Bannert, M.; Dementieva, D.; Fischer, F.; Gasser, U.; Groh, G.; Günnemann, S.; Hüllermeier, E. ChatGPT for good? on opportunities and challenges of large language models for education. Learn. Individ. Differ. 2023, 103, 102274.
18. Türkiye Cumhuriyeti Resmî Gazete. Tıpta ve diş Hekimliğinde Uzmanlık Eğitimi Yönetmeliği. Resmî Gazete Tarihi: 03.09.2022; Sayı: 31942; Madde 19/5. Available online: https://www.mevzuat.gov.tr/mevzuat?MevzuatNo=39700&MevzuatTur=7&MevzuatTertip=5 (accessed on 20 December 2025).
19. Yalçın, O.; Topal, C.G. Diş hekimliği uzmanlık eğitimi giriş sınavında sorulan ağız, diş ve çene radyolojisi sorularının retrospektif analizi. J. Int. Dent. Sci. 2025, 11, 193–201.
20. Bulut, A.C.; Bahadır, H.S.; Ateş, G. Artificial intelligence in dental education: Can AI-based chatbots compete with general practitioners? BMC Med. Educ. 2025, 25, 1319.
21. Carrillo-Ávalos, B.A.; Leenen, I.; Trejo-Mejía, J.A.; Sánchez-Mendiola, M. Bridging validity frameworks in assessment: Beyond traditional approaches in health professions education. Teach. Learn. Med. 2025, 37, 229–238.
22. Yaş, S.; Ahmadov, A.; Baymurat, A.; Tokgöz, M.; Yaş, S.; Odluyurt, M.; Tolunay, T. ChatGPT vs. orthopedic residents! who is the winner? Gazi Med. J. 2024, 35, 186–191.
23. Rudolph, J.; Tan, S.; Tan, S. ChatGPT: Bullshit spewer or the end of traditional assessments in higher education? J. Appl. Learn. Teach. 2023, 6, 342–363.
24. Salem, M.; Karasan, D.; Revilla-León, M.; Barmak, A.B.; Sailer, I. Performance of artificial intelligence-based chatbots (ChatGPT-3.5 and ChatGPT-4.0) answering the International Team of Implantology exam questions. J. Esthet. Restor. Dent. 2025, 37, 2412–2416.
25. Schoonheim-Klein, M.E.; van Selms, M.K.; Volgenant, C.M.; Wiegman, H.P.; Vervoorn, J.M. Het beoordelen van de klinische competenties van studenten tandheelkunde [Assessing the clinical competence of dental students]. Ned. Tijdschr. Tandheelkd. 2012, 119, 328–336.
26. Artino, A.R., Jr.; Gilliland, W.R.; Waechter, D.M.; Cruess, D.; Calloway, M.; Durning, S.J. Does self-reported clinical experience predict performance in medical school and internship? Med. Educ. 2012, 46, 172–178.
27. Henzi, D.; Davis, E.; Jasinevicius, R.; Hendricson, W.; Cintron, L.; Isaacs, M. Appraisal of the dental school learning environment: The students’ view. J. Dent. Educ. 2005, 69, 1137–1147.
28. Miller, G.E. The assessment of clinical skills/competence/performance. Acad. Med. 1990, 65, S63–S67.
29. El-Hakim, M.; Anthonappa, R.; Fawzy, A. Artificial intelligence in dental education: A scoping review of applications, challenges, and gaps. Dent. J. 2025, 13, 384.
30. Topol, E.J. High-performance medicine: The convergence of human and artificial intelligence. Nat. Med. 2019, 25, 44–56.
31. Epstein, R.M. Assessment in medical education. N. Engl. J. Med. 2007, 356, 387–396.

Table 1. Demographic characteristics of the prosthodontic residents (n = 32).

Variable	n	%
Gender
Female	20	62.5
Male	12	37.5
Specialty Training Year
0.5 year	7	21.9
1 year	7	21.9
2 years	10	31.3
2.5 years	1	3.1
3 years	7	21.9

Table 2. Resident-level performance and LLM benchmark scores across domains.

Domain	Residents Mean ± SD	Median (IQR)	95% CI	ChatGPT-4o	ChatGPT-o1	ChatGPT-o3-mini	Claude Sonnet 3.7	Gemini	Microsoft Copilot	DeepSeek
Basic prosthodontic knowledge (0–8)	4.84 ± 1.44	5 (4–6)	4.32–5.36	6/8 (75.0%)	8/8 (100%)	6/8 (75.0%)	7/8 (87.5%)	6/8 (75.0%)	7/8 (87.5%)	6/8 (75.0%)
Advanced prosthodontic specialty knowledge (0–8)	2.91 ± 1.28	3 (2–4)	2.45–3.37	6/8 (75.0%)	6/8 (75.0%)	7/8 (87.5%)	7/8 (87.5%)	6/8 (75.0%)	7/8 (87.5%)	6/8 (75.0%)
General dentistry (0–10)	6.75 ± 1.46	6.5 (5.75–8.00)	6.22–7.28	5/10 (50.0%)	6/10 (60.0%)	5/10 (50.0%)	7/10 (70.0%)	5/10 (50.0%)	6/10 (60.0%)	6/10 (60.0%)
Total (0–26)	14.50 ± 2.88	15 (13–17)	13.46–15.54	17/26 (65.4%)	20/26 (76.9%)	18/26 (69.2%)	21/26 (80.8%)	17/26 (65.4%)	20/26 (76.9%)	18/26 (69.2%)

Abbreviations: SD, standard deviation; IQR, interquartile range; CI, confidence interval; LLM, large language model.
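
The 95% CI column in Table 2 is consistent with a conventional t-based interval around the resident mean (mean ± t × SD/√n). The following is a minimal sketch under stated assumptions, not a description of the authors' SPSS procedure: it assumes n = 32 for every domain and hardcodes the critical value t(0.975, df = 31) ≈ 2.0395 so that only the standard library is needed. The function and variable names are illustrative.

```python
import math

# Assumption: two-sided 95% interval with df = n - 1 = 31.
# The critical value is hardcoded to avoid a SciPy dependency.
T_CRIT_DF31 = 2.0395
N_RESIDENTS = 32


def mean_ci(mean, sd, n, t_crit):
    """Return the (lower, upper) t-based confidence interval for a mean."""
    half_width = t_crit * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


# Resident mean and SD per domain, copied from Table 2.
table2 = {
    "Basic prosthodontic knowledge": (4.84, 1.44),
    "Advanced prosthodontic specialty knowledge": (2.91, 1.28),
    "General dentistry": (6.75, 1.46),
    "Total": (14.50, 2.88),
}

for domain, (mean, sd) in table2.items():
    lower, upper = mean_ci(mean, sd, N_RESIDENTS, T_CRIT_DF31)
    print(f"{domain}: {lower:.2f} to {upper:.2f}")
```

Rounded to two decimals, the intervals produced this way agree with the four intervals listed in Table 2.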

Table 3. Distribution of correct and incorrect responses across study groups and the association between them.

Question Category	Group	Incorrect n (%)	Column %	Correct n (%)	Column %	Test Statistic	p Value
Basic knowledge	Human	82 (34.6)	89.1	155 (65.4)	77.1	7.346	0.365
	ChatGPT-4o	2 (25.0)	2.2	6 (75.0)	3.0
	ChatGPT-o1	0 (0.0)	0.0	8 (100.0)	4.0
	ChatGPT-o3-mini	2 (25.0)	2.2	6 (75.0)	3.0
	Claude Sonnet 3.7	1 (12.5)	1.1	7 (87.5)	3.5
	Gemini	2 (25.0)	2.2	6 (75.0)	3.0
	Microsoft Copilot	1 (12.5)	1.1	7 (87.5)	3.5
	DeepSeek	2 (25.0)	2.2	6 (75.0)	3.0
Advanced prosthodontic specialty	Human	131 ^a (58.5)	92.3	93 ^b (41.5)	67.4	27.175	<0.001 *
	ChatGPT-4o	2 ^a (25.0)	1.4	6 ^a (75.0)	4.3
	ChatGPT-o1	2 ^a (25.0)	1.4	6 ^a (75.0)	4.3
	ChatGPT-o3-mini	1 ^a (12.5)	0.7	7 ^b (87.5)	5.1
	Claude Sonnet 3.7	1 ^a (12.5)	0.7	7 ^b (87.5)	5.1
	Gemini	2 ^a (25.0)	1.4	6 ^a (75.0)	4.3
	Microsoft Copilot	1 ^a (12.5)	0.7	7 ^b (87.5)	5.1
	DeepSeek	2 ^a (25.0)	1.4	6 ^a (75.0)	4.3
General dentistry	Human	94 (30.3)	75.8	216 (69.7)	84.4	6.302	0.501
	ChatGPT-4o	5 (50.0)	4.0	5 (50.0)	2.0
	ChatGPT-o1	4 (40.0)	3.2	6 (60.0)	2.3
	ChatGPT-o3-mini	5 (50.0)	4.0	5 (50.0)	2.0
	Claude Sonnet 3.7	3 (30.0)	2.4	7 (70.0)	2.7
	Gemini	5 (50.0)	4.0	5 (50.0)	2.0
	Microsoft Copilot	4 (40.0)	3.2	6 (60.0)	2.3
	DeepSeek	4 (40.0)	3.2	6 (60.0)	2.3
Total	Human	307 (39.8)	85.8	464 (60.2)	78.0	11.266	0.127
	ChatGPT-4o	9 (34.6)	2.5	17 (65.4)	2.9
	ChatGPT-o1	6 (23.1)	1.7	20 (76.9)	3.4
	ChatGPT-o3-mini	8 (30.8)	2.2	18 (69.2)	3.0
	Claude Sonnet 3.7	5 (19.2)	1.4	21 (80.8)	3.5
	Gemini	9 (34.6)	2.5	17 (65.4)	2.9
	Microsoft Copilot	6 (23.1)	1.7	20 (76.9)	3.4
	DeepSeek	8 (30.8)	2.2	18 (69.2)	3.0

* p < 0.05. Column % indicates column-based percentage distribution. Different superscript letters indicate statistically significant differences between column proportions.
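
As an illustration of how a test statistic of the kind reported in Table 3 can be obtained from the response counts, the sketch below computes a Pearson chi-square statistic for the 8 × 2 table formed by the "Total" rows. It is a sketch under assumptions: the section does not state which test variant SPSS applied, and the check here covers only the Total rows, so it is not claimed to reproduce the statistics for the individual question categories. The function name is illustrative.

```python
def pearson_chi_square(table):
    """Pearson chi-square statistic for a table given as rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for row, row_total in zip(table, row_totals):
        for observed, col_total in zip(row, col_totals):
            # Expected count under independence of group and correctness.
            expected = row_total * col_total / grand_total
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# "Total" rows of Table 3 as [incorrect, correct] counts per group.
total_rows = [
    [307, 464],  # Human
    [9, 17],     # ChatGPT-4o
    [6, 20],     # ChatGPT-o1
    [8, 18],     # ChatGPT-o3-mini
    [5, 21],     # Claude Sonnet 3.7
    [9, 17],     # Gemini
    [6, 20],     # Microsoft Copilot
    [8, 18],     # DeepSeek
]

chi2_total = pearson_chi_square(total_rows)
print(f"Pearson chi-square (Total rows): {chi2_total:.3f}")
```

For the Total rows the result is close to the reported value of 11.266. Note that several LLM cells have small expected counts, which is one reason an exact or corrected test may be preferred in practice.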

Table 4. Effect size analysis of the association between study groups and response accuracy across question categories (Cramér’s V).

Question Category	Cramér’s V	p Value	95% CI (Lower)	95% CI (Upper)
Basic knowledge	0.166	0.328	0.000	0.227
Advanced prosthodontic specialty	0.316	<0.001 *	0.150	0.403
General dentistry	0.122	0.582	0.000	0.166
Total	0.109	0.127	0.000	0.147

* p < 0.05.
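
For reference, Cramér's V is commonly defined as V = √(χ² / (N × min(r − 1, c − 1))), where N is the number of observations and r and c are the numbers of rows and columns. The sketch below applies that textbook formula to the Total row, taking χ² = 11.266 from Table 3 and N from the Table 3 response counts. It is an assumption that the tabulated values were derived this way; the check is limited to the Total row and is not claimed to reproduce the other rows of Table 4.

```python
import math


def cramers_v(chi2, n_obs, n_rows, n_cols):
    """Cramér's V from a chi-square statistic and the table dimensions."""
    return math.sqrt(chi2 / (n_obs * min(n_rows - 1, n_cols - 1)))


# N for the Total row: resident responses (307 incorrect + 464 correct)
# plus 7 LLMs x 26 questions, per the counts in Table 3.
n_total = (307 + 464) + 7 * 26

# 8 groups (residents + 7 LLMs) by 2 outcomes (incorrect, correct).
v_total = cramers_v(11.266, n_total, n_rows=8, n_cols=2)
print(f"Cramér's V (Total): {v_total:.3f}")
```

With these inputs the formula gives approximately 0.109, matching the Total row of Table 4.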

Table 5. Association between year of specialty training and resident performance (Kendall’s Tau-b analysis).

Outcome	Kendall’s Tau-b	p Value
Basic prosthodontic knowledge score	0.059	0.690
Advanced prosthodontic specialty score	0.129	0.383
General dentistry score	0.126	0.389
Total score	0.124	0.379
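
Kendall's tau-b, used in Table 5, is a rank correlation that adjusts for ties, which matters here because many residents share the same training year. The sketch below is a plain pairwise implementation of the standard tau-b formula. The resident-level data are not published in the article, so the `years` and `scores` lists are hypothetical values invented purely to show the call; they do not correspond to any resident in the study.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length numeric sequences."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both variables: contributes to neither term
        elif dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    # Product of (pairs untied in y) and (pairs untied in x).
    denom = math.sqrt(
        (concordant + discordant + ties_x_only)
        * (concordant + discordant + ties_y_only)
    )
    return (concordant - discordant) / denom if denom else float("nan")


# HYPOTHETICAL data for illustration only (not from the study).
years = [0.5, 1, 1, 2, 2, 3]
scores = [12, 15, 13, 14, 16, 16]
print(f"tau-b (hypothetical data): {kendall_tau_b(years, scores):.3f}")
```

A value near zero, as reported for all four outcomes in Table 5, indicates little monotonic association between training year and score.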

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

