Assessing the Efficacy of Artificial Intelligence Platforms in Answering Dental Caries Multiple-Choice Questions: A Comparative Study of ChatGPT and Google Gemini Language Models
- Amr Ahmed Azhari, Walaa Magdy Ahmed, Chang-Tien Lu, and 3 other authors
Objective: This study aimed to compare the accuracy of two large language models (LLMs), ChatGPT (version 3.5) and Google Gemini (formerly Bard), in answering dental caries-related multiple-choice questions (MCQs) within a simulated student examination framework across seven examination lengths.

Materials and Methods: A total of 125 validated dental caries MCQs were extracted from the Dental Decks and Oxford University Press question banks. Seven examination groups were constructed with varying question counts (25, 35, 45, 55, 65, 75, and 85 questions). For each group, 100 simulations were generated per LLM (ChatGPT and Gemini), yielding 1400 simulated examinations in total. Each simulated student received a unique randomized subset of questions, and each LLM answered the MCQs using a standardized prompt to minimize ambiguity. Outcomes included mean score, passing rate (≥60%), and performance differences between the LLMs. Statistical analyses comprised independent t-tests, one-way ANOVA within each LLM, and two-way ANOVA examining the interaction between LLM type and question count.

Results: Across all seven examination formats, Gemini significantly outperformed ChatGPT (p < 0.001), achieving higher mean scores and higher passing rates at every examination length. One-way ANOVA revealed significant score variation with increasing examination length for both LLMs (p < 0.05). Two-way ANOVA demonstrated significant main effects of LLM type and question count, with no significant interaction. Randomization had no measurable effect on Gemini's performance but did influence ChatGPT's scores.

Conclusions: Gemini demonstrated superior accuracy and higher passing rates than ChatGPT in all simulated examination formats. Although both LLMs struggled with complex caries-related content, Gemini performed more reliably across examination lengths. Educators should exercise caution when relying on LLMs for automated assessment or self-study, and future research should evaluate human–AI hybrid models and LLM performance across broader dental domains.
27 January 2026
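
As an illustration only, the following Python sketch outlines the simulated examination framework the abstract describes (a 125-question bank, seven examination lengths, 100 randomized question subsets per length, and a 60% pass mark). The function name, the per-question correctness mapping, and the example answer pattern are hypothetical placeholders, not the authors' actual implementation or data.

```python
import random
import statistics

QUESTION_BANK_SIZE = 125          # validated dental caries MCQs
EXAM_LENGTHS = [25, 35, 45, 55, 65, 75, 85]
SIMULATIONS_PER_GROUP = 100       # simulated "students" per exam length per LLM
PASS_THRESHOLD = 0.60             # passing score (>= 60%)

def simulate_exams(answer_correct, seed=0):
    """Run the simulation grid for one LLM.

    answer_correct: dict mapping question index -> True/False, i.e. whether the
    model answered that bank question correctly (assumed fixed per model).
    Returns {exam_length: (mean_score, passing_rate)}.
    """
    rng = random.Random(seed)
    results = {}
    for n_questions in EXAM_LENGTHS:
        scores = []
        for _ in range(SIMULATIONS_PER_GROUP):
            # Each simulated student receives a unique randomized subset.
            subset = rng.sample(range(QUESTION_BANK_SIZE), n_questions)
            score = sum(answer_correct[q] for q in subset) / n_questions
            scores.append(score)
        passing_rate = sum(s >= PASS_THRESHOLD for s in scores) / len(scores)
        results[n_questions] = (statistics.mean(scores), passing_rate)
    return results

# Example with a made-up answer pattern (80% of the bank answered correctly):
if __name__ == "__main__":
    fake_answers = {q: (q % 5 != 0) for q in range(QUESTION_BANK_SIZE)}
    for length, (mean_score, pass_rate) in simulate_exams(fake_answers).items():
        print(f"{length} questions: mean={mean_score:.2f}, pass rate={pass_rate:.2f}")
```

Running the grid once per LLM (7 lengths × 100 subsets × 2 models) reproduces the 1400 simulated examinations reported in the abstract; the subsequent t-tests and ANOVA would then be applied to the per-simulation scores.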




